Tuesday, September 28, 2010

Analyzing big data

I've tried for some time now to find the best way to analyze big data. I've used different map-reduce systems, NoSQL databases and even a custom solution based on independent files for each user's data, without much success so far. The custom solution is limited by the sheer number of files: no operating system today can handle that many efficiently, even with the files grouped into folders by user ID. Other systems like Cassandra or MongoDB seem to need a very large number of nodes in a cluster and are more appropriate for serving real-time results. What I need is an offline solution able to process huge amounts of information.
I was thrilled to attend a great presentation at Web 2.0 Expo in New York, where Kevin Weil from Twitter talked about the way Twitter handles big data. Twitter's analysis is pretty much my use case, at a smaller scale.
The challenges are: collecting the data, large-scale storage and analysis, and rapid learning over big data.
Twitter generates about 12 terabytes of new data per day, which adds up to roughly 4 petabytes a year (12 TB × 365 ≈ 4.4 PB).
They used syslog-ng at first, but it didn't scale. Facebook had the same problem, so they developed and open-sourced Scribe, a log collection framework. Twitter and Facebook are now working together to improve it.
Twitter is using HDFS (the Hadoop Distributed File System) to hold the data. HDFS is a distributed, fault-tolerant file system that works very well over a cluster of commodity servers. It handles replication automatically and reads and writes transparently across multiple machines; by default, each block is stored as three copies in the cluster.
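To get a feel for what this looks like from an application's point of view, here is a minimal Java sketch using Hadoop's FileSystem API. The file path and record format are made-up examples, and the NameNode address is assumed to come from the cluster's core-site.xml on the classpath.

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        // Picks up fs.default.name (e.g. hdfs://namenode:8020) from core-site.xml.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Write a file; HDFS splits it into blocks and replicates each block
        // (three copies by default) across the cluster.
        Path path = new Path("/logs/2010-09-28/tweets.log"); // hypothetical path
        FSDataOutputStream out = fs.create(path);
        out.writeBytes("12345\tsome tweet text\n");
        out.close();

        // Read it back; the client never needs to know which machines hold the blocks.
        FSDataInputStream in = fs.open(path);
        BufferedReader reader = new BufferedReader(new InputStreamReader(in));
        System.out.println(reader.readLine());
        reader.close();
    }
}

The nice part is that the client code stays oblivious to block placement; the NameNode resolves that on every read and write.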
Yahoo is also using HDFS, on a 4,000-node cluster able to sort one terabyte of random integers in about 62 seconds.
HDFS is now very easy to install; RPMs are available from Cloudera.
Here is a typical scenario of Hadoop MapReduce usage over HDFS at Twitter: to count the tweets per user from a day's worth of key-value pairs (user ID and tweet info), the map outputs the user ID as key with a value of 1, the shuffle sorts by user ID, and the reduce sums the values for each user, then outputs the user ID and the tweet count. Of course, doubling the machines practically doubles the speed of this query.
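For illustration, here is a minimal sketch of that job written against Hadoop's Java MapReduce API (the org.apache.hadoop.mapreduce classes). The tab-separated input layout and the command-line paths are my assumptions, not Twitter's actual format.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TweetCounts {

    // Map: each input line is assumed to be "userId<TAB>tweet info"; emit (userId, 1).
    public static class TweetMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        private final Text userId = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("\t", 2);
            userId.set(fields[0]);
            context.write(userId, ONE);
        }
    }

    // Reduce: the framework has already grouped and sorted by user ID; just sum the ones.
    public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text userId, Iterable<LongWritable> counts, Context context)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable c : counts) {
                sum += c.get();
            }
            context.write(userId, new LongWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "tweet counts per user");
        job.setJarByClass(TweetCounts.class);
        job.setMapperClass(TweetMapper.class);
        job.setCombinerClass(SumReducer.class); // partial sums on the map side
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The combiner is optional, but doing the per-user sums on the map side cuts down the data moved during the shuffle.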
Currently Twitter counts about 160 million users and 20 billion tweets; that is a lot of data.
Because MapReduce jobs are somewhat complex to write in a native implementation like Java, they use a high-level language built on top of Hadoop called Pig, which was developed at Yahoo and is open source.
Pig allows a considerable reduction of code: a Pig script is roughly 5% the size of the equivalent Java, which of course also means roughly 5% of the development time. There is still a performance penalty, but it is not major: about 20%, and going down with each release (the first implementation was around 110%).
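As a rough illustration of that reduction, here is a sketch of the same per-user tweet count expressed in Pig Latin, run through Pig's embedded PigServer Java API; the input path and field layout are again my own assumptions.

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class TweetCountsPig {
    public static void main(String[] args) throws Exception {
        // PigServer compiles the Pig Latin below into Hadoop MapReduce jobs.
        PigServer pig = new PigServer(ExecType.MAPREDUCE);

        // Assumed input: tab-separated lines of user ID and tweet text.
        pig.registerQuery("tweets = LOAD '/logs/tweets' AS (user_id:chararray, text:chararray);");
        pig.registerQuery("grouped = GROUP tweets BY user_id;");
        pig.registerQuery("counts = FOREACH grouped GENERATE group AS user_id, COUNT(tweets) AS n;");

        // Write (user_id, count) pairs back to HDFS; the whole Java job above in four lines of Pig Latin.
        pig.store("counts", "/output/tweet_counts");
    }
}

The three registerQuery strings plus the store call are the entire query; the same Pig Latin could also be run directly from Pig's Grunt shell.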
All Twitter data is currently stored on a cluster of about 100 nodes, and they plan to grow to 3,000 or 4,000 nodes. Pretty amazing; that is not a lot of servers for all those tweets.
I will definitely try Hadoop and Pig and get back with some metrics.
Here is a list of queries Twitter runs on its data:

  • Requests per day
  • Average latency
  • Response distribution per hour
  • Twitter searches per day
  • Unique users
  • Links per day per domain
  • Geographic distribution
  • Usage difference for mobile users
  • What features get users hooked
  • What can we tell from a user's tweets
  • Language detection
  • Duplicates detection, anti-spam

It seems that Amazon also offers a cloud-based hosted Hadoop solution (Elastic MapReduce). Something to consider as well.
