Google processes over 20 petabytes of data per day

Google currently processes over 20 petabytes of data per day through an average of 100,000 MapReduce jobs spread across its massive computing clusters. The average MapReduce job ran across approximately 400 machines in September 2007, crunching approximately 11,000 machine years in a single month. These are just some of the facts about the search giant’s computational processing infrastructure revealed in an ACM paper by Google Fellows Jeffrey Dean and Sanjay Ghemawat.

Twenty petabytes (20,000 terabytes) per day is a tremendous amount of data processing and a key contributor to Google’s continued market dominance. Competing search storage and processing systems at Microsoft (Dyrad) and Yahoo! (Hadoop) are still playing catch-up to Google’s suite of GFS, MapReduce, and BigTable.

Aug. 2004 Mar. 2006 Sep. 2007 Number of jobs (1000s) 29 171 2,217
Avg. completion time (secs) 634 874 395
Machine years used 217 2,002 11,081
map input data (TB) 3,288 52,254 403,152
map output data (TB) 758 6,743 34,774
reduce output data (TB) 193 2,970 14,018
Avg. machines per job 157 268 394
Unique implementations
map 395 1,958 4,083
reduce 269 1,208 2,418

Google processes its data on a standard machine cluster node consisting two 2 GHz Intel Xeon processors with Hyper-Threading enabled, 4 GB of memory, two 160 GB IDE hard drives and a gigabit Ethernet link. This type of machine costs approximately $2400 each through providers such as Penguin Computing or Dell or approximately $900 a month through a managed hosting provider such as Verio (for startup comparisons).

The average MapReduce job runs across a $1 million hardware cluster, not including bandwidth fees, datacenter costs, or staffing.

Summary

The January 2008 MapReduce paper provides new insights into Google’s hardware and software crunching processing tens of petabytes of data per day. Google converted its search indexing systems to the MapReduce system in 2003, and currently processes over 20 terabytes of raw web data. It’s some fascinating large-scale processing data that makes your head spin and appreciate the years of distributed computing fine-tuning applied to today’s large problems.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s