Repost: Hadoop's code structure
2008-07-08 15:25 Category: Hadoop
Apache Hadoop Wins Terabyte Sort Benchmark: "One of Yahoo's Hadoop clusters sorted 1 terabyte of data in 209 seconds, which beat the previous record of 297 seconds in the annual general purpose (daytona) terabyte sort benchmark. The sort benchmark, which was created in 1998 by Jim Gray, specifies the input data (10 billion 100 byte records), which must be completely sorted and written to disk. This is the first time that either a Java or an open source program has won."
Amazing. You can see the history of the benchmark here. Just as interesting is that this is a fraction of Y!'s capacity - they're running 13,000+ Hadoop nodes. What does it mean? I think that Hadoop is becoming an important piece of open source infrastructure for data processing, and potentially as a storage system in its own right. Core Hadoop development is highly active, with a string of associated projects building on it, such as Pig, Mahout, HBase, Hive, Cascading and Zookeeper.
I think Hadoop is an incredible project. Crudely, you can split Hadoop into two parts: compute and storage. There's a lot more going on than that, of course: metrics, scheduling, job tracking and so on. Some of the cool stuff doesn't get as much publicity as the map/reduce feature, such as rack awareness and potential features like pause/resume. The totality of what Hadoop does is amazing. Ideally then, if you buy into the fundamental data/compute split, there'd be clean layering. For example, the filesystem layers would have no upward dependency on the map/reduce layer, which is a processing library.
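To make "no upward dependency" concrete, here is a minimal sketch of the kind of check a layering tool performs. It is not how Structure 101 works internally, and the package names (`example.fs`, `example.mapred`) and the dependency data are made up for illustration; they are not measurements of the Hadoop codebase.

```python
# Layers ordered from lowest to highest. A package may depend on its own
# layer or on any layer below it, but never on a layer above it.
# These prefixes are hypothetical placeholders, not real Hadoop packages.
LAYERS = ["example.fs", "example.mapred"]


def layer_of(package):
    """Return the index of the layer that owns this package, or None."""
    for index, prefix in enumerate(LAYERS):
        if package == prefix or package.startswith(prefix + "."):
            return index
    return None


def upward_dependencies(deps):
    """Return sorted (source, target) pairs where a lower layer uses a higher one."""
    violations = []
    for source, targets in sorted(deps.items()):
        source_layer = layer_of(source)
        if source_layer is None:
            continue  # package is outside the layers we are checking
        for target in sorted(targets):
            target_layer = layer_of(target)
            if target_layer is not None and target_layer > source_layer:
                violations.append((source, target))
    return violations


# Made-up dependency graph: package -> packages it imports.
SAMPLE_DEPS = {
    "example.mapred.job": {"example.fs.local"},    # downward: allowed
    "example.fs.local": {"example.fs"},            # same layer: allowed
    "example.fs.shell": {"example.mapred.job"},    # upward: violation
}

if __name__ == "__main__":
    for source, target in upward_dependencies(SAMPLE_DEPS):
        print(f"upward dependency: {source} -> {target}")
```

With the sample data, only the `example.fs.shell` edge is reported, because it is the one place where the storage layer reaches up into the processing layer.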
I've been digging around the Hadoop code recently; I used Structure 101 from the folks at Headway Software to help me scan the codebase and its package architecture. Aside from being a good way to get my head around Hadoop's innards, the results were interesting. The basic code metrics:
- Packages (that contain classes): 58
- Classes (outer): 611
- Classes (all): 1,246
- LOC (Non Comment Non Blank Lines Of Code): ~118K