第二次Hadoop中国沙龙议程安排
2009-05-05 19:33 分类: Hadoop
转载至: http://hadooper.cn
时间
2009年5月9日(星期六)
地点
玉泉路19号乙 主楼B326会议室)
位置信息:Google Map
公交信息:搜易吧
议程
10:00 am - 10:20 am
Welcome Speech 致欢迎词
YaoDong Cheng (Institute of High Energy Physics, CAS)
10:20 am – 11:00 am
Session 1 Hive/Data Warehousing
Hive-Progress and Future: Open-Source Solution for Data Warehousing
Zheng Shao (Facebook Inc)
Free discussion around this topic
11:00 am – 11:30 am
Experience of Studying Column Based Storage on Hadoop, and Future Work
Huaming Liao/Yongqiang He (ICT,CAS)
Free discussion
11:30 am– 12:20 am
Session 2 Hadoop In Baidu
Scheduler used in Baidu, Computing Security and Data Security
Shouyan Wang (Baidu Inc)
Free discussion
中餐、午休 (12:00 -1:00,提供中餐和饮料)
1:00 pm – 2:30 pm
Session 3 Hadoop From Yahoo!
Hadoop Roadmap (10 min)
Preeti from Yahoo R&D India (Video Conference, in English)
Feature Highlights-Scheduler, MR & Avro (45 min)
Sharad from Yahoo R&D India (Video Conference, in English)
Machine Learning and Data Mining using Hadoop and PIG
Yahoo R&D Beijing
Free Discussion
2:30 pm – 3:00 pm
Session 3 Hadoop on GIS
Hadoop on GIS
Jizhong Han/Kai Wang
3:00 pm – 3:30 pm
Session 4 Profiling
Profiling Hadoop with X-Trace and Ganglia
Sirui Yang (Intel Research)
3:30 pm – 4:30 pm
Free Discussion
Many thanks to all hadoop in china salon committees!
第二次Hadoop中国沙龙欢迎您的参与或演讲
2009-04-28 10:37 分类: Hadoop
此次沙龙将于5月9日在中国科学院高能物理所举办, 地址是北京市石景山区玉泉路。报名请发邮件至: heyongqiang@software.ict.ac.cn
以下是原始文件:
Call For Talk&Participation
The second Hadoop in China Salon
Hadoop in China Salon is a free discussion forum on Hadoop related technologies and ideas.
Five months ago, in Nov 2008, we successfully concluded the first hadoop salon in Beijing. In the previous salon, more than sixty people attended. Hao Zheng (Yahoo! Inc), Zheng Shao (Facebook Inc) and Wang ShouYan (Baidu Inc) from the industry gave us very impressive talks on their recent progresses on Hadoop. And about thirty people (students/professors) are from Universities and Institutes, who also impressed us deeply. Thank you all the attendees again. Without you, it would never success.
In early May this year, we are going to host the second Hadoop in China Salon. It is our great honor to invite you again to take part in the salon. The meeting is scheduled on May 9 (the first weekend after Labor Day Holiday). And this time we are trying to hold it in iHep(Institute of High Energy Physics, Chinese Academy of Science). The good part is that we can visit the Beijing Electron Positron Collider (BEPC, the biggest EPC in China). The bad part is that it may be a little far from Zhong GuanCun, and it is on Yu Quan Road.
We now welcome talkers for this salon with our greatest sincerity. Please share with us the latest progress in your work or your team’s work on Hadoop. If you are interested in giving a talk, please drop me an email. And we expect more hadoop users and developers to join us this time. Please feel free to come to the Hadoop discussion forum. Please drop me an email if you would like to come, so we can prepare more free drink and foods.
If you have any thoughts on this meeting, please let us know by dropping me an email.
BTW, we have collected several fantastic talks so far, including:
1) One by Zheng Shao on the recent progress on Hive.
2) Two or more talks organized by Yahoo! China R&D (Thanks Yahoo! India R&D, Yahoo! China R&D and Hao Zheng ). One talk is research on machine learning. The other talk is not finally settled, it would be about hadoop roadmap, or enhancements, or new scheduler or Pig.
3) One talk from Baidu Inc. The talk will cover the hadoop scheduler used in Baidu, data security and computing security.
4) Two small talks from two teams from ICT (Institute of Computing Technology, Chinese Academy of Science). One is about our research effort on data organization and their influences. The other is about Hadoop On GIS.
Thanks for all the talkers.
Many thanks to Cheng Yaodong from ihep(ihep.ac.cn) for providing the venue and infrastructure.
From: Yongqiang He
Hadoop中文文档
2008-12-03 12:10 分类: Hadoop
Hadoop的0.17版本的中文文档,来自Hadoop的jira。各位可以不用再啃那一砣英文了。在此,感谢阿里巴巴的Distributed Computing部门的闫同学和他的团队所做的贡献。
Update:Hadoop的0.18版本的中文文档。
谈谈Hadoop和分布式Lucene
2008-11-14 01:30 分类: Hadoop, MapReduce, Relative
Lucene是大家用的最多的开源搜索引擎。本文不探讨Lucene如何实时更新(http://issues.apache.org/jira/browse/LUCENE-1313),和如何修改Lucene评分机制,添加如PageRank评分因子,本文只讨论分布式的Lucene。

说到Lucene一般都会提到Nutch,Hadoop最早是Doung Cutting为了Nutch的crawler和indexer所开发的做为nutch的两个package。Hadoop在Nutch中的作用就是抓取页面和建立索引。其抓取和建索引详见页面。因为Hadoop的seek能力限制,Nutch的分布式搜索使用手动配置的机制,缺少管理索引能力和服务器的机制。具体步骤:在webserver中修改search-servers.txt把搜索服务的服务器地址和服务端口添加进去,然后把nutch-site.xml中的searcher.dir指到search-servers.txt保存的目录,在提供搜索服务的服务器上手动的从HDFS中拷贝索引文件到本地。启动DistributedSearch.Server提供搜索服务。Nutch节点失效通过搜索请求IPC调用的超时来通知。
Hadoop中的子项目Zookeeper能做什么
2008-08-20 16:09 分类: Hadoop, Relative
很高兴得看到Yahoo捐献的Zookeeper已经从sourceforge迁移到Apache,并成为Hadoop的子项目.那么ZooKeeper是什么呢?Zookeeper是Google的Chubby一个开源的实现.是高有效和可靠的协同工作系统.Zookeeper能够用来leader选举,配置信息维护等.在一个分布式的环境中,我们需要一个Master实例或存储一些配置信息,确保文件写入的一致性等.Zookeeper能够保证如下3点:
- Watches are ordered with respect to other events, other watches, and
asynchronous replies. The ZooKeeper client libraries ensures that
everything is dispatched in order. - A client will see a watch event for a znode it is watching before seeing the new data that corresponds to that znode.
- The order of watch events from ZooKeeper corresponds to the order of the updates as seen by the ZooKeeper service.
按此阅读全文 Hadoop中的子项目Zookeeper能做什么 »
函数式编程范式-MapReduce
2008-08-07 16:48 分类: Hadoop, MapReduce
一个月前有人问我什么是函数式编程?虽然熟悉一些函数式编程的概念,那本半年前从托人从加拿大买的The Little Schemer也就看了前面几章,那天就是回答不了究竟什么是函数式编程。函数式编程对于熟悉过程式程序设计的程序员来说是一个陌生的领域,闭包(closure),延续(continuation),和柯里化(currying)等概念对于过程式程序设计的程序员是个噩梦。
Without understanding functional programming, you can't invent MapReduce,the algorithm that makes Google so massively scalable. The terms Map and Reduce come from Lisp and functional programming. MapReduce is, in retrospect, obvious to anyone who remembers from their 6.001-equivalent programming class that purely functional programs have no side effects and are thus trivially parallelizable. The very fact that Google invented MapReduce, and Microsoft didn't, says something about why Microsoft is still playing catch up trying to get basic search features to work, while Google has moved on to the next problem: building Skynet^H^H^H^H^H^H the world's largest massively parallel supercomputer. I don't think Microsoft completely understands just how far behind they are on that wave.
上段内容摘自 Joel Spolsky的Blog,明白的解释了函数式编程模型是MapReduce的灵感。
MapReduce的名字源于函数式编程模型中的两项核心操作:Map和Reduce。也许熟悉Functional Programming(FP)的人见到这两个词会倍感亲切。因为Map和Reduce这两个术语源自Lisp语言和函数式编程。Map是把一组数据一对一的映射为另外的一组数据,其映射的规则由一个函数来指定。Reduce是对一组数据进行归约,这个归约的规则由一个函数指定。Map是一个把数据分开的过程,Reduce则是把分开的数据合并的过程。如Hadoop的wordcount例子:用Map把[one,word,one,dream]进行映射就变成了[{one,1}, {word,1}, {one,1}, {dream,1}],再用Reduce把[{one,1}, {word,1}, {one,1}, {dream,1}]归约变成[{one,2}, {word,1}, {dream,1}]的结果集。
Hadoop的HDFS
2008-07-31 13:12 分类: Hadoop
HDFS的设计思想:
构建一个非常庞大的分布式文件系统。在集群中节点失效是正常的,节点的数量在Hadoop中不是固定的.单一的文件命名空间,保证数据的一致性,写入一次多次读取.典型的64MB的数据块大小,每一个数据块在多个DN(DataNode)有复制.客户端通过NN(NameNode)得到数据块的位置,直接访问DN获取数据。
NameNode功能:
映射一个文件到一批的块,映射数据块到DN节点上。集群配置管理,数据块的管理和复制。处理事务日志:记录文件生成,删除等。因为NameNode的全部的元数据在内存中存储,所以NN的内存大小决定整个集群的存储量。NN内存中保存的数据:
- 文件列表
- 每一个文件的块列表
- 每一个DN中块的列表
- 文件属性:生成时间,复制参数,文件许可(ACL)
转贴:Hadoop’s code structure
2008-07-08 15:25 分类: Hadoop
Apache Hadoop Wins Terabyte Sort Benchmark: "One of Yahoo's Hadoop clusters sorted 1 terabyte of data in 209 seconds, which beat the previous record of 297 seconds in the annual general purpose (daytona) terabyte sort benchmark. The sort benchmark, which was created in 1998 by Jim Gray, specifies the input data (10 billion 100 byte records), which must be completely sorted and written to disk. This is the first time that either a Java or an open source program has won."
Amazing. You can see the history of the benchmark here. Just as interesting is that this is a fraction of Y!'s capacity - they're running 13,000+ Hadoop nodes. What does it mean? I think that Hadoop is becoming an important piece of open source infrastructure for data processing, and potentially as a storage system in its own right. Core Hadoop development is highly active, with a string of associated projects building on it, such as Pig, Mahout, HBase, Hive, Cascading and Zookeeper.
I think Hadoop is an incredible project. Crudely you can split Hadoop into two parts - compute and storage. There's a lot more going on than that of course - metrics, scheduling, job tracking and so on, cool stuff that doesn't get as much publicity as the map/reduce feature, such as the ability to be rack aware and potential features like pause/resume - the totality of what Hadoop does is amazing. Ideally then, if you
buy into the fundamental data/compute split, there'd be clean layering - for
example the filesytem layers would have no upward dependency on the
map/reduce layer, which is a processing library.
I've been digging around the Hadoop code recently; I used Structure 101 from the folks at Headway Software to help me scan the codebase and its package architecture. Aside from
being a good way to get my head around Hadoop's innards, the results
were interesting. The basic code metrics:
- Packages (that contain classes): 58
- Classes (outer): 611
- Classes (all): 1,246
- LOC (Non Comment Non Blank Lines Of Code): ~118K
按此阅读全文 转贴:Hadoop’s code structure »
Hadoop赢得1TB排序基准评估第一名
2008-07-04 12:16 分类: Hadoop
强烈祝贺Hadoop赢得1TB排序基准评估第一名。Yadoo的一个集群最近用209秒时间排序1TB的数据,比上一年的的纪录保持者保持的297秒快乐将近90秒。1998年Jim Gray创建了排序基准评估的方法,建立100亿条100个字节的纪录,评估对这100亿条纪录完全排序和把纪录写入磁盘的时间。评估是建立在未发布的版本0.18上的。排序所用的源码在这个地址。
微软收购Powerset后HBase的项目还会继续吗?
2008-07-03 22:05 分类: HBase
七月一号,微软买下了自然语言处理方面的公司Powerset(HBase的主要贡献者).据传言花了一亿美金.Powerset将会加入微软的Live Search团队.Powerset的语义搜索是建立在开源软件Hadoop上.看来微软在追赶Google上是不惜血本.Google现在的搜索还是基于用户输入的单词,而Powerset的语义理解更能体现用户意图,如果你输入inn,那么搜索引擎会返回搜索引擎对inn的解释和理解.在这一点上和旅游搜索引擎uptake.com一样.现在的搜索引擎并不理解"shrub"和"tree"在概念上基本相同,"cancer"表示癌症,可又表示巨蟹座,微软希望Powerset能帮助解决这些问题,以便搜索引擎能更好的体现用户意图。
那么微软下的Powerset还会继续贡献2.0,3.0版本的Hbase吗?我不知道.但是我们还有另一个选择:hypertable.