« Hadoop的HDFS | 首页 | Hadoop中的子项目Zookeeper能做什么 »

函数式编程范式-MapReduce

作者:马士华 发表于:2008-08-07 16:56 最后更新于:2008-11-28 09:47
版权声明:可以任意转载,转载时请务必以超链接形式标明文章原始出处和作者信息。
http://www.hadoop.org.cn/hadoop/functional-programming-mapreduce/

一个月前有人问我什么是函数式编程?虽然熟悉一些函数式编程的概念,那本半年前从托人从加拿大买的The Little Schemer也就看了前面几章,那天就是回答不了究竟什么是函数式编程。函数式编程对于熟悉过程式程序设计的程序员来说是一个陌生的领域,闭包(closure),延续(continuation),和柯里化(currying)等概念对于过程式程序设计的程序员是个噩梦。

Without understanding functional programming, you can't invent MapReduce,the algorithm that makes Google so massively scalable. The terms Map and Reduce come from Lisp and functional programming. MapReduce is, in retrospect, obvious to anyone who remembers from their 6.001-equivalent programming class that purely functional programs have no side effects and are thus trivially parallelizable. The very fact that Google invented MapReduce, and Microsoft didn't, says something about why Microsoft is still playing catch up trying to get basic search features to work, while Google has moved on to the next problem: building Skynet^H^H^H^H^H^H the world's largest massively parallel supercomputer. I don't think Microsoft completely understands just how far behind they are on that wave.

上段内容摘自 Joel Spolsky的Blog,明白的解释了函数式编程模型是MapReduce的灵感。

MapReduce的名字源于函数式编程模型中的两项核心操作:Map和Reduce。也许熟悉Functional Programming(FP)的人见到这两个词会倍感亲切。因为Map和Reduce这两个术语源自Lisp语言和函数式编程。Map是把一组数据一对一的映射为另外的一组数据,其映射的规则由一个函数来指定。Reduce是对一组数据进行归约,这个归约的规则由一个函数指定。Map是一个把数据分开的过程,Reduce则是把分开的数据合并的过程。如Hadoop的wordcount例子:用Map把[one,word,one,dream]进行映射就变成了[{one,1}, {word,1}, {one,1}, {dream,1}],再用Reduce把[{one,1}, {word,1}, {one,1}, {dream,1}]归约变成[{one,2}, {word,1}, {dream,1}]的结果集。

Map和Reduce操作是独立的对每个元素进行操作,操作是没有副作用的。Reduce是对N个Map结果进行归约,也就是相当于Map[1,2,..n]的结果是Reduce的函数的参数。在一个命令式语言中求值顺序是确定,每个函数都有可能会变更或依赖于外部状态,所以就必须有序的执行这些函数。只要确保没有函数修改或依赖于全局变量,N个Map的执行的顺序对于MapReduce模型来说可以是非有序的,也可以把大规模的运算并行处理。所以MapReduce也是并行的,大规模的运算相对独立。对于Map和Reduce函数中所需要的数据,在高阶函数中可以不必一次全部读取,每次读取函数所需要的一个数据即可,从这种意义上来说,符合惰性求值的概念。

在函数式编程中,操作函数的函数被称为高阶函数(high-order function)。MapReduce模型正是操作用户代码的函数。只要把Map和Reduce这两个高阶函数进行并行化处理,便形成了一个以Map和Reduce为基础的并行运算框架。具体应用相关代码写在用户代码中,之后与MapReduce结合获得并行处理的能力,所有的复杂性都在模型中隐藏了。

正是函数式编程中的概念--软件更容易组合,并行,操作是没有副作用,惰性求值,高阶函数--造就了MapReduce模型。也就有了这句:Without understanding functional programming, you can't invent MapReduce。

敝人才疏学浅,也希望前辈们能多多指点。学了这么多,该交个作业。用函数式编程实现MapReduce(Javascript edition)。

/*
 * MapRed - A MapReduce model implement of Javascript.
 *
 * author:josh ma
 * email:beijing.josh@gmial.com
 * Copyright (c) 2008 hadoop.org.cn
 * Dual licensed under the MIT (MIT-LICENSE.txt)
 * and GPL (GPL-LICENSE.txt) licenses.
 *
 *
 * $Date: 2008-8-11 14:05 AM;
 */
 
//if client did not have FireBug
 
if(!window["console"]){
   console={
    log:function(s){
        alert(s)
    }
   }
}
 
/*
a currying functions.
original url: http://www.coryhudson.com/blog/2007/03/10/javascript-currying-redux/
I modify some thing.
*/
 
function curry(f, n) {
    if (typeof(n) == 'undefined') n = f.length;
    var args = arguments;
    var _this = this;
    return function () {
        if (arguments.length >= n) {
            return f.apply(_this, arguments);
        } else {
            var a = [].slice.call(arguments,0);
            return args.callee.call(_this,function () {
                return f.apply(_this, a.concat([].slice.call(arguments,0)))
            },n - a.length);
        }
    };
}
 
//extend the property and function of s to d
 
function extend(d, s) {
  for (p in s) {
    d[p] = s[p];
  }
  return d;
}
 
//The iterator of object and array.
var each = function (o, f, args){
   if(Object.prototype.toString.apply(o) === '[object Array]')
       for ( var i = 0, ol = o.length , val = o[0];
           i < ol &&f.apply(val, args ? args : [i,val]) !== false; val = o[++i] );
   else
       for ( var i in o )
           f.apply( o[i], args ?  args : [i, o[i]]);
}
 
//The MapReduce object
var mapred = {};
 
extend(mapred,{
 
    pro :{
        load:function(){return[];},
        save:function(){return;},
        map:null,
        reduce:null,
        combiner:null
    },
 
    //mapred configuration
 
    configuration:function(pro){
        this.pro = extend(this.pro,pro)
    },
 
    //cache the data
 
    cache:{},
 
    //curried this method,the first parameter is given by this.run function.
    //the remains is wait for mapred write data.
 
    emit:curry.call(mapred,function(met,key,value){
        if(!this.cache[met]){
            this.cache[met] = {};
        }
        if(!this.cache[met][key]){
            this.cache[met][key] = [];
        }
        this.cache[met][key].push(value);
    }),
 
    /*
      the mapred main function,a high-order function,
      expect less then one parameter.
      client execute this function without given less then one parameter,
      return a function to waiting util be given one parameter to run this method.
    */ 
 
    run:curry.call(mapred,function(map,reduce,combine){
 
        var emit = this.emit;
        //iterator the input data
        each(this.pro.load(),function(kay,value){
            map(kay,value,emit("map"));
        });
 
        //get the cache of map data
        var data = this.cache["map"];
        if(combine){
            //iterator the map data
            each(data,function(key,value){
                combine(key,value,emit("combine"));
            });
            delete this.cache["map"];
        }
 
        //get the cache of map or combiner data
        data = combine ? this.cache["combine"] : this.cache["map"];
        if(reduce){
            //iterator the map data or combiner data
            each(data,function(key,value){
                reduce(key,value,emit("reduce"));
            });
            delete combine ? this.cache["combine"] : this.cache["map"];
        }
 
        //call save function to save data;
        this.pro.save(reduce ? this.cache["reduce"] : this.cache["map"]);
        delete this.cache["reduce"];
        //after save data,release the cache;
        this.cache = {};
 
    },1),
 
});
//end MapRed
 
//The input of data,like the InputFormat of Hadoop
 
var outload = function(){
    return ["one world,one dream!","hello,world!"];
}
 
//Save the Result,like the OutputFormat of Hadoop
 
var outsave = function(data){
    each(data,function(index,value){
       console.log(index +"\t" + value + "\n");
    });
}
 
//A map/reduce job configuration.like JobConf of Hadoop
 
mapred.configuration({load:outload,save:outsave});
 
//Map function
 
var map = function(key,value,emit){
     strs = value.split(/\s+|!|,/);
     each(strs,function(key,value){
        if(value)
            emit(value,1);
     });
}
//Reduce function
 
var reduce = function(key,value,emit){
     var v = 0;
     each(value,function(index,value){
        v += value;
     });
     emit(key,v);
}
 
//Sumbit the job,Like The JobClient.runJob method of Hadoop
 
//mapred.run()()()(map)
//mapred.run(map)
mapred.run(map,reduce)
//mapred.run(map,reduce,reduce)

运行代码实例

Reference:

函数式编程: 函数式编程另类指南

梦想风暴的文章: MapReduce

Functional Programming : Functional Programming and MapReduce


相关文章

引用通告

如果您想引用这篇文章到您的Blog,
请复制下面的链接,并放置到您发表文章的相应界面中。
http://www.hadoop.org.cn/hadoop/functional-programming-mapreduce/trackback/

Comments

3 Comments to “函数式编程范式-MapReduce”

  1. Kevin Wan on 2008-11-27 10:13 pm
    Gravatar

    你的js代码展示运行有error。

  2. 马士华 on 2008-11-28 10:39 am
    Gravatar

    哦,我修改了代码,原来each功能中的是
    if(o.constractor == Array) 。但是这个方法可能一个问题,如果一个数组是来自另一个frame中的,那么它的constructor 将是另一个对象。见http://blog.360.yahoo.com/blog-TBPekxc1dLNy5DOloPfzVvFIVOWMB0li?p=916。
    我要改成if(Object.prototype.toString.apply(o) === ‘[object Array]‘),错打成if(Object.prototype.toString.apply(o) !== ‘[object Array]‘),谢谢。

  3. Idea Tita… » Blog Archive » 函数式编程范式-MapReduce on 2008-11-30 11:54 pm

Leave a Reply