These days, anything related to software platforms that perform distributed computing on a cluster catches my attention, and this has led me to:

  • Hadoop/HBase/Pig and
  • MapReduce/Sawzall/BigTable

The Hadoop platform is essentially an open-source implementation of Google’s MapReduce.

I think the most basic ingredient of this platform is the distributed file system. Basically, the MapReduce framework works in two steps: it Maps and then it Reduces. At the end of the workflow it writes the output to a distributed file system (GFS for Google, HDFS for Hadoop). GFS is proprietary to Google, and it’s implemented in userspace as opposed to in the kernel. You can find the Google Research publication on GFS here.
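
To make those two steps concrete, here is (roughly) the classic word-count job from the Hadoop documentation, written against the Java MapReduce API. The input and output paths are expected to live on HDFS:

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map step: emit (word, 1) for every word in the input split.
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
          }
        }
      }

      // Reduce step: the framework groups the pairs by word,
      // so we just sum the counts for each one.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Input and output are paths on the distributed file system (HDFS).
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }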

Some people say that the implementation is low-level, and some have tried to add more layers on top of the original implementations. For example, Facebook layered Hive on top of the Hadoop engine.
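
As a rough sketch of what such a layer buys you: Hive exposes a SQL-like query language and a JDBC driver, and compiles queries down to MapReduce jobs behind the scenes. The server address and the docs table below are made up for illustration:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQuery {
      public static void main(String[] args) throws Exception {
        // Hive's JDBC driver; assumes a Hive server listening on localhost:10000.
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
        Connection con =
            DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
        Statement stmt = con.createStatement();
        // A SQL-like query that Hive turns into MapReduce jobs under the hood.
        // The table "docs" with a column "word" is a hypothetical example.
        ResultSet res = stmt.executeQuery("SELECT word, COUNT(1) FROM docs GROUP BY word");
        while (res.next()) {
          System.out.println(res.getString(1) + "\t" + res.getLong(2));
        }
        con.close();
      }
    }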

The MapReduce framework is supposed to handle huge amounts of data, so in general we need a data structure that can hold and process that much data comfortably. Google implemented BigTable, and HBase is the open-source alternative from the Hadoop project.
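
Just to get a feel for it, here is a minimal sketch against the classic HBase Java client API. The table name "pages" and the column family "content" are assumptions, and the table would have to exist already:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseSketch {
      public static void main(String[] args) throws Exception {
        // Reads the cluster location from hbase-site.xml on the classpath.
        Configuration conf = HBaseConfiguration.create();
        // Hypothetical table "pages" with an existing column family "content".
        HTable table = new HTable(conf, "pages");

        // Rows are keyed by arbitrary byte arrays; columns live inside families.
        Put put = new Put(Bytes.toBytes("com.example/index.html"));
        put.add(Bytes.toBytes("content"), Bytes.toBytes("html"),
                Bytes.toBytes("<html>hello</html>"));
        table.put(put);

        // Fetch the cell back by row key, family, and qualifier.
        Get get = new Get(Bytes.toBytes("com.example/index.html"));
        Result result = table.get(get);
        byte[] value = result.getValue(Bytes.toBytes("content"), Bytes.toBytes("html"));
        System.out.println(Bytes.toString(value));

        table.close();
      }
    }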

I think I’ll look into the Hadoop (Java) and Qt Concurrent (Qt/C++) implementations of MapReduce.

Last.fm’s bashreduce looks interesting, too.