A Walk in the Clouds

I’ve moved this site from my previous shared host to Rackspace’s cloud servers. Actually, I was looking for a cloud server and cloud space so that I could play with Hadoop. I found Amazon EC2 and S3, but their service charges are too expensive for me. While searching for alternatives, CloudServers caught my attention.

It is cheaper than Amazon’s services, but at the moment I don’t think I can test Hadoop on CloudServers with CloudSpace. I’m using it more like a virtual private server that gives me “root” access. The good thing is that you can modify the resources as you wish, so I would say it’s quite scalable. You are charged by the hour for as long as the server exists: Rackspace keeps charging even if you turn the machine off, and stops only after the server has been deleted. If you want to test something for a project, you can subscribe for just the desired amount of memory and disk space, then delete the server once you’re done. You will only be charged for that period. That’s the flexibility I prefer.

I’ll see what I can do with my server, and update the blog again.

Cloudy

These days, software platforms that perform distributed computing on a cluster have caught my attention, and this led me to Hadoop.

The Hadoop platform is an open-source implementation of Google’s MapReduce.

I think the most basic ingredient of this platform is a distributed file system. Basically, the MapReduce framework works in two steps: it maps, and then it reduces. At the end of the workflow it writes the output to a distributed file system (GFS for Google, HDFS for Hadoop). GFS is proprietary to Google, and it’s implemented in user space rather than in the kernel. The design is described in Google’s research publication on GFS.
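To make the two steps a bit more concrete, here is a minimal single-machine sketch of the idea in Python, using the usual word-count example. It is not Hadoop’s API: the function names (map_phase, shuffle, reduce_phase, word_count) are my own, and a real framework would spread the map and reduce calls across a cluster and read from and write to the distributed file system instead of plain lists and dicts.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key (the framework does this between the two steps)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values seen for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence on an in-memory list of documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

Calling word_count(["the cat", "The dog"]) counts "the" twice and the other words once. The point is that map_phase only ever looks at one record and reduce_phase only ever looks at one key, which is what lets a framework run many of them in parallel.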

Some people say that the implementation is too low-level, and some have tried to add more layers on top of the original implementations. For example, Facebook layered Hive on top of the Hadoop engine.

The MapReduce framework is supposed to handle huge amounts of data, so in general we need a data structure that can hold and process that much data comfortably. Google implemented BigTable, and HBase is the open-source alternative from the Hadoop project.
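As a rough picture of that kind of data structure: BigTable is usually described as a sparse sorted map in which a row key, a column key, and a timestamp together point to a value. The toy class below is only my in-memory illustration of that shape. ToyTable and its methods are hypothetical names, not the BigTable or HBase API, and nothing here is distributed or persistent.

```python
class ToyTable:
    """In-memory toy: (row, column) -> versions of a value, newest first."""

    def __init__(self):
        # Sparse storage: only cells that were actually written take up space.
        self._cells = {}

    def put(self, row, column, timestamp, value):
        """Store one version of a cell, keeping versions sorted newest first."""
        versions = self._cells.setdefault((row, column), [])
        versions.append((timestamp, value))
        versions.sort(key=lambda version: version[0], reverse=True)

    def get(self, row, column):
        """Return the newest value of a cell, or None if it was never written."""
        versions = self._cells.get((row, column))
        return versions[0][1] if versions else None

    def scan(self, start_row, end_row):
        """Yield (row, column, newest value) for start_row <= row < end_row, in key order."""
        for row, column in sorted(self._cells):
            if start_row <= row < end_row:
                yield row, column, self._cells[(row, column)][0][1]
```

Keeping the keys sorted is what makes range scans over neighbouring rows cheap, which is the kind of access pattern a MapReduce job reading its input tends to want.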

I think I’ll look into the Hadoop (Java) and Qt Concurrent (Qt C++) implementations of MapReduce.

Last.fm’s bashreduce looks interesting, too.