Cloudy

Lately, software platforms for distributed computing on a cluster have caught my attention, and this led me to Hadoop.

The Hadoop platform is essentially an open-source implementation of Google’s MapReduce.

I think the most basic ingredient of this platform is a distributed file system. The MapReduce framework works in two steps: it maps, and then it reduces. At the end of the workflow it writes the output to a distributed file system (GFS for Google, HDFS for Hadoop). GFS is proprietary to Google, and it is implemented in user space rather than in the kernel. Google Research has published a paper describing GFS.
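To make the two steps concrete, here is a minimal single-machine sketch of the map and reduce phases, using word counting as the example. It is not how Hadoop or Google implement it: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are my own, and everything runs in memory instead of on a cluster with a distributed file system.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group values by key; a real framework does this between the two steps."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values collected for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map over every document, group by word, then reduce each group."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["the quick fox", "the lazy dog", "The end"])
print(counts["the"])  # 3
```

In a real cluster the map calls run on many machines in parallel, and the final dictionary would be written to HDFS or GFS instead of being printed.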

Some people say that the implementation is too low-level, and some have added higher-level layers on top of the original implementations. For example, Facebook layered Hive on top of the Hadoop engine.

The MapReduce framework is meant to handle huge amounts of data, so in general we need a data store that can hold and process that much data comfortably. Google implemented BigTable for this, and HBase is the open-source alternative from the Hadoop project.

I think I’ll look into the Hadoop (Java) and Qt Concurrent (Qt C++) implementations of MapReduce.

Last.fm’s bashreduce looks interesting, too.

Short notes on Linux Libraries

A library is compiled code that is usually incorporated into a program at a later time.

  • Three types: Static Libraries, Shared Libraries, and Dynamically Loaded Libraries
  • Static libraries are a collection of normal object files.
  • Their file names usually end with “.a”.
  • The collection is created with the “ar” command.
  • Shared libraries are loaded at program start-up and shared between programs.
  • Dynamically loaded libraries can be loaded and used at any time while a program is running.
  • DL libraries are not a separate library format.
  • In practice they are built as ordinary shared libraries and opened at run time with dlopen().
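As a small illustration of the last kind, here is a sketch that loads the C library at run time and calls one of its functions. It is written in Python, whose `ctypes` module opens libraries through the platform's dynamic loader (`dlopen()` on Linux). The choice of the C library and of `strlen` is mine, purely as an example; the helper name `load_c_library` is also made up.

```python
import ctypes
import ctypes.util


def load_c_library():
    """Load the C library while the program is already running."""
    name = ctypes.util.find_library("c")
    # If the lookup fails, CDLL(None) falls back to the symbols already
    # loaded into this process, which include the C library.
    return ctypes.CDLL(name) if name else ctypes.CDLL(None)


libc = load_c_library()

# Declare the C signature: size_t strlen(const char *s);
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

print(libc.strlen(b"hello"))  # 5
```

Nothing about the C library was known to the program until `load_c_library()` ran, which is the point of a dynamically loaded library.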

Linux Processes and CPU Performance

In Linux, a process can be either:

  • runnable, or
  • blocked (waiting for some event to complete)

When it’s runnable, the process competes with other processes for CPU time. A runnable process may or may not be consuming CPU time at a given moment. The CPU scheduler decides which process to run next from the list of runnable processes. While they wait for the CPU, the processes form a line known as the run queue.

When it’s blocked, the process may be waiting for data from an I/O device or for the result of a system call.

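The system usually reports load as the total number of running and runnable processes. On Linux, the load average also counts processes in uninterruptible sleep, which are usually waiting on disk I/O.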
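On Linux these load figures can be read from /proc/loadavg. The sketch below parses that file's format: three load averages (1, 5, and 15 minutes), then the number of currently runnable scheduling entities over the total, then the most recently created PID. The function name `parse_loadavg` and the sample line are made up for illustration.

```python
import os


def parse_loadavg(text):
    """Parse one line in /proc/loadavg format into a dictionary."""
    fields = text.split()
    one, five, fifteen = (float(value) for value in fields[:3])
    running, total = (int(value) for value in fields[3].split("/"))
    return {
        "1min": one,
        "5min": five,
        "15min": fifteen,
        "running": running,  # entities runnable right now
        "total": total,      # entities that currently exist
    }


# Made-up sample line, so the sketch also runs on non-Linux systems.
SAMPLE = "0.42 0.31 0.25 2/345 6789"
info = parse_loadavg(SAMPLE)
print(info["1min"], info["running"], info["total"])

# On a Linux machine, read the live values as well.
if os.path.exists("/proc/loadavg"):
    with open("/proc/loadavg") as handle:
        print(parse_loadavg(handle.read()))
```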

Multitasking
When it comes to multitasking, an OS can use either:

  • cooperative multitasking, or
  • preemptive multitasking

In preemptive multitasking, the scheduler gives each process a time slice on the CPU. The process is involuntarily suspended once it has consumed its allocated time. This prevents one process from monopolizing the available CPU time.

In cooperative multitasking, a process keeps running until it voluntarily gives up the CPU. When it suspends itself in this way, it is called yielding. The scheduler has no say in how long a process runs.
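Cooperative multitasking is easy to imitate in user space. In the toy sketch below, each task is a Python generator that does one unit of work and then yields, and a tiny round-robin loop plays the scheduler. The names `task` and `run_round_robin` are my own; this is an analogy, not how a kernel scheduler is written.

```python
from collections import deque


def task(name, steps):
    """A cooperative task: do one unit of work, then yield the CPU."""
    for step in range(1, steps + 1):
        yield f"{name}:{step}"  # voluntary yield back to the scheduler


def run_round_robin(tasks):
    """Run tasks until all finish; return the order in which work happened."""
    queue = deque(tasks)  # the run queue
    trace = []
    while queue:
        current = queue.popleft()
        try:
            trace.append(next(current))  # runs until the task's next yield
        except StopIteration:
            continue  # the task has finished, so drop it
        queue.append(current)  # go to the back of the run queue
    return trace


trace = run_round_robin([task("A", 2), task("B", 3)])
print(trace)  # ['A:1', 'B:1', 'A:2', 'B:2', 'B:3']
```

Note that the loop can only regain control when a task reaches its `yield`. A task that never yields would run forever, which is exactly the monopolizing problem that preemptive multitasking avoids.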

Scheduler
Starting with kernel 2.5, Linux got a new scheduler, the O(1) scheduler. It has since been replaced by CFS (the Completely Fair Scheduler, merged in kernel 2.6.23), as I wrote in my earlier posts.

Tools to view the CPU performance
I usually use these tools to check:

  • vmstat
  • top

These tools are quite basic, yet they provide pretty good information, and they come with almost every distro.

With vmstat, I check the number of interrupts fired (in), the number of context switches (cs), and CPU utilization such as user (us), system (sy), and idle (id). I expect to see “cs” lower than “in”. I’ll try to explain context switches and interrupts in future posts. For the time being, kindly google them.
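If you want to pull those columns out of vmstat output in a script, a small parser is enough. The sketch below maps the column names from vmstat's header row to the values of the last sample, so it does not depend on exact column positions. The function name `parse_vmstat` is hypothetical and the sample numbers are made up.

```python
def parse_vmstat(output):
    """Map vmstat column names to the integer values of the last sample."""
    rows = [line.split() for line in output.strip().splitlines()]
    # The column-name row is the one that contains both "in" and "cs".
    header = next(row for row in rows if "in" in row and "cs" in row)
    values = rows[-1]
    return dict(zip(header, (int(value) for value in values)))


# Made-up sample in the usual vmstat layout.
SAMPLE = """\
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 812345  40212 523400    0    0     3     7  250  180  4  1 94  1  0
"""

stats = parse_vmstat(SAMPLE)
print(stats["in"], stats["cs"])                # interrupts, context switches
print(stats["us"], stats["sy"], stats["id"])   # user, system, idle
print(stats["cs"] < stats["in"])               # the check described above
```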

As for top, version 3 produces more statistics. We can check the states of the processes, as well as the user and system CPU statistics (softirq, iowait, irq).
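On Linux, the raw counters behind these CPU figures are exposed in the first line of /proc/stat, which lists time spent as user, nice, system, idle, iowait, irq, softirq, and steal (followed by guest fields on newer kernels). The sketch below turns such a line into percentages. The function name `cpu_shares` and the sample line are made up, and the result is a share of time since boot, whereas top works from the difference between two samples.

```python
FIELDS = ["user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal"]


def cpu_shares(stat_line):
    """Convert the aggregate 'cpu' line of /proc/stat into percentages."""
    parts = stat_line.split()
    if parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    # Only the first eight counters are used; guest time is already
    # included in user time, so it is left out of the total.
    ticks = [int(value) for value in parts[1:1 + len(FIELDS)]]
    total = sum(ticks)
    return {name: 100.0 * value / total for name, value in zip(FIELDS, ticks)}


# Made-up sample line with round numbers.
SAMPLE = "cpu  3000 0 1000 5500 300 50 150 0 0 0"
shares = cpu_shares(SAMPLE)
print(shares["user"], shares["system"], shares["idle"], shares["iowait"])
```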

Linux Package Management

I always like to play around with new distros that I find on distrowatch.com. Gentoo is my primary distribution, and Arch is my second; Arch also offers a flexible system. Almost all Linux systems offer the same functionality and features. As far as I can see, the only differences are in how they implement the front-ends and how they manage packages.

With Gentoo, I am not attracted by easy or pretty front-ends (though you could say Gentoo’s text output is quite colorful); I’m more interested in how to add, remove, and update software packages on the system. I don’t think anyone is content with only the packages that ship with the distro. A package management system offers various ways to install and remove software, and to update a single package or the whole system. It also lets us select the software repositories from which we download packages. These are some package management systems that are usually tied to a distro and its variants:

  • apt-get for Debian, Ubuntu, etc.
  • emerge for Gentoo, Sabayon
  • yum for Fedora, etc.
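As a quick cheat sheet, the sketch below puts the common operations for these three tools into a small lookup table. The command lines are the usual ones as far as I know, but check the man pages of your distro before relying on them. The names `COMMANDS` and `command_for` are my own.

```python
# Typical command lines for each package manager (run them as root).
COMMANDS = {
    "apt-get": {
        "install": "apt-get install {pkg}",
        "remove": "apt-get remove {pkg}",
        "upgrade_all": "apt-get update && apt-get upgrade",
    },
    "emerge": {
        "install": "emerge {pkg}",
        "remove": "emerge --unmerge {pkg}",
        "upgrade_all": "emerge --sync && emerge --update --deep @world",
    },
    "yum": {
        "install": "yum install {pkg}",
        "remove": "yum remove {pkg}",
        "upgrade_all": "yum update",
    },
}


def command_for(manager, action, pkg=""):
    """Return the shell command for an action, with the package filled in."""
    return COMMANDS[manager][action].format(pkg=pkg)


print(command_for("emerge", "install", "vim"))  # emerge vim
print(command_for("yum", "upgrade_all"))        # yum update
```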

For more information about the package management systems of Linux distributions, you can always refer to these good documents:

“Cannot find -lGL”

I’m setting up another Gentoo on my office desktop.

My compiler on Gentoo stops with this error when compiling certain packages. I first noticed it when I tried to install the binary drivers for my ATI card (X1300).

I tried to re-emerge GCC, but it didn’t help. I finally found the cause after searching through the forums and the net.

It’s because of a missing symbolic link to libGL.so in /usr/lib.
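A quick way to check for this kind of problem is to look at the link itself: is it there at all, and if so, does its target still exist? The sketch below does that. The function name `symlink_status` is made up, and the path /usr/lib/libGL.so is an assumption that may differ on your system (for example /usr/lib64 on some 64-bit installs).

```python
import os


def symlink_status(path):
    """Report whether a path is missing, a dangling symlink, or fine."""
    if not os.path.lexists(path):
        return "missing"  # no file and no link at this path
    if os.path.islink(path) and not os.path.exists(path):
        return "dangling"  # the link exists, but its target is gone
    return "ok"


# Assumed location of the OpenGL library link; adjust for your system.
print(symlink_status("/usr/lib/libGL.so"))
```

Either “missing” or “dangling” would explain a linker that cannot find -lGL.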

I was using the ATI drivers for OpenGL, so my Gentoo system had libGL symbolically linked to the ATI library. “dri” was not working, so I tried to downgrade the driver to a lower version by uninstalling it first. But when I re-installed, the compiler stopped, saying it “cannot find -lGL”.

What I should have done is switch the OpenGL library back to the Xorg libraries before uninstalling the ATI drivers (on Gentoo, this is the job of eselect opengl).

Now I’m able to compile the ATI drivers, but “dri” is still not working. I need to figure that out.