It’s In That Place I Put That Thing That Time

If you’ve been on any cloud or virtualization site lately, you’ve probably seen the picture of a bright yellow elephant staring you down. The Hadoop elephant to be precise. Apache’s Hadoop is considered to be one of the most important technologies in the transition to large scale cloud environments. In fact, Yahoo! has been the largest contributor to Hadoop and uses it across their entire organization, as does Facebook. So here is a brief introduction to this large-scale software framework.

Apache’s Hadoop is a framework that was created by Doug Cutting, who develop it to support a search engine project called Nutch (Hadoop was the name of his son’s toy elephant in case you were wondering). Hadoop allows applications to work with extremely large amounts of data (petabytes) using a java based platform. With contributors from around the globe, the opensource Hadoop Common JAR package allows organizations to map their filesystems and replicate data using the Hadoop Distributed File System (HDFS). This means that if there is a rack power outage or switch failure, there is a chance that the data may still be accessible.

So how does Hadoop work? Without going too much into detail (check out Apache Hadoop’s website for all the details), Hadoop uses HDFS to replicate data and break it up into smaller datanodes and distributes them across multiple hosts. You can set the number of copies and use the datanodes to talk to each other to continually load balance the data across the hosts. Unfortunately, because of the characteristics of HDFS, it cannot provide high availability due to the filesystem instance requiring a “namenode” which contains a directory tree of all files in the system and tracks its current location. There is only one version of the namenode and if the system goes down, it can take long periods of time for the namenode to restore itself. The upside of having this design is the increased performance of data output and the ability to handle large files.

There are numerous other advantages of the Apache Hadoop platform, many of which are too technical to get into at a high level. But you can visit the following sites for more information on this project.

http://hadoop.apache.org
http://wiki.apache.org/hadoop/FrontPage
http://en.wikipedia.org/wiki/Apache_Hadoop

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s