Hadoop is Apache software, so it is freely available for download and use. So why do we need distributions at all?
The situation is much like Linux a few years back, with distributions such as Red Hat, SUSE and Ubuntu: the software is free to download and use, but a distribution offers an easier-to-use bundle.
So what do Hadoop distros offer?
The Apache version of Hadoop ships only as tarballs. Distros repackage it into easy-to-install packages, which make it simpler for system administrators to manage.
The Hadoop ecosystem contains a lot of components (HBase, Pig, Hive, ZooKeeper, etc.) which are developed independently and have their own release schedules. There are also version dependencies among the components. For example, version 0.92 of HBase needs a particular version of HDFS.
Distros bundle versions of components that work well together. This provides a working Hadoop installation right out of the box.
Distro makers strive to ensure good quality components.
Sometimes, distros lead the way by including performance patches to the 'vanilla' versions.
Distros have predictable product release road maps. This ensures they keep up with developments and bug fixes.
Many distros come with support, which can be very valuable for a production-critical cluster.
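To make the version-dependency point concrete, here is a minimal sketch of the kind of check a distro maker effectively performs before bundling components. The `REQUIREMENTS` table, the `find_conflicts` helper and every version number in it are hypothetical and illustrative only; they are not real compatibility data.

```python
# Hypothetical compatibility requirements: each (component, version) pair
# lists the versions of other components it is ASSUMED to work with.
# These numbers are illustrative only, not real compatibility data.
REQUIREMENTS = {
    ("hbase", "0.92"): {"hdfs": {"1.0"}},
    ("hive", "0.9"): {"hdfs": {"1.0", "1.1"}},
}


def find_conflicts(bundle, requirements=REQUIREMENTS):
    """Return human-readable conflicts found in a proposed bundle.

    `bundle` maps a component name to the version chosen for it.
    """
    conflicts = []
    for component, version in bundle.items():
        # Components without an entry are treated as having no constraints.
        needed = requirements.get((component, version), {})
        for dependency, allowed in needed.items():
            actual = bundle.get(dependency)
            if actual not in allowed:
                conflicts.append(
                    f"{component} {version} needs {dependency} in "
                    f"{sorted(allowed)}, found {actual}"
                )
    return conflicts


# Two candidate bundles that differ only in the HDFS version.
good_bundle = {"hdfs": "1.0", "hbase": "0.92", "hive": "0.9"}
bad_bundle = {"hdfs": "1.1", "hbase": "0.92", "hive": "0.9"}

print(find_conflicts(good_bundle))
print(find_conflicts(bad_bundle))
```

With a plain Apache install, this bookkeeping is left to the administrator; a distro ships a bundle for which the check has already been done.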
Table 11.1. Hadoop Distributions
Distro | Remarks | Free / Premium |
---|---|---|
Apache (hadoop.apache.org) | | Completely free and open source |
Cloudera (www.cloudera.com) | | Free / Premium model (depending on cluster size) |
HortonWorks (www.hortonworks.com) | | Completely open source |
MapR (www.mapr.com) | | Free / Premium model |
Intel (hadoop.intel.com) | | Premium |
Pivotal HD (gopivotal.com) | | Premium |
Elephants can really fly in the clouds! Most cloud providers offer Hadoop.
Hadoop clusters can be set up in any cloud service that offers suitable machines.
Beyond a permanent installation, and in line with the cloud mantra 'only pay for what you use', Hadoop can also be run 'on demand' in the cloud.
Amazon offers 'On Demand Hadoop', which means there is no permanent Hadoop cluster. A cluster is spun up to do a job and shut down once the job finishes, so you pay only for usage.
Amazon offers a slightly customized version of Apache Hadoop and also offers MapR's distribution.
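The 'pay for usage' idea comes down to simple arithmetic, sketched below. The node count, hourly price and job schedule are hypothetical figures chosen for illustration, not Amazon's actual pricing. The sketch also assumes billing is strictly proportional to node-hours and ignores cluster start-up time and storage charges.

```python
def permanent_cluster_cost(nodes, price_per_node_hour, hours_in_period):
    """Cost of keeping a cluster running for the whole period."""
    return nodes * price_per_node_hour * hours_in_period


def on_demand_cost(nodes, price_per_node_hour, job_hours, jobs_in_period):
    """Cost of spinning a cluster up only for the duration of each job."""
    return nodes * price_per_node_hour * job_hours * jobs_in_period


# Hypothetical workload: a 10-node cluster at $0.50 per node-hour,
# running one 2-hour job per day over a 30-day month.
NODES = 10
PRICE = 0.50            # assumed price, not a real quote
HOURS_IN_MONTH = 30 * 24

always_on = permanent_cluster_cost(NODES, PRICE, HOURS_IN_MONTH)
on_demand = on_demand_cost(NODES, PRICE, job_hours=2, jobs_in_period=30)

print(f"permanent cluster: ${always_on:.2f}")
print(f"on-demand cluster: ${on_demand:.2f}")
```

The gap narrows as the cluster gets busier; if jobs keep it occupied around the clock, the two costs in this simplified model are the same.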