Chapter 11. Hadoop Distributions

Table of Contents

11.1. The Case for Distributions
11.2. Overview of Hadoop Distributions
11.3. Hadoop in the Cloud

11.1. The Case for Distributions

Hadoop is Apache software, so it is freely available for download and use. So why do we need distributions at all?

This is much like Linux a few years back, with distributions such as Red Hat, SUSE and Ubuntu. The software is free to download and use, but distributions offer an easier-to-use bundle.

So what do Hadoop distros offer?

Distributions provide easy-to-install media such as RPMs

The Apache version of Hadoop ships only as tarballs. Distros package it into easy-to-install packages that system administrators can manage effectively.

Distros package multiple components that work well together

The Hadoop ecosystem contains a lot of components (HBase, Pig, Hive, Zookeeper, etc.) which are developed independently and have their own release schedules. There are also version dependencies among the components. For example, version 0.92 of HBase needs a particular version of HDFS.

Distros bundle versions of components that work well together. This provides a working Hadoop installation right out of the box.
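The idea of bundling compatible versions can be sketched as a simple compatibility check. This is only an illustration: the component names come from the text above, but the `REQUIRES_HDFS` table, the `check_bundle` helper and every version label in it are invented placeholders, not real compatibility data from any distro.

```python
# Hypothetical compatibility table: (component, version) -> HDFS versions it supports.
# All version labels are placeholders for illustration only.
REQUIRES_HDFS = {
    ("hbase", "A"): {"hdfs-1"},
    ("hbase", "B"): {"hdfs-2"},
    ("hive", "A"): {"hdfs-1", "hdfs-2"},
}


def check_bundle(hdfs_version, components):
    """Return the components in the bundle that do not support hdfs_version."""
    problems = []
    for name, version in sorted(components.items()):
        supported = REQUIRES_HDFS.get((name, version))
        # Unknown combinations are treated as incompatible.
        if supported is None or hdfs_version not in supported:
            problems.append(f"{name} {version}")
    return problems


good_bundle = {"hbase": "B", "hive": "A"}
bad_bundle = {"hbase": "A", "hive": "A"}

print(check_bundle("hdfs-2", good_bundle))  # []
print(check_bundle("hdfs-2", bad_bundle))   # ['hbase A']
```

A distro maker effectively does this work up front, so that every bundle it ships corresponds to the first case rather than the second.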


Distro makers strive to ensure good quality components.

Performance patches

Sometimes distros lead the way by shipping performance patches on top of the 'vanilla' versions.

Predictable upgrade path

Distros have predictable product release road maps. This ensures they keep up with developments and bug fixes.

And most importantly... SUPPORT

Many distros come with support, which can be very valuable for a production-critical cluster.

11.2. Overview of Hadoop Distributions

Table 11.1. Hadoop Distributions

Distro | Remarks | Free / Premium

Apache
  • The Hadoop source
  • No packaging except tarballs
  • No extra tools
Free / Premium: Completely free and open source

  • Oldest distro
  • Very polished
  • Comes with good tools to install and manage a Hadoop cluster
Free / Premium: Free / Premium model (depending on cluster size)

  • Newer distro
  • Tracks Apache Hadoop closely
  • Comes with tools to manage and administer a cluster
Free / Premium: Completely open source

MapR
  • MapR has its own file system (an alternative to HDFS)
  • Boasts higher performance
  • Nice set of tools to manage and administer a cluster
  • Does not suffer from a single point of failure
  • Offers some cool features like mirroring, snapshots, etc.
Free / Premium: Free / Premium model

  • Encryption support
  • Hardware acceleration added to some layers of the stack to boost performance
  • Admin tools to deploy and manage Hadoop

Pivotal HD
  • Fast SQL on Hadoop
  • Software only or appliance

11.3. Hadoop in the Cloud

Elephants can really fly in the clouds! Most cloud providers offer Hadoop.

Hadoop clusters in the Cloud

Hadoop clusters can be set up in any cloud service that offers suitable machines.

However, in line with the cloud mantra 'only pay for what you use', Hadoop can be run 'on demand' in the cloud.

Amazon Elastic MapReduce

Amazon offers 'On Demand Hadoop', which means there is no permanent Hadoop cluster. A cluster is spun up to do a job and after that it is shut down - 'pay for usage'.
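To make 'pay for usage' concrete, here is a back-of-the-envelope sketch comparing a permanent cluster with one that is spun up per job. The node count, hourly rate and job durations are made-up numbers chosen for easy arithmetic, not Amazon pricing, and the function names are hypothetical.

```python
HOURS_PER_MONTH = 24 * 30  # simplifying assumption: a 30-day month


def permanent_cluster_cost(nodes, rate_per_node_hour):
    """Cost of keeping a cluster running for the whole month."""
    return nodes * rate_per_node_hour * HOURS_PER_MONTH


def on_demand_cost(nodes, rate_per_node_hour, job_hours, jobs_per_month):
    """Cost of spinning a cluster up only for the duration of each job."""
    return nodes * rate_per_node_hour * job_hours * jobs_per_month


nodes = 10
rate = 0.50  # hypothetical price per node-hour

permanent = permanent_cluster_cost(nodes, rate)
on_demand = on_demand_cost(nodes, rate, job_hours=2, jobs_per_month=30)

print(permanent)  # 3600.0
print(on_demand)  # 300.0
```

In this simple model the gap narrows as utilization rises: if jobs keep the cluster busy around the clock, the two costs are the same.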

Amazon offers a slightly customized version of Apache Hadoop and also offers MapR's distribution.

Google's Compute Engine

Google offers MapR's Hadoop distribution in their Compute Engine Cloud.

Skytap Cloud

Skytap offers deployable Hadoop templates.


This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.