Chapter 11. Hadoop Distributions

Table of Contents

11.1. The Case for Distributions
11.2. Overview of Hadoop Distributions
11.3. Hadoop in the Cloud

11.1. The Case for Distributions

Hadoop is Apache software, so it is freely available to download and use. So why do we need distributions at all?

The situation resembles Linux a few years back, with distributions such as Red Hat, SUSE and Ubuntu. The software is free to download and use, but distributions offer an easier-to-use bundle.

So what do Hadoop distros offer?

Distributions provide easy-to-install media such as RPMs

The Apache version of Hadoop ships only as tarballs. Distros package it into easy-to-install formats that system administrators can manage effectively.

Distros package multiple components that work well together

The Hadoop ecosystem contains many components (HBase, Pig, Hive, ZooKeeper, etc.) that are developed independently and have their own release schedules. There are also version dependencies among the components. For example, version 0.92 of HBase needs a particular version of HDFS.

Distros bundle versions of components that work well together. This provides a working Hadoop installation right out of the box.
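To make the version-dependency problem concrete, here is a minimal sketch of the kind of check a distro maker effectively performs before shipping a bundle. The component names are real, but every version string and every compatibility rule below is a made-up placeholder, and the `REQUIRES` table and `find_conflicts` function are hypothetical names used only for illustration.

```python
# Illustrative only: all version strings and rules are placeholders,
# not real compatibility data for any Hadoop component.
REQUIRES = {
    # (component, version) -> {dependency: set of acceptable versions}
    ("hbase", "x.2"): {"hdfs": {"y.1"}},
    ("hive", "x.5"): {"hdfs": {"y.0", "y.1"}},
}


def find_conflicts(bundle):
    """Return (component, dependency, found_version) for each mismatch."""
    conflicts = []
    for component, version in bundle.items():
        needs = REQUIRES.get((component, version), {})
        for dependency, accepted in needs.items():
            found = bundle.get(dependency)
            if found not in accepted:
                conflicts.append((component, dependency, found))
    return conflicts


# A bundle whose versions were chosen to work together, as a distro would.
tested_bundle = {"hdfs": "y.1", "hbase": "x.2", "hive": "x.5"}
# A hand-assembled set where one dependency is off by a release.
hand_picked = {"hdfs": "y.0", "hbase": "x.2", "hive": "x.5"}

print(find_conflicts(tested_bundle))  # []
print(find_conflicts(hand_picked))    # [('hbase', 'hdfs', 'y.0')]
```

With a handful of components this is easy to do by hand. With a dozen components on independent release schedules, having someone else maintain and test the table is much of what a distribution is selling.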

Tested

Distro makers test the components they bundle and strive to ensure good quality.

Performance patches

Sometimes distros lead the way by applying performance patches to the 'vanilla' Apache versions.

Predictable upgrade path

Distros have predictable product release roadmaps. This helps ensure they keep up with new developments and bug fixes.

And most importantly ... SUPPORT

Many distros come with support, which can be very valuable for a production-critical cluster.

11.2. Overview of Hadoop Distributions

Table 11.1. Hadoop Distributions

Columns: Distro, Remarks, Free / Premium

Apache
hadoop.apache.org
  • The Hadoop source
  • No packaging except tarballs
  • No extra tools
Free / Premium: Completely free and open source

Cloudera
www.cloudera.com
  • Oldest distro
  • Very polished
  • Comes with good tools to install and manage a Hadoop cluster
Free / Premium: Free / Premium model (depending on cluster size)

Hortonworks
www.hortonworks.com
  • Newer distro
  • Tracks Apache Hadoop closely
  • Comes with tools to manage and administer a cluster
Free / Premium: Completely open source

MapR
www.mapr.com
  • Has its own file system (an alternative to HDFS)
  • Boasts higher performance
  • Nice set of tools to manage and administer a cluster
  • Does not suffer from a single point of failure
  • Offers features such as mirroring and snapshots
Free / Premium: Free / Premium model

Intel
hadoop.intel.com
  • Encryption support
  • Hardware acceleration added to some layers of the stack to boost performance
  • Admin tools to deploy and manage Hadoop
Free / Premium: Premium

Pivotal HD
gopivotal.com
  • Fast SQL on Hadoop
  • Available as software only or as an appliance
Free / Premium: Premium

11.3. Hadoop in the Cloud

Elephants can really fly in the clouds! Most cloud providers offer Hadoop.

Hadoop clusters in the Cloud

Hadoop clusters can be set up in any cloud service that offers suitable machines.

However, in keeping with the cloud mantra 'only pay for what you use', Hadoop can also be run 'on demand' in the cloud, with a cluster that exists only while there is work to do.

Amazon Elastic MapReduce

Amazon offers 'On Demand Hadoop', which means there is no permanent Hadoop cluster. A cluster is spun up to run a job and shut down afterwards, so you pay only for usage.
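The appeal of 'pay for usage' is easiest to see with a little arithmetic. The sketch below compares an on-demand cluster with one that runs around the clock. The hourly rate, cluster size and job schedule are invented numbers chosen for illustration, not actual Amazon pricing, and the model ignores storage, data transfer and billing granularity.

```python
def on_demand_cost(nodes, hours_per_job, jobs_per_month, rate_per_node_hour):
    """Cost when the cluster exists only while a job is running."""
    return nodes * hours_per_job * jobs_per_month * rate_per_node_hour


def always_on_cost(nodes, rate_per_node_hour, hours_per_month=24 * 30):
    """Cost when the same cluster is kept running all month."""
    return nodes * hours_per_month * rate_per_node_hour


# Hypothetical workload: a 10-node cluster running one 2-hour job per day.
NODES = 10
RATE = 0.5  # assumed price per node-hour, in arbitrary currency units

print(on_demand_cost(NODES, hours_per_job=2, jobs_per_month=30,
                     rate_per_node_hour=RATE))  # 300.0
print(always_on_cost(NODES, rate_per_node_hour=RATE))  # 3600.0
```

Under these assumptions the on-demand cluster costs a twelfth of the permanent one. The gap narrows as the cluster gets busier, and a cluster that is in use most of the day gains little from being torn down between jobs.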

Amazon offers a slightly customized version of Apache Hadoop and also offers MapR's distribution.

Google's Compute Engine

Google offers MapR's Hadoop distribution in their Compute Engine Cloud.

Skytap Cloud

Skytap offers deployable Hadoop templates.

Links:
Skytap announcement || How to


This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.