We met a few members of the Hadoop ecosystem in ???. However, the Hadoop ecosystem is bigger than that, and the Big Data ecosystem is even bigger! It is also growing at a rapid pace. Keeping track of Big Data components and products is now a full-time job :-)
In this chapter we are going to meet a few more members.
The following sites are great references as well.
Most big data originates outside the Hadoop cluster. These tools help you get that data into HDFS.
Table 12.1. Tools for Getting Data into HDFS
Tool | Remarks |
---|---|
Flume | Gathers data from multiple sources and gets it into HDFS. |
Scribe | Distributed log gatherer, originally developed by Facebook. It hasn't been updated recently. |
Chukwa | Data collection system. |
Sqoop | Transfers data between Hadoop and Relational Databases (RDBMS) |
Kafka | Distributed publish-subscribe system. |
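To make the publish-subscribe idea behind a tool like Kafka more concrete, here is a minimal in-memory sketch. It is not Kafka's API: the `ToyBroker` class, its methods, and the topic and consumer names are all hypothetical, and it only illustrates the general pattern of an append-only log that each consumer reads at its own pace.

```python
from collections import defaultdict


class ToyBroker:
    """In-memory stand-in for a publish-subscribe broker (hypothetical, not Kafka's API)."""

    def __init__(self):
        self.topics = defaultdict(list)   # topic -> append-only list of messages
        self.offsets = defaultdict(int)   # (consumer, topic) -> next index to read

    def publish(self, topic, message):
        """Append a message to the end of the topic's log."""
        self.topics[topic].append(message)

    def poll(self, consumer, topic):
        """Return messages this consumer has not seen yet and advance its offset."""
        log = self.topics[topic]
        start = self.offsets[(consumer, topic)]
        self.offsets[(consumer, topic)] = len(log)
        return log[start:]


broker = ToyBroker()
broker.publish("clicks", "user1:/home")
broker.publish("clicks", "user2:/cart")
first = broker.poll("hdfs-loader", "clicks")    # both messages published so far
broker.publish("clicks", "user1:/checkout")
second = broker.poll("hdfs-loader", "clicks")   # only the newly published message
other = broker.poll("dashboard", "clicks")      # a separate consumer reads from the start
```

Because each consumer keeps its own offset, a loader that writes to HDFS and a dashboard can read the same topic without interfering with each other.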
Table 12.2. Hadoop Compute Frameworks
Tool | Remarks |
---|---|
MapReduce | Original distributed compute framework of Hadoop |
YARN | Next generation MapReduce, available in Hadoop version 2.0 |
Weave | Simplified YARN programming |
Cloudera SDK | Simplified MapReduce programming |
Table 12.3. Querying Data in HDFS
Tool | Remarks |
---|---|
Java MapReduce | Native MapReduce in Java |
Hadoop Streaming | MapReduce in other languages (Ruby, Python) |
Pig | Pig provides a higher-level data flow language to process data. Pig scripts are much more compact than Java MapReduce code. |
Hive | Hive provides an SQL layer on top of HDFS. The data can be queried using SQL rather than writing Java MapReduce code. |
Cascading Lingual | Executes ANSI SQL queries as Cascading applications on Apache Hadoop clusters. |
Stinger / Tez | Next generation Hive. |
Hadapt | Provides SQL support for Hadoop. (commercial product) |
Greenplum HAWQ | Relational database with SQL support on top of Hadoop HDFS. (commercial product) |
Cloudera Search | Text search on HDFS |
Impala | Provides real time queries over Big Data. Developed by Cloudera. |
Presto | Developed by Facebook, provides fast SQL querying over Hadoop |
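As an illustration of the Hadoop Streaming row above, here is a minimal word count sketch in Python. Streaming jobs exchange tab-separated key/value lines over standard input and output, and the reducer receives its input sorted by key. The function names and the `map` / `reduce` command-line argument are conventions assumed for this sketch, not anything Hadoop requires.

```python
#!/usr/bin/env python3
"""Word count in the Hadoop Streaming style: tab-separated key/value lines."""
import sys
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line for every word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(lines):
    """Sum counts per word; assumes the input is already sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Assumed convention for this sketch: pass "map" or "reduce" as the first argument.
if __name__ == "__main__" and sys.argv[1:2] in (["map"], ["reduce"]):
    step = mapper if sys.argv[1] == "map" else reducer
    for out in step(sys.stdin):
        print(out)
```

Saved as, say, `wordcount.py` (a hypothetical file name), the script can be tried locally with `cat input.txt | python3 wordcount.py map | sort | python3 wordcount.py reduce`, where `sort` plays the role of the shuffle phase. On a cluster, the same script would be handed to the Hadoop Streaming jar through its `-mapper` and `-reducer` options.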
Table 12.4. SQL Querying Data in HDFS
Tool | Remarks |
---|---|
Hive | Hive provides an SQL layer on top of HDFS. The data can be queried using SQL rather than writing Java MapReduce code. |
Stinger / Tez | Next generation Hive. |
Hadapt | Provides SQL support for Hadoop. (commercial product) |
Greenplum HAWQ | Relational database with SQL support on top of Hadoop HDFS. (commercial product) |
Impala | Provides real time queries over Big Data. Developed by Cloudera. |
Presto | Developed by Facebook, provides fast SQL querying over Hadoop |
Phoenix | SQL layer over HBase. Developed by Salesforce.com. |
Spire | SQL layer over HBase. Developed by DrawnToScale.com. |
Citus Data | Relational database with SQL support on top of Hadoop HDFS. (commercial product) |
Apache Drill | Interactive analysis of large scale data sets. |
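The appeal of the SQL tools above is that one declarative query can replace a hand-written map and reduce pair. The sketch below shows the idea using Python's built-in `sqlite3` module as a stand-in; it does not connect to Hive or any other engine in the table, and the `page_views` table and its columns are made up for the example.

```python
import sqlite3

# SQLite stands in for a SQL-on-Hadoop engine; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (user_id TEXT, url TEXT)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?)",
    [("alice", "/home"), ("bob", "/home"), ("alice", "/cart")],
)

# One GROUP BY statement does the work of a map step (emit url, 1)
# followed by a reduce step (sum the counts per url).
rows = conn.execute(
    "SELECT url, COUNT(*) AS views FROM page_views "
    "GROUP BY url ORDER BY views DESC, url"
).fetchall()
print(rows)  # [('/home', 2), ('/cart', 1)]
```

A similar aggregate query could be submitted to an engine such as Hive, which would plan and run the distributed job on your behalf.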
Table 12.5. Real time queries
Tool | Remarks |
---|---|
Apache Drill | Interactive analysis of large scale data sets. |
Impala | Provides real time queries over Big Data. Developed by Cloudera. |
Table 12.6. Stream Processing Tools
Tool | Remarks |
---|---|
Storm | Fast stream processing developed by Twitter. |
Apache S4 | |
Samza | |
Malhar | Massively scalable, fault-tolerant, stateful native Hadoop platform, developed by DataTorrent |
Table 12.7. NoSQL stores for Big Data
Tool | Remarks |
---|---|
HBase | NoSQL built on top of Hadoop. |
Cassandra | NoSQL store (does not use Hadoop). |
Redis | Key value store. |
Amazon SimpleDB | Offered by Amazon on their environment. |
Voldemort | Distributed key value store developed by LinkedIn. |
Accumulo | A NoSQL store developed by NSA (yes, that agency!). |
Table 12.8. Hadoop in the Cloud
Tool | Remarks |
---|---|
Amazon Elastic MapReduce (EMR) | On-demand Hadoop on the Amazon cloud. |
Hadoop on Rackspace | On demand and managed Hadoop at Rackspace |
Hadoop on Google Cloud | Hadoop runs on Google Cloud |
Whirr | Tool to easily spin up and manage Hadoop clusters on cloud services like Amazon / Rackspace. |
Table 12.9. Workflow Tools
Tool | Remarks |
---|---|
Oozie | Orchestrates MapReduce jobs. |
Azkaban | |
Cascading | Application framework for Java developers to develop robust Data Analytics and Data Management applications on Apache Hadoop. |
Scalding | Scala library that makes it easy to specify Hadoop MapReduce jobs. Scalding is built on top of Cascading. |
Lipstick | Pig workflow visualization |
Table 12.13. Distributed Coordination
Tool | Remarks |
---|---|
ZooKeeper | ZooKeeper is a centralized service for maintaining configuration information, naming, and providing distributed synchronization. |
BookKeeper | Distributed logging service based on ZooKeeper. |
Table 12.14. Data Analytics on Hadoop
Tool | Remarks |
---|---|
R language | Software environment for statistical computing and graphics. |
RHIPE | Integrates R and Hadoop. |
Table 12.18. Libraries / Frameworks
Tool | Remarks |
---|---|
Kiji | Build Real-time Big Data Applications on Apache HBase |
Elephant Bird | Compression codecs and serializers for Hadoop. |
Summingbird | MapReduce on Storm / Scalding |
Apache Crunch | Simple, efficient MapReduce pipelines for Hadoop and Spark |
Apache DataFu | Pig UDFs that provide cool functionality |
Continuuity | Build apps on HBase easily |
Table 12.22. Miscellaneous Stuff
Tool | Remarks |
---|---|
Spark | In-memory analytics engine developed by Berkeley's AMPLab. |
Shark (Hive on Spark) | Hive-compatible data warehouse system developed at Berkeley. Claims to be much faster than Hive. |