Chapter 12. Big Data Ecosystem

Table of Contents

12.1. Getting Data into HDFS
12.2. Compute Frameworks
12.3. Querying data in HDFS
12.4. SQL on Hadoop / HBase
12.5. Real time querying
12.6. Stream Processing
12.7. NoSQL stores
12.8. Hadoop in the Cloud
12.9. Work flow Tools / Schedulers
12.10. Serialization Frameworks
12.11. Monitoring Systems
12.12. Applications / Platforms
12.13. Distributed Coordination
12.14. Data Analytics Tools
12.15. Distributed Message Processing
12.16. Business Intelligence (BI) Tools
12.17. YARN-based frameworks
12.18. Libraries / Frameworks
12.19. Data Management
12.20. Security
12.21. Testing Frameworks
12.22. Miscellaneous

We met a few members of the Hadoop ecosystem in ???. However, the Hadoop ecosystem is bigger than that, and the Big Data ecosystem is even bigger! It is also growing at a rapid pace. Keeping track of Big Data components and products is now a full-time job :-)

In this chapter we are going to meet a few more members.

The following sites are great references as well

12.1. Getting Data into HDFS

Most big data originates outside the Hadoop cluster. These tools help you get that data into HDFS.

Table 12.1. Tools for Getting Data into HDFS

Tool | Remarks
Flume | Gathers data from multiple sources and gets it into HDFS.
Scribe | Distributed log gatherer, originally developed by Facebook. It hasn't been updated recently.
Chukwa | Data collection system.
Sqoop | Transfers data between Hadoop and relational databases (RDBMS).
Kafka | Distributed publish-subscribe system.

12.2. Compute Frameworks

Table 12.2. Hadoop Compute Frameworks

Tool | Remarks
MapReduce | Original distributed compute framework of Hadoop.
YARN | Next generation MapReduce, available in Hadoop version 2.0.
Weave | Simplified YARN programming.
Cloudera SDK | Simplified MapReduce programming.

12.3. Querying data in HDFS

Table 12.3. Querying Data in HDFS

Tool | Remarks
Java MapReduce | Native MapReduce in Java.
Hadoop Streaming | MapReduce in other languages (Ruby, Python).
Pig | Provides a higher-level data flow language to process data. Pig scripts are much more compact than Java MapReduce code.
Hive | Provides an SQL layer on top of HDFS. The data can be queried using SQL rather than writing Java MapReduce code.
Cascading Lingual | Executes ANSI SQL queries as Cascading applications on Apache Hadoop clusters.
Stinger / Tez | Next generation Hive.
Hadapt | Provides SQL support for Hadoop. (commercial product)
Greenplum HAWQ | Relational database with SQL support on top of Hadoop HDFS. (commercial product)
Cloudera Search | Text search on HDFS.
Impala | Provides real time queries over Big Data. Developed by Cloudera.
Presto | Developed by Facebook, provides fast SQL querying over Hadoop.
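
To make the Hadoop Streaming entry more concrete, here is a minimal word count sketch in Python. It assumes the usual streaming contract: the mapper and the reducer read lines from standard input and write tab-separated key/value records to standard output, and the reducer receives its input sorted by key. The function names (`map_lines`, `reduce_lines`, `run_local`, `main`) are hypothetical and chosen only for this illustration.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one tab-separated "word<TAB>1" record per word."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reduce_lines(sorted_records):
    """Reducer: sum the counts of adjacent records that share a key.

    Assumes the records arrive sorted by key, which is what the
    streaming framework does between the map and reduce stages.
    """
    parsed = (record.rstrip("\n").split("\t", 1) for record in sorted_records)
    for word, group in groupby(parsed, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_local(lines):
    """Simulate map -> sort -> reduce in one process, for local checks only."""
    return list(reduce_lines(sorted(map_lines(lines))))


def main(stage, stream_in=sys.stdin, stream_out=sys.stdout):
    """Hypothetical entry point: stage is "map" or "reduce"."""
    step = map_lines if stage == "map" else reduce_lines
    for record in step(stream_in):
        stream_out.write(record + "\n")
```

In a real job, the mapper script would call `main("map")` and the reducer script would call `main("reduce")`. `run_local` only imitates the sort that happens between the two stages, so the logic can be checked without a cluster.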

12.4. SQL on Hadoop / HBase

Table 12.4. SQL Querying Data in HDFS

Tool | Remarks
Hive | Provides an SQL layer on top of HDFS. The data can be queried using SQL rather than writing Java MapReduce code.
Stinger / Tez | Next generation Hive.
Hadapt | Provides SQL support for Hadoop. (commercial product)
Greenplum HAWQ | Relational database with SQL support on top of Hadoop HDFS. (commercial product)
Impala | Provides real time queries over Big Data. Developed by Cloudera.
Presto | Developed by Facebook, provides fast SQL querying over Hadoop.
Phoenix | SQL layer over HBase. Developed by Salesforce.com.
Spire | SQL layer over HBase. Developed by DrawnToScale.com.
Citus Data | Relational database with SQL support on top of Hadoop HDFS. (commercial product)
Apache Drill | Interactive analysis of large scale data sets.

12.5. Real time querying

Table 12.5. Real time queries

Tool | Remarks
Apache Drill | Interactive analysis of large scale data sets.
Impala | Provides real time queries over Big Data. Developed by Cloudera.

12.6. Stream Processing

Table 12.6. Stream Processing Tools

Tool | Remarks
Storm | Fast stream processing developed by Twitter.
Apache S4 |
Samza |
Malhar | Massively scalable, fault-tolerant, stateful native Hadoop platform, developed by DataTorrent.

12.7. NoSQL stores

Table 12.7. NoSQL stores for Big Data

Tool | Remarks
HBase | NoSQL store built on top of Hadoop.
Cassandra | NoSQL store (does not use Hadoop).
Redis | Key-value store.
Amazon SimpleDB | NoSQL store offered by Amazon on its cloud platform.
Voldemort | Distributed key-value store developed by LinkedIn.
Accumulo | A NoSQL store developed by the NSA (yes, that agency!).

12.8. Hadoop in the Cloud

Table 12.8. Hadoop in the Cloud

Tool | Remarks
Amazon Elastic MapReduce (EMR) | On-demand Hadoop on the Amazon cloud.
Hadoop on Rackspace | On-demand and managed Hadoop at Rackspace.
Hadoop on Google Cloud | Hadoop runs on Google Cloud.
Whirr | Tool to easily spin up and manage Hadoop clusters on cloud services like Amazon / Rackspace.

12.9. Work flow Tools / Schedulers

Table 12.9. Work flow Tools

Tool | Remarks
Oozie | Orchestrates MapReduce jobs.
Azkaban |
Cascading | Application framework for Java developers to develop robust data analytics and data management applications on Apache Hadoop.
Scalding | Scala library that makes it easy to specify Hadoop MapReduce jobs. Scalding is built on top of Cascading.
Lipstick | Pig workflow visualization.
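
At its core, orchestrating jobs means running them in an order that respects their dependencies. The sketch below illustrates only that idea, using Python's standard `graphlib` module (Python 3.9 or later). It is not how Oozie or Azkaban are configured, and the job names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each job maps to the set of jobs it depends on.
workflow = {
    "import_logs": set(),
    "clean_logs": {"import_logs"},
    "aggregate": {"clean_logs"},
    "export_report": {"aggregate"},
}


def execution_order(dependencies):
    """Return the jobs in an order where every dependency runs first.

    Raises graphlib.CycleError if the dependencies contain a cycle.
    """
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    for job in execution_order(workflow):
        print("run", job)
```

A real scheduler adds much more on top of this ordering, such as triggers, retries, and monitoring.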

12.10. Serialization Frameworks

Table 12.10. Serialization Frameworks

Tool | Remarks
Avro | Data serialization system.
Trevni | Column file format.
Protobuf | Popular serialization library (not a Hadoop project).
Parquet | Columnar storage format for Hadoop.

12.11. Monitoring Systems

Table 12.11. Tools for Monitoring Hadoop

Tool | Remarks
Hue | Developed by Cloudera.
Ganglia | Overall host monitoring system. Hadoop can publish metrics to Ganglia.
Open TSDB | Metrics collector and visualizer.
Nagios | IT infrastructure monitoring.

12.12. Applications / Platforms

Table 12.12. Applications that run on top of Hadoop

Tool | Remarks
Mahout | Machine learning library on top of Hadoop, including recommendation engines.
Giraph | Fast graph processing on top of Hadoop.
Lily | Unifies Apache HBase, Hadoop and Solr into a comprehensively integrated, interactive data platform.

12.13. Distributed Coordination

Table 12.13. Distributed Coordination

Tool | Remarks
ZooKeeper | Centralized service for maintaining configuration information, naming, and providing distributed synchronization.
BookKeeper | Distributed logging service based on ZooKeeper.

12.14. Data Analytics Tools

Table 12.14. Data Analytics on Hadoop

Tool | Remarks
R language | Software environment for statistical computing and graphics.
RHIPE | Integrates R and Hadoop.

12.15. Distributed Message Processing

Table 12.15. Distributed Message Processing

Tool | Remarks
Kafka | Distributed publish-subscribe system.
Akka | Distributed messaging system with actors.
RabbitMQ | Distributed MQ messaging system.
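
The publish-subscribe model mentioned for Kafka decouples senders from receivers: publishers send messages to a named topic, and every subscriber of that topic receives them. The toy, single-process broker below sketches only that idea. The class `InMemoryBroker` and its methods are hypothetical and are not the API of Kafka, Akka, or RabbitMQ, which also handle networking, persistence, and fault tolerance.

```python
from collections import defaultdict


class InMemoryBroker:
    """Toy topic-based broker for illustration; not a real messaging API."""

    def __init__(self):
        # Maps a topic name to the list of callbacks subscribed to it.
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        """Register a callable to be invoked for each message on the topic."""
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        """Deliver the message to every subscriber; return the delivery count."""
        callbacks = self._subscribers.get(topic, [])
        for callback in callbacks:
            callback(message)
        return len(callbacks)


if __name__ == "__main__":
    broker = InMemoryBroker()
    received = []
    broker.subscribe("logs", received.append)
    broker.publish("logs", "disk usage at 80%")
    print(received)
```

The publisher never refers to a subscriber directly. It only names the topic, which is what lets producers and consumers be added or removed independently.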

12.16. Business Intelligence (BI) Tools

Table 12.16. Business Intelligence (BI) Tools


12.17. YARN-based frameworks


12.18. Libraries / Frameworks

Table 12.18. Libraries / Frameworks

Tool | Remarks
Kiji | Build real-time Big Data applications on Apache HBase.
Elephant Bird | Compression codecs and serializers for Hadoop.
Summingbird | MapReduce on Storm / Scalding.
Apache Crunch | Simple, efficient MapReduce pipelines for Hadoop and Spark.
Apache DataFu | Pig UDFs that provide cool functionality.
Continuuity | Build apps on HBase easily.

12.19. Data Management

Table 12.19. Data Management

Tool | Remarks
Apache Falcon | Data management, data lineage.

12.20. Security

Table 12.20. Security


12.21. Testing Frameworks

Table 12.21. Testing Frameworks

Tool | Remarks
MRUnit | Unit testing framework for Java MapReduce.
PigUnit | For testing Pig scripts.

12.22. Miscellaneous

Table 12.22. Miscellaneous Stuff

Tool | Remarks
Spark | In-memory analytics engine developed by the AMPLab at UC Berkeley.
Shark (Hive on Spark) | Hive-compatible data warehouse system developed at Berkeley. Claims to be much faster than Hive.


This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.