Big Data Ecosystem :: Hadoop Illuminated

Chapter 12. Big Data Ecosystem

We met a few members of the Hadoop ecosystem in an earlier chapter. However, the Hadoop ecosystem is bigger than that, and the Big Data ecosystem is even bigger! It is also growing at a rapid pace. Keeping track of Big Data components and products is now a full-time job :-)

In this chapter we are going to meet a few more members.

The following site is a great reference as well:

hadoopecosystemtable.github.io

12.1. Getting Data into HDFS

Most big data originates outside the Hadoop cluster. These tools help you get data into HDFS.

Table 12.1. Tools for Getting Data into HDFS

Tool	Remarks
Flume	Gathers data from multiple sources and gets it into HDFS.
Scribe	Distributed log gatherer, originally developed by Facebook. It hasn't been updated recently.
Chukwa	Data collection system.
Sqoop	Transfers data between Hadoop and Relational Databases (RDBMS)
Kafka	Distributed publish-subscribe system.

12.2. Compute Frameworks

Table 12.2. Hadoop Compute Frameworks

Tool	Remarks
MapReduce	Original distributed compute framework of Hadoop
YARN	Next-generation MapReduce, available in Hadoop version 2.0
Weave	Simplified YARN programming
Cloudera SDK	Simplified MapReduce programming

12.3. Querying data in HDFS

Table 12.3. Querying Data in HDFS

Tool	Remarks
Java MapReduce	Native MapReduce in Java
Hadoop Streaming	MapReduce in other languages (Ruby, Python)
Pig	Pig provides a higher-level data flow language to process data. Pig scripts are much more compact than Java MapReduce code.
Hive	Hive provides an SQL layer on top of HDFS. The data can be queried using SQL rather than writing Java MapReduce code.
Cascading Lingual	Executes ANSI SQL queries as Cascading applications on Apache Hadoop clusters.
Stinger / Tez	Next generation Hive.
Hadapt	Provides SQL support for Hadoop. (commercial product)
Greenplum HAWQ	Relational database with SQL support on top of Hadoop HDFS. (commercial product)
Cloudera Search	Text search on HDFS
Impala	Provides real time queries over Big Data. Developed by Cloudera.
Presto	Developed by Facebook, provides fast SQL querying over Hadoop
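
To make the Hadoop Streaming entry in the table above more concrete, here is a minimal word-count sketch in Python. Hadoop Streaming runs the mapper and reducer as separate programs that read lines from standard input and write tab-separated key/value lines to standard output. The function names below (mapper, reducer, run_locally) are hypothetical, and the in-process sort only imitates the shuffle step that Hadoop performs between the two phases.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' record per word, as a streaming mapper would print."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_records):
    """Sum the counts per word. Assumes records arrive sorted by key,
    which is how Hadoop delivers them to a streaming reducer."""
    parsed = (rec.rstrip("\n").split("\t", 1) for rec in sorted_records)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_locally(lines):
    """Imitate map -> sort -> reduce in a single process (illustration only)."""
    return list(reducer(sorted(mapper(lines))))
```

In an actual job, each function would live in its own script that loops over sys.stdin and prints its records; the scripts are then passed to the streaming jar through its -mapper and -reducer options.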

12.4. SQL on Hadoop / HBase

Table 12.4. SQL Querying Data in HDFS

Tool	Remarks
Hive	Hive provides an SQL layer on top of HDFS. The data can be queried using SQL rather than writing Java MapReduce code.
Stinger / Tez	Next generation Hive.
Hadapt	Provides SQL support for Hadoop. (commercial product)
Greenplum HAWQ	Relational database with SQL support on top of Hadoop HDFS. (commercial product)
Impala	Provides real time queries over Big Data. Developed by Cloudera.
Presto	Developed by Facebook, provides fast SQL querying over Hadoop
Phoenix	SQL layer over HBase. Developed by SalesForce.com.
Spire	SQL layer over HBase. Developed by DrawnToScale.com.
Citus Data	Relational database with SQL support on top of Hadoop HDFS. (commercial product)
Apache Drill	Interactive analysis of large scale data sets.

12.5. Real time querying

Table 12.5. Real time queries

Tool	Remarks
Apache Drill	Interactive analysis of large scale data sets.
Impala	Provides real time queries over Big Data. Developed by Cloudera.

12.6. Stream Processing

Table 12.6. Stream Processing Tools

Tool	Remarks
Storm	Fast stream processing developed by Twitter.
Apache S4
Samza
Malhar	Massively scalable, fault-tolerant, stateful native Hadoop platform, developed by DataTorrent

12.7. NoSQL stores

Table 12.7. NoSQL stores for Big Data

Tool	Remarks
HBase	NoSQL built on top of Hadoop.
Cassandra	NoSQL store (does not use Hadoop).
Redis	Key value store.
Amazon SimpleDB	NoSQL store offered by Amazon as a hosted service on its cloud.
Voldemort	Distributed key value store developed by LinkedIn.
Accumulo	A NoSQL store developed by the NSA (yes, that agency!).

12.8. Hadoop in the Cloud

Table 12.8. Hadoop in the Cloud

Tool	Remarks
Amazon Elastic Map Reduce (EMR)	On demand Hadoop on Amazon Cloud.
Hadoop on Rackspace	On demand and managed Hadoop at Rackspace
Hadoop on Google Cloud	Hadoop runs on Google Cloud
Whirr	Tool to easily spin up and manage Hadoop clusters on cloud services like Amazon / Rackspace.

12.9. Work flow Tools / Schedulers

Table 12.9. Work flow Tools

Tool	Remarks
Oozie	Orchestrates MapReduce jobs.
Azkaban
Cascading	Application framework for Java developers to develop robust Data Analytics and Data Management applications on Apache Hadoop.
Scalding	Scala library that makes it easy to specify Hadoop MapReduce jobs. Scalding is built on top of Cascading.
Lipstick	Pig work flow visualization

12.10. Serialization Frameworks

Table 12.10. Serialization Frameworks

Tool	Remarks
Avro	Data serialization system.
Trevni	Column file format.
Protobuf	Popular serialization library (not a Hadoop project).
Parquet	Columnar storage format for Hadoop.

12.11. Monitoring Systems

Table 12.11. Tools for Monitoring Hadoop

Tool	Remarks
Hue	Developed by Cloudera.
Ganglia	Overall host monitoring system. Hadoop can publish metrics to Ganglia.
Open TSDB	Metrics collector and visualizer.
Nagios	IT infrastructure monitoring.

12.12. Applications / Platforms

Table 12.12. Applications that run on top of Hadoop

Tool	Remarks
Mahout	Machine learning library (including recommendation engines) on top of Hadoop.
Giraph	Fast graph processing on top of Hadoop.
Lily	Lily unifies Apache HBase, Hadoop and Solr into a comprehensively integrated, interactive data platform

12.13. Distributed Coordination

Table 12.13. Distributed Coordination

Tool	Remarks
ZooKeeper	ZooKeeper is a centralized service for maintaining configuration information, naming, and providing distributed synchronization.
BookKeeper	Distributed logging service based on ZooKeeper.

12.14. Data Analytics Tools

Table 12.14. Data Analytics on Hadoop

Tool	Remarks
R language	Software environment for statistical computing and graphics.
RHIPE	Integrates R and Hadoop.

12.15. Distributed Message Processing

Table 12.15. Distributed Message Processing

Tool	Remarks
Kafka	Distributed publish-subscribe system.
Akka	Distributed messaging system with actors.
RabbitMQ	Distributed MQ messaging system.

12.16. Business Intelligence (BI) Tools

Table 12.16. Business Intelligence (BI) Tools

Tool	Remarks
Datameer
Tableau
Pentaho
SiSense
SumoLogic

12.17. YARN-based frameworks

Table 12.17. YARN-based frameworks

Tool	Remarks
Samza
Spark	Spark on YARN
Malhar
Giraph	Giraph on YARN
Storm	Storm on YARN
Hoya	HBase on YARN

12.18. Libraries / Frameworks

Table 12.18. Libraries / Frameworks

Tool	Remarks
Kiji	Build Real-time Big Data Applications on Apache HBase
Elephant Bird	Compression codecs and serializers for Hadoop.
Summingbird	MapReduce on Storm / Scalding
Apache Crunch	Simple, efficient MapReduce pipelines for Hadoop and Spark
Apache DataFu	Pig UDFs that provide cool functionality
Continuuity	Build apps on HBase easily

12.19. Data Management

Table 12.19. Data Management

Tool	Remarks
Apache Falcon	Data management, data lineage

12.20. Security

Table 12.20. Security

Tool	Remarks
Apache Sentry
Apache Knox

12.21. Testing Frameworks

Table 12.21. Testing Frameworks

Tool	Remarks
MRUnit	Unit testing framework for Java MapReduce
PigUnit	For testing Pig scripts

12.22. Miscellaneous

Table 12.22. Miscellaneous Stuff

Tool	Remarks
Spark	In-memory analytics engine developed by Berkeley AMPLab.
Shark (Hive on Spark)	Hive compatible data warehouse system developed at Berkeley. Claims to be much faster than Hive.

