Hadoop Illuminated

Mark Kerzner


Sujee Maniyam



To the open source community

This book on GitHub
Companion project on GitHub


To the Hadoop community

  • Apache Hadoop is open source software from the Apache Software Foundation.
  • Apache, Apache Hadoop, and Hadoop are trademarks of The Apache Software Foundation. Used with permission. No endorsement by The Apache Software Foundation is implied by the use of these marks.
  • For brevity, we will refer to Apache Hadoop as Hadoop.

From Mark
I would like to express my gratitude to my editors, co-authors, colleagues, and bosses who shared the thorny path to working clusters - in the hope of making it less thorny for those who follow. Seriously, folks, Hadoop is hard, and Big Data is tough, and there are many related products and skills that you need to master. Therefore, have fun, provide your feedback, and I hope you will find the book entertaining.

"The author's opinions do not necessarily coincide with his point of view." - Victor Pelevin, "Generation P"

From Sujee
To the kind souls who helped me along the way.

Copyright 2013 Hadoop Illuminated LLC

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at


Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Table of Contents

1. Who is this book for?
1.1. About "Hadoop Illuminated"
2. About Authors
3. Big Data
3.1. What is Big Data?
3.2. Human Generated Data and Machine Generated Data
3.3. Where does Big Data come from?
3.4. Examples of Big Data in the Real World
3.5. Challenges of Big Data
3.6. Taming Big Data
4. Hadoop and Big Data
4.1. How Hadoop solves the Big Data problem
4.2. Business Case for Hadoop
5. Hadoop for Executives
6. Hadoop for Developers
7. Soft Introduction to Hadoop
7.1. Hadoop = HDFS + MapReduce
7.2. Why Hadoop?
7.3. Meet the Hadoop Zoo
7.4. Hadoop alternatives
7.5. Alternatives for distributed massive computations
7.6. Arguments for Hadoop
8. Hadoop Distributed File System (HDFS) -- Introduction
8.1. HDFS Concepts
8.2. HDFS Architecture
9. Introduction To MapReduce
9.1. How I failed at designing distributed processing
9.2. How MapReduce does it
9.3. How MapReduce really does it
9.4. Understanding Mappers and Reducers
9.5. Who invented this?
9.6. The benefits of MapReduce programming
10. Hadoop Use Cases and Case Studies
10.1. Politics
10.2. Data Storage
10.3. Financial Services
10.4. Health Care
10.5. Human Sciences
10.6. Telecoms
10.7. Travel
10.8. Energy
10.9. Logistics
10.10. Retail
10.11. Software / Software as a Service (SaaS) / Platforms / Cloud
10.12. Imaging / Videos
10.13. Online Publishing, Personalized Content
11. Hadoop Distributions
11.1. The Case for Distributions
11.2. Overview of Hadoop Distributions
11.3. Hadoop in the Cloud
12. Big Data Ecosystem
12.1. Getting Data into HDFS
12.2. Compute Frameworks
12.3. Querying data in HDFS
12.4. SQL on Hadoop / HBase
12.5. Real time querying
12.6. Stream Processing
12.7. NoSQL stores
12.8. Hadoop in the Cloud
12.9. Workflow Tools / Schedulers
12.10. Serialization Frameworks
12.11. Monitoring Systems
12.12. Applications / Platforms
12.13. Distributed Coordination
12.14. Data Analytics Tools
12.15. Distributed Message Processing
12.16. Business Intelligence (BI) Tools
12.17. YARN-based frameworks
12.18. Libraries / Frameworks
12.19. Data Management
12.20. Security
12.21. Testing Frameworks
12.22. Miscellaneous
13. Business Intelligence Tools For Hadoop and Big Data
13.1. The case for BI Tools
13.2. BI Tools Feature Matrix Comparison
13.3. Glossary of terms
14. Hardware and Software for Hadoop
14.1. Hardware
14.2. Software
15. Hadoop Challenges
15.1. Hadoop is a cutting edge technology
15.2. Hadoop in the Enterprise Ecosystem
15.3. Hadoop is still rough around the edges
15.4. Hadoop is NOT cheap
15.5. MapReduce is a different programming paradigm
15.6. Hadoop and High Availability
16. Publicly Available Big Data Sets
16.1. Pointers to data sets
16.2. Generic Repositories
16.3. Geo data
16.4. Web data
16.5. Government data
17. Big Data News and Links
17.1. News sites
17.2. Blogs from Hadoop vendors

List of Figures

3.1. Tidal Wave of Data
4.1. Too Much Data
4.2. Scaling Storage
6.1. Hadoop Job Trends
7.1. Hadoop coin
7.2. Will you join the Hadoop dance?
7.3. The Hadoop Zoo
8.1. Cray computer
8.2. HDFS file replication
8.3. HDFS master / worker design
8.4. HDFS architecture
8.5. Disk seek vs scan
8.6. HDFS file append
9.1. Dreams
9.2. MapReduce analogy: Exit Polling

List of Tables

6.1. Hadoop Roles
7.1. Comparison of Big Data
11.1. Hadoop Distributions
12.1. Tools for Getting Data into HDFS
12.2. Hadoop Compute Frameworks
12.3. Querying Data in HDFS
12.4. SQL Querying Data in HDFS
12.5. Real time queries
12.6. Stream Processing Tools
12.7. NoSQL stores for Big Data
12.8. Hadoop in the Cloud
12.9. Workflow Tools
12.10. Serialization Frameworks
12.11. Tools for Monitoring Hadoop
12.12. Applications that run on top of Hadoop
12.13. Distributed Coordination
12.14. Data Analytics on Hadoop
12.15. Distributed Message Processing
12.16. Business Intelligence (BI) Tools
12.17. YARN-based frameworks
12.18. Libraries / Frameworks
12.19. Data Management
12.20. Security
12.21. Testing Frameworks
12.22. Miscellaneous Stuff
13.1. BI Tools Comparison: Data Access and Management
13.2. BI Tools Comparison: Analytics
13.3. BI Tools Comparison: Visualizing
13.4. BI Tools Comparison: Connectivity
13.5. BI Tools Comparison: Misc
14.1. Hardware Specs

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.