Explain Hadoop Ecosystem and briefly explain its components.


Hadoop is a framework that deals with Big Data, but unlike other frameworks it is not a single, simple tool: it has its own family of components, each processing a different kind of work, tied together under one umbrella called the Hadoop Ecosystem.
Fig. Hadoop Ecosystem

1) SQOOP : SQL + Hadoop = SQOOP
        When we import structured data from an RDBMS table into HDFS through SQOOP, a file is created in HDFS which we can then process either directly with a MapReduce program or through HIVE or PIG. Similarly, after processing data in HDFS, we can export the processed structured data back into another RDBMS table through SQOOP, as sketched below.
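
        A minimal sketch of such an import, assuming Sqoop 1.x, whose org.apache.sqoop.Sqoop.runTool entry point accepts the same arguments as the sqoop command line; the connection string, credentials, table name, and target directory are hypothetical placeholders:

import org.apache.sqoop.Sqoop;

public class SqoopImportExample {
    public static void main(String[] args) {
        // Equivalent of: sqoop import --connect ... --table ... --target-dir ...
        // All connection details below are placeholders.
        String[] importArgs = {
            "import",
            "--connect", "jdbc:mysql://dbserver/sales",
            "--username", "etl", "--password", "secret",
            "--table", "orders",
            "--target-dir", "/user/hadoop/orders"
        };
        System.exit(Sqoop.runTool(importArgs));
    }
}

        An export works the same way, with the tool name "export" and an --export-dir pointing at the processed files in HDFS.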

2) HDFS (Hadoop Distributed File System)
        HDFS is the main storage component of Hadoop: a way of storing data in a distributed manner so that it can be computed on quickly. HDFS splits data into blocks of 64 MB (the default in Hadoop 1.x) or 128 MB (the default in Hadoop 2.x and later), which are stored on the DataNodes of the Hadoop cluster. All information about how the data is split across DataNodes, known as metadata, is kept on the NameNode, which is also a part of HDFS.
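
        A minimal sketch of storing and inspecting files through the HDFS Java API (org.apache.hadoop.fs.FileSystem); the paths are hypothetical, and the NameNode address (fs.defaultFS) is assumed to come from core-site.xml on the classpath:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS (the NameNode address) from core-site.xml.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Copy a local file into HDFS; it is split into blocks and
        // replicated across DataNodes automatically.
        fs.copyFromLocalFile(new Path("/tmp/input.txt"),
                             new Path("/user/hadoop/input.txt"));

        // List a directory; the NameNode answers this from metadata alone.
        for (FileStatus status : fs.listStatus(new Path("/user/hadoop"))) {
            System.out.println(status.getPath() + "  " + status.getLen()
                    + " bytes  blockSize=" + status.getBlockSize());
        }
        fs.close();
    }
}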

3) MapReduce Framework
        MapReduce is the other main component of Hadoop: a programming model for computing over the distributed data stored in HDFS. We can write MapReduce programs in Java (natively) or in other languages such as C++, Python, or Ruby (through Hadoop Pipes and Hadoop Streaming). The name describes the functionality: Map applies the processing logic to the data, and once that computation is over, Reduce collects the Map results to generate the final output. E.g. word count using MapReduce, sketched below.
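
        A compact version of the standard Hadoop word-count job in Java; input and output paths are taken from the command line:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reduce: sum the counts collected for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) sum += val.get();
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}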

4) HBASE
        HBase is a non-relational (NoSQL) database that runs on top of HDFS. Modeled on Google's Bigtable, HBase was created for very large tables with billions of rows and millions of columns, offering fault tolerance and horizontal scalability.
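
        A minimal sketch of writing and reading a cell through the HBase Java client API; it assumes a table named users with a column family info has already been created (e.g. from the HBase shell), and that hbase-site.xml is on the classpath:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        // Reads the ZooKeeper quorum etc. from hbase-site.xml.
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {

            // Write one cell: row key "row1", column family "info", qualifier "name".
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"),
                          Bytes.toBytes("Alice"));
            table.put(put);

            // Read it back by row key.
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println("name = " + Bytes.toString(name));
        }
    }
}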

5) HIVE
        Many programmers and analysts are more comfortable with Structured Query Language (SQL) than with Java or any other programming language; HIVE was created for them (originally at Facebook, and later donated to the Apache foundation). HIVE mainly deals with structured data stored in HDFS, using a query language similar to SQL known as HQL (Hive Query Language).
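
        A minimal sketch of running HQL from Java over JDBC against HiveServer2; the host, credentials, and the employees table are hypothetical:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver; host, port, and credentials are placeholders.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection con = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = con.createStatement()) {

            // HQL looks like SQL; under the hood Hive compiles it into
            // jobs that run over the data files stored in HDFS.
            ResultSet rs = stmt.executeQuery(
                    "SELECT department, COUNT(*) FROM employees GROUP BY department");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}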

6) Pig
        Similar to HIVE, PIG also deals with structured data, using the Pig Latin language. PIG was originally developed at Yahoo! for programmers who prefer scripting and do not want to use Java, Python, or SQL to process data. A Pig Latin program is made up of a series of operations, or transformations, applied to the input data; behind the scenes these are run as MapReduce jobs to produce the output.
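
        A minimal sketch of driving Pig Latin from Java through the PigServer API; the input path, its schema, and the output path are hypothetical. The registered transformations only build a plan, and nothing executes until the final store:

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigExample {
    public static void main(String[] args) throws Exception {
        // ExecType.MAPREDUCE sends the compiled jobs to the cluster;
        // ExecType.LOCAL would run them in-process for testing.
        PigServer pig = new PigServer(ExecType.MAPREDUCE);

        // Each registerQuery() adds one transformation to the plan.
        pig.registerQuery("logs = LOAD '/user/hadoop/access_log' "
                + "AS (user:chararray, url:chararray);");
        pig.registerQuery("grouped = GROUP logs BY user;");
        pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(logs);");

        // STORE forces execution of the whole pipeline as MapReduce jobs.
        pig.store("counts", "/user/hadoop/url_counts");
        pig.shutdown();
    }
}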

7) Mahout
        Mahout is an open-source machine learning library from Apache, written in Java. Mahout aims to be the machine learning tool of choice when the collection of data to be processed is very large, perhaps far too large for a single machine.

8) Oozie
        Oozie is a workflow scheduler that manages Hadoop jobs. It provides a mechanism to run a particular job at a given time and to repeat that job at predetermined intervals.
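
        A minimal sketch of submitting a workflow through the Oozie Java client; the server URL and HDFS paths are hypothetical, and the workflow application (its workflow.xml and jars) is assumed to be deployed in HDFS already:

import java.util.Properties;
import org.apache.oozie.client.OozieClient;

public class OozieExample {
    public static void main(String[] args) throws Exception {
        // The Oozie server URL is a placeholder.
        OozieClient wc = new OozieClient("http://localhost:11000/oozie");

        // Point the job at a workflow application already deployed in HDFS.
        Properties conf = wc.createConfiguration();
        conf.setProperty(OozieClient.APP_PATH,
                "hdfs://namenode:8020/user/hadoop/wordcount-wf");
        conf.setProperty("inputDir", "/user/hadoop/input");
        conf.setProperty("outputDir", "/user/hadoop/output");

        // Submit and start the workflow, then check its status.
        String jobId = wc.run(conf);
        System.out.println("Workflow job " + jobId + " submitted: "
                + wc.getJobInfo(jobId).getStatus());
    }
}

        Time-based repetition is handled by an Oozie coordinator, which wraps a workflow like this one with a frequency and a start/end time.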

9) Zookeeper
        ZooKeeper is a centralized coordination service for distributed applications such as a Hadoop cluster: it maintains configuration and naming information and provides distributed synchronization and group services. It allows the developer to focus on core application logic without worrying about the distributed nature of the application. It is fast, reliable, and simple.
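
        A minimal sketch of sharing one piece of configuration across a cluster with the ZooKeeper Java API; the ensemble address and znode paths are hypothetical:

import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZooKeeperExample {
    public static void main(String[] args) throws Exception {
        // Connect to the ensemble (address is a placeholder) and wait
        // until the session is actually established.
        CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("localhost:2181", 5000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // Store a small piece of shared configuration as a znode
        // (parents must exist before children can be created).
        if (zk.exists("/app", false) == null) {
            zk.create("/app", new byte[0],
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }
        zk.create("/app/config", "maxWorkers=10".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Any node in the cluster can now read (and watch) this value.
        byte[] data = zk.getData("/app/config", false, null);
        System.out.println(new String(data));
        zk.close();
    }
}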