1. What is Big Data in simple terms?
A: Big Data refers to a volume of structured and unstructured data so large that it is difficult to process with traditional database and software techniques. In most enterprise scenarios, the data is too large, arrives too fast, or exceeds current processing capacity.
2. What are the five V’s of Big Data?
A: The five V’s of Big Data are as follows:
Volume – Volume is the amount of data, which is growing at a high rate; data volumes are now measured in petabytes.
Velocity – Velocity is the rate at which data grows. Social media plays a major role in the velocity of growing data.
Variety – Variety refers to the different data types, i.e. the various data formats such as text, audio, and video.
Veracity – Veracity refers to the uncertainty of the available data. It arises because high data volumes bring incompleteness and inconsistency.
Value – Value refers to turning data into value. By turning the big data they access into value, businesses may generate revenue.
3. What is the difference between Big Data and Hadoop?
A: Big Data is a concept: it describes large amounts of data and the problem of handling them. Apache Hadoop is a framework used to handle such data. Hadoop is only one framework; many others in the wider ecosystem can also handle big data.
4. What are the steps to deploy a Big Data solution?
A: There are three steps to deploy a Big Data solution.
a. Data Ingestion
The first step in deploying a big data solution is data ingestion, i.e. the extraction of data from various sources. A source may be a CRM such as Salesforce, an Enterprise Resource Planning system such as SAP, an RDBMS such as MySQL, or log files, documents, social media feeds, etc. The data can be ingested either through batch jobs or through real-time streaming. The extracted data is then stored in HDFS.
b. Data Storage
After data ingestion, the next step is to store the extracted data. The data can be stored either in HDFS or in a NoSQL database (e.g. HBase). HDFS works well for sequential access, whereas HBase suits random read/write access.
c. Data Processing
The final step in deploying a big data solution is data processing. The data is processed through a processing framework such as Spark, MapReduce, or Pig.
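The three steps can be sketched as a toy pipeline in plain Python. This is only an illustration under stated assumptions: the in-memory `sources` dictionary stands in for real systems such as a CRM or log files, the `storage` list stands in for HDFS or HBase, and the function names `ingest`, `store`, and `process` are hypothetical.

```python
from collections import Counter


def ingest(sources):
    """Step a: pull raw records from every source (batch style)."""
    for name, records in sources.items():
        for record in records:
            yield {"source": name, "payload": record}


def store(records, storage):
    """Step b: persist the ingested records (a list stands in for HDFS/HBase)."""
    storage.extend(records)
    return storage


def process(storage):
    """Step c: run a computation over the stored data (records per source)."""
    return Counter(record["source"] for record in storage)


# Hypothetical sample data standing in for real source systems.
sources = {
    "crm": ["lead-1", "lead-2"],
    "web_logs": ["GET /", "GET /about", "POST /login"],
}
storage = []
store(ingest(sources), storage)
summary = process(storage)
```

In a real deployment each step would be a separate tool or service; the point here is only the order of the steps and the hand-off between them.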
5. What are the core components of Hadoop?
A: Hadoop is an open source framework for storing and processing big data in a distributed manner. The core components of Hadoop are –
- HDFS (Hadoop Distributed File System) – HDFS is the basic storage system of Hadoop. Large data files are stored in HDFS across a cluster of commodity hardware. It stores data reliably even when hardware fails.
- Hadoop MapReduce – MapReduce is the Hadoop layer responsible for data processing. Applications written for it process structured and unstructured data stored in HDFS. It processes high volumes of data in parallel by dividing the work into independent tasks. Processing happens in two phases, Map and Reduce. Map is the first phase, where the more complex per-record logic is specified, and Reduce is the second phase, where light-weight operations such as aggregation are specified.
- YARN – YARN is the resource management framework in Hadoop. It manages cluster resources and supports multiple data processing engines, e.g. for data science, real-time streaming, and batch processing.
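The Map and Reduce phases described above can be illustrated with a word count written in plain Python. This is a single-process sketch of the programming model, not Hadoop code: the function names are hypothetical, and the `shuffle` step stands in for the grouping by key that the framework performs between the two phases.

```python
from collections import defaultdict


def map_phase(line):
    """Map: turn one input record into intermediate (key, value) pairs."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group intermediate values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data is big", "hadoop handles big data"])
print(counts["big"], counts["data"])  # prints: 3 2
```

Each call to `map_phase` depends only on its own line, which is what lets the real framework run map tasks in parallel on different nodes.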
6. What are the common input formats in Hadoop?
A: The common input formats in Hadoop are –
- Text Input Format – The default input format in Hadoop. Each line of the file is one record: the key is the byte offset of the line and the value is the content of the line.
- Sequence File Input Format – Used to read sequence files, Hadoop's binary key-value file format.
- Key Value Input Format – Used for plain text files (files broken into lines) in which each line is split into a key and a value by a separator, a tab by default.
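The difference between the two text-based formats is easiest to see in the records each one hands to the mapper. The sketch below imitates that behaviour in plain Python; it is not Hadoop code. The function names and sample data are hypothetical, and the tab separator and the handling of a line without a separator (whole line as key, empty value) are assumptions about the default behaviour.

```python
def text_input_records(data: bytes):
    """Imitate Text Input Format: key = byte offset of the line, value = the line."""
    offset = 0
    for raw_line in data.splitlines(keepends=True):
        yield offset, raw_line.rstrip(b"\r\n").decode()
        offset += len(raw_line)


def key_value_records(data: bytes, separator="\t"):
    """Imitate Key Value Input Format: split each line at the first separator."""
    for _, line in text_input_records(data):
        key, _, value = line.partition(separator)
        # If the separator is missing, the whole line is the key and the value is empty.
        yield key, value


data = b"user1\tlogin\nuser2\tlogout\nmalformed\n"
as_text = list(text_input_records(data))
as_key_value = list(key_value_records(data))
```

With this sample, `as_text` starts with `(0, "user1\tlogin")`, while `as_key_value` starts with `("user1", "login")`.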
7. Explain some important features of Hadoop.
A: Hadoop supports the storage and processing of big data and is a widely used solution for big data challenges. Some important features of Hadoop are –
- Open Source – Hadoop is an open source framework, which means it is available free of cost. Users are also allowed to change the source code to suit their requirements.
- Distributed Processing – Hadoop supports distributed processing of data, which makes processing faster. Data in HDFS is stored in a distributed manner, and MapReduce is responsible for processing it in parallel.
- Fault Tolerance – Hadoop is highly fault-tolerant. By default, it creates three replicas of each block on different nodes. This number can be changed as required. If one node fails, the data can be recovered from another node. Node failure is detected and the data is recovered automatically.
- Reliability – Hadoop stores data on the cluster reliably, independent of any single machine. Data stored in a Hadoop environment is therefore not affected by the failure of a machine.
- Scalability – Another important feature of Hadoop is scalability. It is compatible with other hardware, and new hardware can easily be added to the nodes.
- High Availability – Data stored in Hadoop remains accessible even after a hardware failure. In that case, the data can be accessed from another path.
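The fault-tolerance behaviour described above (three replicas per block, recovery from another node after a failure) can be sketched in plain Python. This is a toy model under stated assumptions: the round-robin placement is a simplification chosen for the example and is not the placement policy HDFS actually uses, and all function and node names are hypothetical.

```python
def place_replicas(block_ids, nodes, replication=3):
    """Assign each block to `replication` distinct nodes (simple round-robin)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(i + r) % len(nodes)] for r in range(replication)]
        for i, block in enumerate(block_ids)
    }


def readable_blocks(placement, failed):
    """A block stays readable while at least one replica is on a live node."""
    return {
        block
        for block, holders in placement.items()
        if any(node not in failed for node in holders)
    }


def re_replicate(placement, nodes, failed, replication=3):
    """Copy under-replicated blocks from surviving replicas to other live nodes."""
    live = [node for node in nodes if node not in failed]
    repaired = {}
    for block, holders in placement.items():
        survivors = [node for node in holders if node not in failed]
        if not survivors:
            repaired[block] = []  # every copy was lost; nothing to copy from
            continue
        spare = [node for node in live if node not in survivors]
        repaired[block] = survivors + spare[: replication - len(survivors)]
    return repaired


nodes = ["node1", "node2", "node3", "node4"]
blocks = ["blk_0", "blk_1", "blk_2", "blk_3"]
placement = place_replicas(blocks, nodes)
failed = {"node2"}
still_readable = readable_blocks(placement, failed)
repaired = re_replicate(placement, nodes, failed)
```

After the simulated failure of `node2`, every block is still readable from another replica, and `repaired` brings each block back to three copies on the remaining nodes.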
8. What are the different modes in which Hadoop runs?
A: Apache Hadoop runs in the following three modes –
- Standalone (Local) Mode – By default, Hadoop runs in local mode, i.e. on a single, non-distributed node. This mode uses the local file system for input and output operations. It does not use HDFS, so it is mainly used for debugging. No custom changes to the configuration files are needed in this mode.
- Pseudo-Distributed Mode – In pseudo-distributed mode, Hadoop runs on a single node, as in standalone mode, but each daemon runs in a separate Java process. Because all the daemons run on one node, that node acts as both Master and Slave.
- Fully-Distributed Mode – In fully-distributed mode, the daemons run on separate nodes and thus form a multi-node cluster. The Master and Slave daemons run on different nodes.
9. Explain the major difference between HDFS block and InputSplit.
A: In simple terms, a block is the physical representation of data while split is the logical representation of data present in the block. Split acts as an intermediary between the block and the mapper.
Suppose we have two blocks:
Block 1: ii nntteell
Block 2: Ii ppaatt
A mapper that reads only Block 1 reaches the end of the block in the middle of the record, because the rest of the record is stored in Block 2. This is where the split comes into play: it forms a logical group that spans Block 1 and Block 2, so the record is read as a single unit.
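How a logical split can deliver whole records even when the physical blocks cut them in half can be sketched in plain Python (this is not Hadoop code). The sketch assumes newline-terminated records and a convention similar to the one used by line-oriented readers: every split except the first skips the line already in progress at its start, and every split reads past its own end to finish the last record it started. The function name `read_records` and the sample data are hypothetical.

```python
def read_records(data: bytes, start: int, length: int):
    """Return the newline-terminated records that belong to one split."""
    end = start + length
    pos = start
    if start != 0:
        # The line in progress at `start` belongs to the previous split: skip it.
        newline = data.find(b"\n", pos)
        pos = len(data) if newline == -1 else newline + 1
    records = []
    # Keep reading while a record starts at or before `end`;
    # the last one may run on into the next block.
    while pos <= end and pos < len(data):
        newline = data.find(b"\n", pos)
        line_end = len(data) if newline == -1 else newline
        records.append(data[pos:line_end].decode())
        pos = line_end + 1
    return records


data = b"alpha\nbravo\ncharlie\n"  # 20 bytes in total
block_size = 8                      # tiny blocks, so records cross block boundaries
splits = [
    (offset, min(block_size, len(data) - offset))
    for offset in range(0, len(data), block_size)
]
per_split = [read_records(data, start, length) for start, length in splits]
```

Here "bravo" begins in the first 8-byte block and ends in the second, yet it is returned exactly once, by the first split; the second split starts at the next full record.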
The InputFormat and its RecordReader then turn each InputSplit into key-value pairs and pass them to the mapper for processing. If you have limited resources, you can increase the split size to reduce the number of map tasks. For instance, if a file occupies 10 blocks of 64 MB each (640 MB in total) and resources are limited, you can set the ‘split size’ to 128 MB. Each split then forms a logical group of 128 MB, so only 5 map tasks are needed instead of 10.
However, if the file is marked as not splittable (for example, when the input format's isSplitable() method returns false), the whole file forms a single InputSplit and is processed by one map task, which takes longer the bigger the file is.
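The arithmetic in this answer can be checked with a short Python sketch. It assumes the split size is obtained by clamping the block size between a configured minimum and maximum, i.e. max(minSize, min(maxSize, blockSize)), and it ignores any special handling of a small final split. The function names and the `splittable` flag are hypothetical stand-ins for the job configuration.

```python
MB = 1024 * 1024


def compute_split_size(block_size, min_size=1, max_size=2**63 - 1):
    """Clamp the block size between the configured minimum and maximum."""
    return max(min_size, min(max_size, block_size))


def plan_splits(file_size, split_size, splittable=True):
    """Return (offset, length) pairs, one per map task."""
    if file_size == 0:
        return []
    if not splittable:
        # The whole file becomes a single split handled by one map task.
        return [(0, file_size)]
    return [
        (offset, min(split_size, file_size - offset))
        for offset in range(0, file_size, split_size)
    ]


file_size = 640 * MB
block_size = 64 * MB

default_splits = plan_splits(file_size, compute_split_size(block_size))
larger_splits = plan_splits(
    file_size, compute_split_size(block_size, min_size=128 * MB)
)
single_split = plan_splits(
    file_size, compute_split_size(block_size), splittable=False
)
print(len(default_splits), len(larger_splits), len(single_split))  # prints: 10 5 1
```

Raising the minimum split size to 128 MB halves the number of map tasks for the 640 MB file, and disabling splitting reduces it to one.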