Question: What is Big Data?
Answer: Big data is a term for data sets that are so large or complex that they are difficult to handle with traditional tools, whether for capture, storage, analysis, management, search, sharing, transfer, visualisation, or querying. The term also refers to using such data to make better decisions, improve operational efficiency, cut costs, and reduce risk.
Big Data is often defined by four V’s:
a) Volume – Scale of data
b) Velocity – Analysis of streaming data
c) Variety – Different forms of data
d) Veracity – Uncertainty of data
Question: Differentiate between Structured and Unstructured data?
Answer: Data that can be stored in traditional database systems in the form of rows and columns, for example online transactions, is referred to as structured data. Data that can be stored only partially in traditional database systems, for example data in XML records, is referred to as semi-structured data. Unorganised, raw data that cannot be categorised as structured or semi-structured is referred to as unstructured data. Facebook updates, tweets on Twitter, reviews, web logs, etc. are all examples of unstructured data.
Question: What are the core components and other essentials of Hadoop?
Answer: Core components of Hadoop:
HDFS stores the data sets.
MapReduce processes the data sets.
Other essentials, i.e. the Hadoop daemons:
NameNode: It is the master node, responsible for storing the metadata for all files and directories. It records where the data is stored, i.e. which blocks make up each file and which DataNodes hold them.
DataNode: It is the slave node that holds the actual data. It periodically sends heartbeats to the NameNode to signal that it is alive, and reports which blocks it holds.
Secondary NameNode: It periodically merges the edit log into the NameNode’s file system image so that the edit log doesn’t grow too large. It also keeps a copy of the image, which can be used if the NameNode fails.
JobTracker (Hadoop 1.x): This is the master daemon for submitting and tracking jobs in Hadoop. It typically runs on a master node, which in small clusters is often the same machine as the NameNode, and it assigns tasks to the different TaskTrackers.
TaskTracker (Hadoop 1.x): This is a daemon that runs on the same slave nodes as the DataNodes. TaskTrackers manage the execution of individual tasks on their node.
ResourceManager (Hadoop 2.x): It is the central authority that manages resources and schedules applications running on top of YARN.
NodeManager (Hadoop 2.x): It runs on slave machines, and is responsible for launching the application’s containers, monitoring their resource usage (CPU, memory, disk, network) and reporting these to the ResourceManager.
Question: Name the three modes in which Hadoop can be run.
Answer: The three modes in which Hadoop can be run are:
1. Standalone (local) mode
Hadoop is configured to run in a non-distributed mode, as a single Java process. This is useful for debugging.
2. Pseudo-distributed mode
Hadoop can also be run on a single node in pseudo-distributed mode, where each Hadoop daemon runs in a separate Java process.
3. Fully distributed mode
Fully distributed mode is where you have enough machines to run the NameNode and the DataNodes on different machines. It is mainly used in production.
Question: What happens when a DataNode fails?
Answer: When a DataNode fails, its heartbeats stop, so the NameNode and the JobTracker come to know that the node is dead. The JobTracker reschedules the tasks that were running on the dead node onto other nodes, and the NameNode re-replicates the blocks the dead node held from the remaining replicas.
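The detect-and-reschedule idea can be sketched as a toy model. This is not Hadoop's code: the HeartbeatMonitor class, its method names, and the timeout value are all hypothetical, chosen only to show a master marking a node dead after missed heartbeats and moving its tasks to live nodes.

```python
import time

HEARTBEAT_TIMEOUT_S = 30.0  # hypothetical value chosen for this example


class HeartbeatMonitor:
    """Toy master that tracks the last heartbeat received from each worker."""

    def __init__(self, timeout_s=HEARTBEAT_TIMEOUT_S):
        self.timeout_s = timeout_s
        self.last_seen = {}  # node name -> time of last heartbeat
        self.tasks = {}      # node name -> task ids running on that node

    def heartbeat(self, node, now=None):
        """Record a heartbeat; 'now' can be injected to make tests deterministic."""
        self.last_seen[node] = time.monotonic() if now is None else now
        self.tasks.setdefault(node, [])

    def assign(self, node, task):
        self.tasks.setdefault(node, []).append(task)

    def dead_nodes(self, now):
        """Nodes whose last heartbeat is older than the timeout."""
        return sorted(n for n, seen in self.last_seen.items()
                      if now - seen > self.timeout_s)

    def reschedule(self, now):
        """Move tasks from dead nodes to live nodes, round-robin."""
        dead = self.dead_nodes(now)
        live = sorted(n for n in self.last_seen if n not in dead)
        if not live:
            return {}  # nowhere to move the tasks, so leave state untouched
        orphaned = [t for n in dead for t in self.tasks.pop(n, [])]
        for n in dead:
            del self.last_seen[n]
        moved = {}
        for i, task in enumerate(orphaned):
            target = live[i % len(live)]
            self.tasks[target].append(task)
            moved[task] = target
        return moved
```

Calling heartbeat() for two nodes, letting one of them fall silent past the timeout, and then calling reschedule() returns a mapping from each orphaned task to the live node that took it over.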
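Question: Why is the NameNode a single point of failure?
Answer: The NameNode is the master node that holds the metadata for the whole file system, i.e. the directory tree and the location of every block. In Hadoop 1.x there is only one NameNode, so if it goes down, clients can no longer locate any data and everything comes to a stop.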
Question: What is Hadoop streaming?
Answer: Hadoop streaming is a utility that comes with the Hadoop distribution. It allows you to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer. This gives you the flexibility to write Map/Reduce jobs in other programming languages such as Python, Perl, Ruby, etc.
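As an illustration, here is a minimal word-count sketch in Python written in the streaming style: the mapper and reducer read lines and emit tab-separated key/value lines. The file name wordcount.py, the function names, and the "map"/"reduce" command-line argument are assumptions made for this example, not part of Hadoop.

```python
#!/usr/bin/env python3
"""Word count in the style of Hadoop streaming scripts (illustrative sketch)."""
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: emit 'word<TAB>1' for every whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(lines):
    """Reducer: sum the counts per word; the input must be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Hypothetical convention: "wordcount.py map" or "wordcount.py reduce".
# Nothing runs when no argument is given.
if __name__ == "__main__" and len(sys.argv) > 1:
    step = map_lines if sys.argv[1] == "map" else reduce_lines
    for record in step(sys.stdin):
        print(record)
```

The same logic can be checked locally without a cluster by piping a text file through the mapper, then sort, then the reducer. On a cluster, the script would be passed to the streaming jar through its -mapper and -reducer options, together with -input and -output paths; the jar's location depends on the installation.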
Question: Why do we use HDFS for applications having large data sets and not when there are a lot of small files?
Answer: HDFS is designed to store large amounts of data. Each file is broken down into blocks (64 MB by default in Hadoop 1.x) that are stored on the DataNodes, while the metadata about every file and block is kept in the NameNode’s memory. A very large number of small files therefore takes up far more space on the NameNode than the same amount of data stored in a few large files.
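A rough back-of-the-envelope calculation makes the difference concrete. The sketch below assumes the 64 MB block size mentioned above and a per-object cost of about 150 bytes of NameNode memory, which is a commonly quoted rule of thumb rather than a measured value. The function names are hypothetical.

```python
import math

BLOCK_SIZE_MB = 64       # block size quoted in the answer above
BYTES_PER_OBJECT = 150   # assumption: rough rule of thumb, not a measurement


def namenode_objects(file_count, file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Count the file and block objects the NameNode has to track."""
    blocks_per_file = max(1, math.ceil(file_size_mb / block_size_mb))
    return file_count * (1 + blocks_per_file)  # one file object plus its blocks


def namenode_memory_bytes(file_count, file_size_mb):
    """Estimated NameNode memory under the per-object assumption above."""
    return namenode_objects(file_count, file_size_mb) * BYTES_PER_OBJECT


# The same 10,240 MB of data stored two different ways.
one_large = namenode_memory_bytes(file_count=1, file_size_mb=10240)
many_small = namenode_memory_bytes(file_count=10240, file_size_mb=1)
print(one_large, many_small)
```

Under these assumptions the single large file needs 161 objects on the NameNode, while the 10,240 one-megabyte files need 20,480 objects, roughly 127 times as much metadata for the same amount of data.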