Question: Can I set the number of reducers to zero?
Answer: Yes. Setting the number of reducers to zero is a valid configuration in Hadoop. When you set the number of reducers to zero, no reduce tasks are executed, and the output of each mapper is written to a separate file on HDFS.
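As a sketch, a map-only job can be requested in the driver code (using the standard Hadoop 2.x `Job` API; class and job names here are illustrative):

```java
// Driver-side sketch of a map-only job; assumes the Hadoop 2.x mapreduce API.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class MapOnlyDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "map-only-example");
        job.setNumReduceTasks(0); // zero reducers: mapper output goes straight to HDFS
        // ... set the mapper class and input/output paths, then:
        // job.waitForCompletion(true);
    }
}
```

The same effect can be had from the command line with `-D mapreduce.job.reduces=0`.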
Question: What are IdentityMapper and IdentityReducer in MapReduce?
Answer: IdentityMapper implements the identity function, mapping inputs directly to outputs. If the MapReduce programmer does not set the Mapper class using JobConf.setMapperClass, then IdentityMapper.class is used as the default.
IdentityReducer performs no reduction, writing all input values directly to the output. If the MapReduce programmer does not set the Reducer class using JobConf.setReducerClass, then IdentityReducer.class is used as the default.
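As a sketch against the classic `org.apache.hadoop.mapred` API (where these defaults apply), setting the identity classes explicitly is equivalent to leaving them unset:

```java
// Configuration sketch using the classic mapred API; explicit only for illustration,
// since these classes are already the defaults.
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

JobConf conf = new JobConf();
conf.setMapperClass(IdentityMapper.class);   // default when no mapper is set
conf.setReducerClass(IdentityReducer.class); // default when no reducer is set
```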
Question: Where is the mapper output (intermediate key-value data) stored?
Answer: The mapper output (intermediate data) is stored on the local file system of each individual mapper node, typically in a temporary directory whose location can be configured by the Hadoop administrator. The intermediate data is cleaned up after the Hadoop job completes.
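The local directory for this intermediate data is set by the administrator; for example (property name from Hadoop 2.x; the path value is an illustrative placeholder):

```xml
<!-- mapred-site.xml: where map outputs are spilled on each node's local disk.
     The path below is a placeholder, not a required value. -->
<property>
  <name>mapreduce.cluster.local.dir</name>
  <value>/data/hadoop/local</value>
</property>
```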
Question: Does MapReduce programming model provide a way for reducers to communicate with each other?
Answer: No, the MapReduce programming model does not provide a way for reducers to communicate with each other; reducers run in isolation.
Question: Why doesn’t MapReduce use Java primitive datatypes, and instead uses Hadoop types like IntWritable, etc.?
Answer: Because in a big-data world, structured objects need to be serialized to a byte stream for moving over the network or persisting to disk on the cluster, and then deserialized back again as needed. When you have vast amounts of data to store and move, the serialized form needs to be efficient, taking as little space to store and time to move as possible.
Hadoop’s Writable types handle objects this way. For example, Hadoop uses Text instead of Java’s String. The Text class in Hadoop is similar to a Java String, but Text implements the Comparable, Writable, and WritableComparable interfaces.
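To see why compact serialization matters, here is a plain-Java sketch (no Hadoop dependency; class and method names are hypothetical) of what IntWritable effectively does: write an int as four big-endian bytes with DataOutput and read it back with DataInput.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class WritableSketch {
    // Serialize an int the way IntWritable does: a fixed 4-byte big-endian encoding.
    static byte[] serializeInt(int value) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeInt(value);
        return bytes.toByteArray();
    }

    // Deserialize the 4 bytes back into an int.
    static int deserializeInt(byte[] data) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        return in.readInt();
    }

    public static void main(String[] args) throws IOException {
        byte[] raw = serializeInt(42);
        System.out.println(raw.length);          // 4 bytes on the wire
        System.out.println(deserializeInt(raw)); // 42
    }
}
```

A compact, fixed-size binary encoding like this is cheaper to shuffle across the network than a general-purpose serialized Java object.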
Question: What is YARN?
Answer: YARN (Yet Another Resource Negotiator) is a component added in Hadoop 2.0. The fundamental idea of YARN is to split up the functionalities of resource management and job scheduling/monitoring into separate daemons: a global ResourceManager (RM) and a per-application ApplicationMaster (AM). An application is either a single job or a DAG of jobs.
Question: Explain the YARN components.
Answer: The ResourceManager and the NodeManager form the data-computation framework. The ResourceManager is the ultimate authority that arbitrates resources among all the applications in the system. The NodeManager is the per-machine framework agent that is responsible for containers, monitoring their resource usage (CPU, memory, disk, network), and reporting the same to the ResourceManager/Scheduler.
The per-application ApplicationMaster is, in effect, a framework specific library and is tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the tasks.
The ResourceManager has two main components:
Scheduler and ApplicationsManager.
The Scheduler is responsible for allocating resources to the various running applications, subject to familiar constraints of capacities, queues, etc. The Scheduler is a pure scheduler in the sense that it performs no monitoring or tracking of application status, and it offers no guarantees about restarting failed tasks, whether the failure is due to the application or to hardware. The Scheduler performs its scheduling function based on the resource requirements of the applications; it does so using the abstract notion of a resource Container, which incorporates elements such as memory, CPU, disk, and network.
The ApplicationsManager is responsible for accepting job submissions, negotiating the first container for executing the application-specific ApplicationMaster, and providing the service for restarting the ApplicationMaster container on failure. The per-application ApplicationMaster has the responsibility of negotiating appropriate resource containers from the Scheduler, tracking their status, and monitoring progress.
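The capacity and queue constraints the Scheduler enforces are defined by the administrator. As an illustrative sketch for the CapacityScheduler (queue names and percentages below are placeholders, not recommended values):

```xml
<!-- capacity-scheduler.xml: an illustrative two-queue split under the root queue. -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>default,analytics</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.default.capacity</name>
  <value>70</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.analytics.capacity</name>
  <value>30</value>
</property>
```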