What is HDFS?
HDFS (Hadoop Distributed File System) is a file system designed to store very large files with streaming data access patterns on commodity hardware.
Normal File System vs. HDFS
Block Size: The block size is the minimum amount of data that is read from or written to disk. A disk block is normally 512 bytes, but the default HDFS block size is 64 MB, so HDFS reads and writes data in much larger units.
Metadata Storage: A normal file system maintains its metadata hierarchically, up to the root of the file system, but HDFS does not keep its metadata that way. All of the metadata resides on the name node (the master node) of the cluster.
Replication: A Hadoop cluster is fault tolerant, i.e. the data stays safe even if one or more data nodes become corrupt or stop responding. Hadoop replicates the data on different nodes (the framework takes care of this). If data cannot be retrieved from one data node, it is retrieved from another data node that holds a replica.
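The block size and replication points above can be made concrete with a little arithmetic. The following is a minimal Python sketch, not Hadoop code: the function names and the `fetch` callable are hypothetical and exist only for illustration, and the constants are the defaults quoted in this article.

```python
BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB, the default block size quoted above
REPLICATION = 3                 # default replication factor


def block_count(file_size: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of blocks needed to hold file_size bytes (integer ceiling)."""
    return (file_size + block_size - 1) // block_size


def raw_storage(file_size: int, replication: int = REPLICATION) -> int:
    """Bytes used across the cluster once every block is replicated."""
    return file_size * replication


def read_block(replicas, fetch):
    """Try each replica in turn and return data from the first node that answers.

    `fetch` is a hypothetical callable that raises ConnectionError when a
    data node is down or does not respond.
    """
    for node in replicas:
        try:
            return fetch(node)
        except ConnectionError:
            continue  # this node failed, so try the next replica
    raise IOError("no replica of the block could be read")
```

Under these assumptions a 1 GB file (1024 MB) occupies 16 blocks and about 3 GB of raw disk across the cluster, and a read succeeds as long as at least one of the nodes holding a replica responds.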
Overview of HDFS
NameNode: The NameNode is the heart of an HDFS file system. It maintains the metadata and the file tree of all files, and it knows the data nodes on which each block (split) of a file is stored across the cluster.
The Name Node is a single point of failure for HDFS, i.e. if the Name Node is down, no files can be accessed (the file system goes offline). The Secondary Name Node helps to recover from this.
Secondary Name Node: The Secondary Name Node also holds a copy of the namespace image and the edit logs, like the Name Node. It takes a backup (checkpoint) at a regular interval (one hour by default).
The Secondary Name Node is optional and can be hosted on a separate machine. If the Name Node goes down or becomes corrupt at any point, the Secondary Name Node's image can be copied to the Name Node when it comes back up.
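To illustrate what a periodic backup of the namespace involves, here is a minimal Python sketch of folding an edit log into a namespace image to produce a fresh checkpoint. It is a simplified model, not the actual on-disk format: the dictionary layout, the tuple-based edit records and the `apply_edits` name are all assumptions made for this example.

```python
def apply_edits(image: dict, edits: list) -> dict:
    """Return a new namespace image with the logged edits applied in order.

    `image` maps a file path to its list of block ids (hypothetical layout).
    Each edit is a tuple: ("create", path, blocks), ("delete", path)
    or ("rename", old_path, new_path).
    """
    merged = dict(image)  # leave the previous image untouched
    for op, path, *rest in edits:
        if op == "create":
            merged[path] = rest[0]
        elif op == "delete":
            merged.pop(path, None)
        elif op == "rename":
            merged[rest[0]] = merged.pop(path)
        else:
            raise ValueError(f"unknown edit: {op}")
    return merged
```

For example, starting from an image that contains only /user/a.txt, replaying a log that creates /user/b.txt, renames /user/a.txt to /user/c.txt and then deletes /user/b.txt yields an image that contains only /user/c.txt. Copying that merged image back is what lets a restarted Name Node resume from a recent state.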
DataNode: The data node is where the actual data is stored. It stores and retrieves data whenever the name node asks.
Each data node reports to the Name Node periodically with the list of blocks it is storing. If the Name Node does not receive a heartbeat from a data node, it treats that node as dead and does not forward any new I/O requests to it.
The DataNode invokes the sendHeartbeat RPC once every 3 seconds. The DataNode invokes the send Block Modifications RPC once every 10 heartbeats.
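The heartbeat bookkeeping described above can be sketched in a few lines of Python. This is only a model of the idea, not the Name Node's implementation: the class and function names are hypothetical, and the `DEAD_AFTER` timeout is an illustrative value, since the text does not say how long the Name Node waits before declaring a node dead.

```python
HEARTBEAT_INTERVAL = 3.0   # seconds between heartbeats (from the text)
BLOCK_REPORT_EVERY = 10    # heartbeats between block reports (from the text)
DEAD_AFTER = 630.0         # assumed timeout in seconds, for illustration only


class HeartbeatMonitor:
    """Name-node-side record of when each data node was last heard from."""

    def __init__(self, dead_after: float = DEAD_AFTER):
        self.dead_after = dead_after
        self.last_seen = {}

    def heartbeat(self, node: str, now: float) -> None:
        """Record a heartbeat from `node` at time `now` (in seconds)."""
        self.last_seen[node] = now

    def live_nodes(self, now: float) -> list:
        """Nodes that are still eligible for new I/O requests at time `now`."""
        return sorted(node for node, seen in self.last_seen.items()
                      if now - seen <= self.dead_after)


def sends_block_report(heartbeat_number: int) -> bool:
    """True on every 10th heartbeat, counting heartbeats from 1."""
    return heartbeat_number % BLOCK_REPORT_EVERY == 0
```

A node that keeps sending heartbeats stays in `live_nodes`; one that falls silent for longer than the timeout drops out of the list and receives no new requests until it reports again.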
Best practices for scaling with your data
• Accessible: Hadoop runs on large clusters of commodity machines or in the cloud.
• Robust: Hadoop is architected with the assumption of frequent hardware malfunctions. It can gracefully handle most such failures.
• Scalable: Hadoop scales linearly to handle larger data by adding more nodes to the cluster at any time.
• Simple: Hadoop allows users to quickly write efficient code.
• Data Locality: Move computation to the data.
• Replication: Use replication across servers to deal with unreliable storage/servers (default replication 3).
Adoption
• Business Drivers: The bigger the data, the higher the value
• Financial Drivers: Cost advantage of open source + commodity H/W
• Low cost per TB
• Technical Drivers: Existing systems failing under growing requirements (the 3 Vs: volume, velocity, variety)
Hadoop configuration is driven by two types of important configuration files.
• Read-only default configuration – src/core/core-default.xml, src/hdfs/hdfs-default.xml and src/mapred/mapred-default.xml.
• Site-specific configuration – conf/core-site.xml, conf/hdfs-site.xml and conf/mapred-site.xml.
• To learn more about how the Hadoop framework is controlled by these configuration files, see the Hadoop documentation.
• Additionally, you can control the Hadoop scripts found in the bin/ directory of the distribution, by setting site-specific values via the conf/hadoop-env.sh.
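The relationship between the two kinds of files is an overlay: the defaults are loaded first and any site-specific value replaces the default with the same name. Below is a minimal Python sketch of that idea. The helper names are hypothetical, and the two XML strings are small stand-ins for hdfs-default.xml and conf/hdfs-site.xml with illustrative values rather than the contents of the real files.

```python
import xml.etree.ElementTree as ET


def parse_properties(xml_text: str) -> dict:
    """Read <property><name>/<value> pairs from a Hadoop-style configuration."""
    root = ET.fromstring(xml_text)
    return {prop.findtext("name").strip(): (prop.findtext("value") or "").strip()
            for prop in root.iter("property")}


def effective_config(default_xml: str, site_xml: str) -> dict:
    """Start from the defaults, then let site-specific values override them."""
    config = parse_properties(default_xml)
    config.update(parse_properties(site_xml))
    return config


# Stand-ins for hdfs-default.xml and conf/hdfs-site.xml (illustrative values).
HDFS_DEFAULT = """<configuration>
  <property><name>dfs.replication</name><value>3</value></property>
  <property><name>dfs.block.size</name><value>67108864</value></property>
</configuration>"""

HDFS_SITE = """<configuration>
  <property><name>dfs.replication</name><value>2</value></property>
</configuration>"""
```

Calling `effective_config(HDFS_DEFAULT, HDFS_SITE)` gives a replication of 2 (from the site file) while the block size keeps its default of 67108864 bytes (64 MB), because the site file does not mention it.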
To configure the Hadoop cluster you will need to configure the environment in which the Hadoop daemons execute as well as the configuration parameters for the Hadoop daemons.
The Hadoop daemons are NameNode/DataNode and JobTracker/TaskTracker.
Configuring the Environment of the Hadoop Daemons
Administrators should use the conf/hadoop-env.sh script to do site-specific customization of the Hadoop daemons’ process environment.
At the very least you should specify the JAVA_HOME so that it is correctly defined on each remote node.
In most cases you should also specify HADOOP_PID_DIR to point to a directory that can only be written to by the users that are going to run the Hadoop daemons. Otherwise there is the potential for a symlink attack.
Administrators can configure individual daemons using the configuration options HADOOP_*_OPTS.
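The HADOOP_PID_DIR advice can be checked mechanically before the daemons are started. The following Python sketch is one way to do that. It is a hypothetical helper and not part of Hadoop, and it only inspects the permission bits of the directory itself, assuming a POSIX file system.

```python
import os
import stat


def pid_dir_is_safe(path: str) -> bool:
    """True if `path` is a real directory that group and others cannot write to."""
    info = os.lstat(path)  # lstat does not follow a symlink at `path` itself
    if not stat.S_ISDIR(info.st_mode):
        return False       # a symlink or a regular file is rejected
    return not (info.st_mode & (stat.S_IWGRP | stat.S_IWOTH))
```

A directory created with mode 0700 or 0755 passes this check, while a world-writable directory such as one with mode 0777 fails it. A fuller check would also confirm that the owner is the user that runs the daemons.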
Interaction with HDFS
Some common commands for interacting with HDFS:
• Listing files: hdfs dfs -ls
• Insert data into the cluster (3 steps):
• Step 1: Create a directory in HDFS
bin/hadoop dfs -mkdir /user/username (on a multi-node cluster)
bin/hadoop dfs -mkdir / (on a single node)
• Step 2: Put file to cluster
bin/hadoop dfs -put /home/username/DataSet.txt /user/username/
• Step 3: Verify the file is in HDFS
bin/hadoop dfs -ls /user/username
• Uploading multiple files at once (specify directory to upload):
bin/hadoop dfs -put /myfiles /user/username
• Note: Another synonym for -put is -copyFromLocal.
• Display Files in HDFS:
bin/hadoop dfs -cat file (displays the file in the console)
• Copy File from HDFS:
bin/hadoop dfs -get file localfile
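When these commands are driven from a script, it helps to build the argument lists in one place. The Python sketch below assembles the same `bin/hadoop dfs` invocations shown above. The function names are hypothetical, and the script only prints the commands, so it can be tried without a running cluster.

```python
import shlex

HADOOP = "bin/hadoop"  # launcher path used in the commands above


def dfs_command(action: str, *args: str) -> list:
    """Build the argument list for `bin/hadoop dfs -<action> <args...>`."""
    return [HADOOP, "dfs", f"-{action}", *args]


def upload_steps(local_file: str, hdfs_dir: str) -> list:
    """The three steps above: create the directory, put the file, list it."""
    return [
        dfs_command("mkdir", hdfs_dir),
        dfs_command("put", local_file, hdfs_dir),
        dfs_command("ls", hdfs_dir),
    ]


if __name__ == "__main__":
    for argv in upload_steps("/home/username/DataSet.txt", "/user/username"):
        # Printed only; pass argv to subprocess.run(argv, check=True) to execute.
        print(shlex.join(argv))
```

Keeping each command as a list rather than a single string avoids quoting problems when a path contains spaces.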