Why is Hadoop required?
Every day, a huge volume of unstructured data is dumped into our machines. The major challenge for organizations is not storing these large data sets but retrieving and analysing them, especially when the data sits on different machines in different locations. This is where Hadoop comes in: it can analyse data spread across many machines quickly and very cost-effectively. It uses MapReduce, which splits a query into small parts and processes them in parallel. This is also known as parallel computing.
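The split-and-process idea behind MapReduce can be sketched in a few lines. This is an illustrative simulation, not Hadoop's actual Java API: each chunk's map work could run on a different machine, the shuffle groups pairs by key, and the reduce step aggregates each group.

```python
from collections import defaultdict
from itertools import chain

def map_phase(chunk):
    # Map: emit a (word, 1) pair for every word in this chunk of text.
    return [(word, 1) for word in chunk.split()]

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

# Each chunk could live on, and be mapped by, a different machine in parallel.
chunks = ["big data big", "data big hadoop"]
pairs = list(chain.from_iterable(map_phase(c) for c in chunks))
counts = reduce_phase(shuffle(pairs))
print(counts)  # {'big': 3, 'data': 2, 'hadoop': 1}
```

Because each map call only needs its own chunk, the framework can schedule the calls on whichever nodes hold the data.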
Although Hadoop can store petabytes of data, that does not make it a database. The Hadoop framework can store and process large amounts of data.
RDBMS vs Hadoop
• SQL is designed for structured data; Hadoop also handles semi-structured and unstructured data
• RDBMS scales up and needs database-class servers; Hadoop scales out on commodity machines
• Hadoop is economical: a machine with four times the power of a standard PC costs a lot more than putting four such PCs in a cluster
• Key-value pairs instead of relational tables
• Functional programming (MapReduce) instead of SQL
• Offline batch processing instead of online transaction processing
|Hadoop|RDBMS|
|---|---|
|Node-based, flat file structure|Relational database|
|Analytical and big data processing|OLTP processing|
|Structured, semi-structured, and unstructured data|Structured, schema-based data|
|Built-in fault tolerance|Fault tolerance must be configured|
|Petabytes of data and beyond|Terabytes of data|
|Write once, read many|Frequent reads and writes|
History of Hadoop
Hadoop is an open-source software framework, licensed under the Apache v2 license, that supports data-intensive distributed applications.
The seeds of Hadoop were first planted in 2002, when Doug Cutting and Mike Cafarella started working on a project known as Nutch. Nutch wasn't easy to operate: it ran across only a handful of machines, and someone had to watch it around the clock to make sure it didn't fall over.
In October 2003 Google released the Google File System paper, and in December 2004 it released the MapReduce paper.
In December 2004 Doug Cutting added MapReduce support to Nutch.
In 2006, Cutting went to work with Yahoo, which was equally impressed by the Google File System and MapReduce papers and wanted to build open source technologies based on them.
In 2008, Yahoo ran a 4,000-node Hadoop cluster, and Hadoop won the terabyte sort benchmark.
In 2009, Facebook launched SQL support for Hadoop.
Almost everywhere you go online now, Hadoop is there in some capacity: Facebook, eBay, Etsy, Yelp, Twitter, Salesforce.com.
Current Common Challenges
• Understanding data in its different forms
• Storing and analysing data at scale
• Data transfer rates between nodes
• Hardware cost and hardware failures
• Fault tolerance
• Analysis speed and data access patterns
Best practices for scaling with your data
• Accessible: Hadoop runs on large clusters of commodity machines or on cloud services.
• Robust: Hadoop is architected on the assumption that hardware failures are frequent; it gracefully handles most such failures.
• Scalable: Hadoop scales linearly to handle larger data; nodes can be added to the cluster at any time.
• Simple: Hadoop lets users quickly write efficient code.
• Data Locality: Move Computation to the Data.
• Replication: data is replicated across servers to cope with unreliable storage and servers (the default replication factor is 3).
• Business drivers: the bigger the data, the higher the value.
• Financial drivers: the cost advantage of open source plus commodity hardware.
• Low cost per TB.
• Technical drivers: existing systems failing under growing requirements.
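The replication point above can be made concrete with a small sketch. This is an illustrative simulation of HDFS-style block replication, not the real HDFS placement policy or API (function and block names here are made up): each block is stored on `replication` distinct nodes, so losing any one node loses no data.

```python
def place_replicas(blocks, nodes, replication=3):
    # Round-robin placement across distinct nodes (real HDFS is rack-aware).
    placement = {}
    for i, block in enumerate(blocks):
        placement[block] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement

nodes = ["node1", "node2", "node3", "node4"]
placement = place_replicas(["blk_0", "blk_1"], nodes)
print(placement)
# blk_0 -> node1, node2, node3; blk_1 -> node2, node3, node4

# Even if one node fails, every block still has two live replicas.
failed = "node2"
survivors = {b: [n for n in ns if n != failed] for b, ns in placement.items()}
assert all(len(ns) >= 2 for ns in survivors.values())
```

The default factor of 3 trades 3x raw storage cost for the ability to survive node failures without losing data, which is exactly the assumption behind running on cheap commodity hardware.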
Use case of Hadoop
Big data combined with social data is influencing election decisions. A collection of papers published by Oxford Journals (ref: Oxford Journal) in political analysis uses innovative technologies for analysing polls; these published papers show how big data tools help in political research.
The 2008 global financial crisis affected jobs, incomes, and the standing of financial companies. Hadoop helps avoid these kinds of risks through:
• Quick fraud detection
• Customer Segmentation analysis
• Customer Sentiment analysis
• Risk aggregation
Healthcare researchers correlate air quality data with asthma admissions, and life sciences companies use genomic and proteomic data to speed drug development.
The Hadoop data processing and storage platform opens up entire new research domains for discovery.
Hadoop Eco Systems
• HDFS: A distributed file system that runs on large clusters of commodity hardware.
• MapReduce: A framework for distributed processing of large data sets on commodity hardware.
• Hive: A distributed data warehouse. Data is stored in HDFS, and Hive provides an SQL-like query language called HQL (the framework translates queries into MapReduce jobs internally).
• Pig: A high-level data-flow language for large datasets; users can work with it easily, with or without programming knowledge.
• HBase: A distributed, column-oriented database. It uses HDFS as its storage layer and supports both random reads and batch-style computations.
• Sqoop: Imports and exports data between RDBMS databases and HDFS.
• Flume: A distributed, reliable service for efficiently collecting, aggregating, and moving large amounts of log data into HDFS.
• Oozie: A service for running and scheduling workflows of Hadoop jobs (any job can be scheduled: MapReduce, Hive, Pig, Sqoop).
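Hive's translation of queries into MapReduce jobs can be illustrated conceptually. This sketch is not Hive or HQL; the table, column names, and rows are invented for the example. It shows how a query like `SELECT dept, COUNT(*) FROM employees GROUP BY dept` decomposes into a map phase (emit the grouping key) and a reduce phase (apply the aggregate per key).

```python
from collections import defaultdict

# Hypothetical input rows standing in for a Hive table stored in HDFS.
employees = [
    {"name": "Ana", "dept": "sales"},
    {"name": "Bo", "dept": "sales"},
    {"name": "Cy", "dept": "hr"},
]

def map_group_by(row):
    # Map: emit (grouping key, 1), one pair per input row.
    return (row["dept"], 1)

def reduce_count(pairs):
    # Shuffle + reduce: group by key and apply COUNT(*).
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

result = reduce_count(map_group_by(row) for row in employees)
print(result)  # {'sales': 2, 'hr': 1}
```

Other aggregates (SUM, AVG, MAX) differ only in the reduce function, which is why a query engine can mechanically compile them to the same map/shuffle/reduce skeleton.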