Pig was initially developed at Yahoo!. Apache Pig is one of the components of Hadoop is used for processing and analysing large data sets. It runs on top of Hadoop Distributed File System (HDFS).
Why we use Pig :
- Spend less time to write mapper and reducer programs, fewer lines of code. Therefore less maintain cost.
- Pig Programming is known as PigLatin. PigLatin is similar to SQL Language.Therefore, it is easy to learn if you are familiar with SQL.
- We can use UDF’s (user defined function) to achieve particular processing using Java and other embedded programs.
How to use Pig :
1. First to “LOAD” data which you want to process from HDFS.
To Load data give the directory path where your data is stored in HDFS.
Also, specify the format of data like its tab separated, comma separated. Example: “USING PigStorage(‘,’)”.
2. Then, you can apply various functions to GROUP the data, Aggregate data by using SUM, Sort your data using ORDER.
3. At Last, you can print your data by using DUMP to print data on the screen. Also, we can store the data in HDFS using STORE for further analysing.
Let’s start with the example of Word Count :
WordCount example reads text files and counts how often words occur. The input is text files and the output is text files, each line of which contains a word and the count of how often it occurred.
Each mapper takes a line as input and breaks it into words. It then emits a key/value pair of the word and 1. Each reducer sums the counts for each word and emits a single key/value with the word and sum.
- Prepare data : Just take any data any format. I have taken poem just for sample .It will work for any size of data from few KB to Terabytes/Petabytes- depends on the configuration of your machine.
2. Load data in Pig, specify Directory where data is stored. Go to terminal open your Pig Shell,type “pig”
“grunt> ” will open and we can start with data processing.Define alias like in below example : “A” and assign the command.
3. (i) Transform your data. FOREACH- As word suggests foreach word associated with alias “A” transform it. TOKENIZE – it works similar to JAVA tokenize , split a string of words. FLATTEN used un-nests tuples as well as bags. $0 – $ used to tell the position of columns or we can specify columns at the first place while loading data.
4. (ii) Transform your data. GROUP operator groups together tuples that have the same group key . Example the words which are repeated will be grouped together.
5.(iii) Transform your data. GROUP creates Bag which contains same words, now to count the same word in a group we use COUNT. It counts the same key in a Bag.
Last Step : DUMP your data – print your data on the screen. The output of number of times the word occurs in the data.
For info on PigLAtin Functions and Operator , you can refer PigLatin functions.