The previous tutorial in this series, Tutorial 5, covered Hadoop MapReduce.
In this tutorial, we write our first Hadoop MapReduce program in order to understand how it works in detail. Like any other computer program, a MapReduce job needs input data, which we provide here in the form of a spreadsheet. The input spreadsheet [ItemsSalesData.csv] contains the following fields for each sale:
- Sales Date
- Item name
- Item price
- Payment Method
- Customer Name
- Customer Residence City
- Customer Residence Province
- Customer Country
The goal of this MapReduce program is to count the number of items sold in each customer country listed in the spreadsheet [ItemsSalesData.csv].
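To make the input layout concrete, here is a minimal sketch of how one row of the spreadsheet could be split into its eight fields. It assumes the columns appear in the order listed above and that no field contains a quoted comma. The class name SaleRecord and the sample row are hypothetical and used only for illustration.

```java
/** Hypothetical holder for one row of ItemsSalesData.csv. */
class SaleRecord {
    // Assumption: the CSV columns follow the order of the field list above,
    // so Customer Country is the 8th field (zero-based index 7).
    static final int COUNTRY = 7;
    static final int FIELD_COUNT = 8;

    private final String[] fields;

    private SaleRecord(String[] fields) {
        this.fields = fields;
    }

    /** Splits one comma-separated line; rejects rows without exactly 8 fields. */
    static SaleRecord parse(String line) {
        // The -1 limit keeps trailing empty fields instead of dropping them.
        String[] parts = line.split(",", -1);
        if (parts.length != FIELD_COUNT) {
            throw new IllegalArgumentException(
                    "Expected " + FIELD_COUNT + " fields but found " + parts.length);
        }
        return new SaleRecord(parts);
    }

    String country() {
        return fields[COUNTRY].trim();
    }

    public static void main(String[] args) {
        // Made-up sample row, not taken from the real spreadsheet.
        SaleRecord record =
                parse("2009-01-02,Item1,1200,Visa,Jane,Toronto,Ontario,Canada");
        System.out.println(record.country());
    }
}
```

Only the country field matters for the count, which is why the sketch exposes just that one accessor.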
Step 1: First, make sure that Hadoop is installed on your machine. Then switch to the user ‘hduser’, i.e. the ID used during Hadoop configuration. If you configured Hadoop under a different user ID, switch to that ID instead.
Step 2: Do the following in order to create a folder with required permissions.
Step 3: Write the MapReduce program as the following three Java classes, and make sure the files deployed in the above folder have read permission.
ItemsMapper.java
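package itemcountry;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

/**
 * Emits (country, 1) for every line of the sales CSV.
 *
 * @author STC
 */
public class ItemsMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);

    @Override
    public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        String valueString = value.toString();
        String[] SingleCountryData = valueString.split(",");
        // Customer Country is the 8th field of each row, i.e. index 7.
        output.collect(new Text(SingleCountryData[7]), one);
    }
}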
ItemsCountryReducer.java
package itemcountry;

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

/**
 * Sums the counts emitted by the mapper for each country.
 *
 * @author STC
 */
public class ItemsCountryReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text textKey, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        Text key = textKey;
        int frequencyForCountry = 0;
        while (values.hasNext()) {
            // Each value is a count of 1 emitted by the mapper for this country.
            IntWritable value = values.next();
            frequencyForCountry += value.get();
        }
        output.collect(key, new IntWritable(frequencyForCountry));
    }
}
ItemsCountryDriver.java
package itemcountry;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

/**
 * Configures and submits the sales-per-country job.
 *
 * @author STC
 */
public class ItemsCountryDriver {

    public static void main(String[] args) {
        JobClient myJobclient = new JobClient();

        // Create a configuration object for the job
        JobConf jobConf = new JobConf(ItemsCountryDriver.class);

        // Set a name for the job
        jobConf.setJobName("SalePerCountry");

        // Specify the data types of the output key and value
        jobConf.setOutputKeyClass(Text.class);
        jobConf.setOutputValueClass(IntWritable.class);

        // Specify the Mapper and Reducer classes
        jobConf.setMapperClass(ItemsMapper.class);
        jobConf.setReducerClass(ItemsCountryReducer.class);

        // Specify the input and output formats
        jobConf.setInputFormat(TextInputFormat.class);
        jobConf.setOutputFormat(TextOutputFormat.class);

        // Set the input and output directories:
        // args[0] = name of the input directory on HDFS
        // args[1] = name of the output directory on HDFS
        FileInputFormat.setInputPaths(jobConf, new Path(args[0]));
        FileOutputFormat.setOutputPath(jobConf, new Path(args[1]));

        myJobclient.setConf(jobConf);
        try {
            // Execute the job
            JobClient.runJob(jobConf);
        } catch (Exception exp) {
            exp.printStackTrace();
        }
    }
}
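To see what the mapper and reducer compute together, the following sketch reproduces the same logic in plain Java, without Hadoop. It is only an in-memory illustration of the map, shuffle, and reduce steps, not how Hadoop executes the job. The class name LocalCountrySalesSketch and the sample rows are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** In-memory imitation of the country count; does not use Hadoop. */
class LocalCountrySalesSketch {

    /** Map step: emit the country (index 7) of every CSV line. */
    static List<String> mapPhase(List<String> lines) {
        List<String> emittedKeys = new ArrayList<>();
        for (String line : lines) {
            String[] fields = line.split(",");
            // Each emitted key stands for the pair (country, 1).
            emittedKeys.add(fields[7]);
        }
        return emittedKeys;
    }

    /** Shuffle and reduce steps: group equal keys and sum their 1s. */
    static Map<String, Integer> reducePhase(List<String> emittedKeys) {
        // TreeMap keeps the keys sorted, similar to the sorted reducer input.
        Map<String, Integer> totals = new TreeMap<>();
        for (String country : emittedKeys) {
            totals.merge(country, 1, Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        // Made-up sample rows, not taken from the real spreadsheet.
        List<String> lines = Arrays.asList(
                "2009-01-02,Item1,1200,Visa,Jane,Toronto,Ontario,Canada",
                "2009-01-03,Item2,3600,Amex,Ravi,Pune,Maharashtra,India",
                "2009-01-04,Item1,1200,Visa,Marc,Montreal,Quebec,Canada");
        System.out.println(reducePhase(mapPhase(lines)));
    }
}
```

With these three sample rows, the sketch prints {Canada=2, India=1}.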
Step 4: Export the classpath as shown below. The three Java classes above depend on the following libraries, so the paths to these JAR files must be exported.
- hadoop-mapreduce-client-core-3.2.0.jar
- hadoop-mapreduce-client-common-3.2.0.jar
- hadoop-common-3.2.0.jar
- hadoop-mapred-0.22.jar
Step 5: Compile the above Java files using the following command. The compiled class files are placed in the package directory.
The compilation creates a directory named itemcountry (the package name specified in the Java source files) inside the current directory, HadoopMapReduceExample, and places all three compiled class files in it.
Step 6: Create a new file Manifest.txt [sudo gedit Manifest.txt] and add the following line to it. The line gives the fully qualified name of the Java main class, which in this example is itemcountry.ItemsCountryDriver. Don’t forget to press the Enter key after adding the line, so that the file ends with a newline.
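As a quick sanity check, the manifest text can be parsed with the standard java.util.jar.Manifest class to confirm which main class will be recorded. This sketch assumes the manifest uses the standard Main-Class header with the driver class from this tutorial. The class name ManifestCheckSketch is hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.jar.Attributes;
import java.util.jar.Manifest;

/** Parses manifest text and reports the main class it declares. */
class ManifestCheckSketch {

    static String mainClassOf(String manifestText) throws IOException {
        byte[] bytes = manifestText.getBytes(StandardCharsets.UTF_8);
        Manifest manifest = new Manifest(new ByteArrayInputStream(bytes));
        // Returns null when the Main-Class header is absent.
        return manifest.getMainAttributes().getValue(Attributes.Name.MAIN_CLASS);
    }

    public static void main(String[] args) throws IOException {
        // Assumed manifest content; note the newline that ends the line.
        String manifestText = "Main-Class: itemcountry.ItemsCountryDriver\n";
        System.out.println(mainClassOf(manifestText));
    }
}
```

The value must be the fully qualified class name, including the itemcountry package.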
Step 7: Create a JAR file with the help of the following command.
Step 8: Start Hadoop by executing the following commands.
Step 9: Copy the spreadsheet [ItemsSalesData.csv], which holds the item sales data by country, to the location ~/inputMapReduce. Next, use the following command to copy ~/inputMapReduce to HDFS.
Step 10: After the CSV spreadsheet [ItemsSalesData.csv] has been copied successfully, run the MapReduce job.
The job creates an output directory named mapreduce_output_item_sales on HDFS. The directory contains a file with the item sales per country. The result can be viewed from the command line as shown below.
Similarly, the result can be viewed in the Hadoop web interface by browsing to the mapreduce_output_item_sales directory.
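The driver configures TextOutputFormat, which by default writes each result as the key, a tab character, and the value on one line. If the output file is processed further, each line can be split as in the sketch below. The class name OutputLineSketch and the sample line are hypothetical, and the sketch assumes the default tab separator has not been changed.

```java
import java.util.AbstractMap;
import java.util.Map;

/** Splits one line of the job output into country and count. */
class OutputLineSketch {

    static Map.Entry<String, Integer> parse(String line) {
        // Assumption: the default tab separator of TextOutputFormat is in use.
        int tab = line.lastIndexOf('\t');
        if (tab < 0) {
            throw new IllegalArgumentException("No tab separator in: " + line);
        }
        String country = line.substring(0, tab);
        int count = Integer.parseInt(line.substring(tab + 1).trim());
        return new AbstractMap.SimpleEntry<>(country, count);
    }

    public static void main(String[] args) {
        // Made-up output line, not an actual result of the job.
        Map.Entry<String, Integer> entry = parse("Canada\t2");
        System.out.println(entry.getKey() + " sold " + entry.getValue() + " items");
    }
}
```

Splitting at the last tab keeps a country name intact even if the name itself contains a tab.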
Conclusion
In this tutorial, we covered the environment setup and the use of a Hadoop MapReduce program to extract country-wise item sales from the 8-column spreadsheet [ItemsSalesData.csv], in order to demonstrate how HDFS and MapReduce work together.
Happy Testing!!!