Executing a Map-Reduce Application on Hadoop using Eclipse

September 10, 2009 at 5:08 am 2 comments

Those who know, don’t speak

And, Those who speak, don’t know

Chinese Spiritual Leader, Laos

Considering above quote from laos, I am speaking (writing), that  means, I dont know. Acutally he made the above proverb for telling about truth, i.e. truth cannot be told, and those make an effort to tell truth, they probably dont know anything about truth.

In the previous article Installing Hadoop, we configured Haddon on windows system. However, we still dont know, if it is working properly. In this article, we will first test, is Hadoop working, executing map-reduce program (Hadoop provides some examples, we  can execute them and test Hadop is proeprty configured or not.), finally we will write a map-reduce application to read a RDF file.

Is Hadoop Working?

To test, if hadoop is working property, i.e. File System is up. We will create some folders on HDFS.

I assume that you have hadoop installed on windows, and presently you are in the hadoop-0.19.2 directory. Also, namenode, secondarynamenode, jobtracker, datanode, tasktracker are up.

Type the following command to create a directory on HDFS

bin/hadoop fs -mkdir In

This will create a directory “In” in HDFS

Now, copy test files into directory “In”, for copying, issue the following command

bin/hadoop fs -put *.txt In

This will copy all the text files inside hadoop-0.19.2 direcoty to “In” directory HDFS

To test, if the files are indeed copies, issue the following command

bin/hadoop fs -ls In

This will display, all the files inside direcotry “In”.

If, the above command dont generate any errors, this means, file system is configured properly, otherwise, you need to format filesystem again. For formatting refer to previous article in this series. Sometimes, it happened that after formatting file system is not working prooperly, this was probably because file system stores an image of data somewhere inside hadoop directory f older, formatting can sometimes lead to inconsistencies. There should be some solution but I could not work it out. In this scenario, I reinstalled the Hadoop.

Now, that hadoop is installed and working properly, install Eclipse.

Open WindowsExplore, go to Hadop-0.19.2 folder, then go to contrib -> eclipse-plugin folder. There is a jar file, copy this file and paste in Eclipse plug-ins folder. You are now, all set to execute map-reduce jobs on Eclipse.

Open Eclipse.

Laungh Map-Reduce persepective - Eclipse by default works in Java Application perspective, you need to change it to Map-Reduce persepective. For this, click on menu-item Windows->Open Perspective->other…, select map-reduce from the available list.

At the bottom of your screen, there is a Map-Reduce location tab, sandwiched inbetween the console tab and javadoc tab. Right click on the blank space, and select New Hadoop Location.. from the context menu. Set the following values

Locatoin Name: localhost

Map/Reduce Master

Host:localhost

Port:9001

DFS Master

Host:localhost

Port:9000

User Name: User

dont change user with user name, keep it as “User”.

After you close the hadoop location setting box, you can see DFS Locations on the left in the project explorer tree. This maps to file system. Here you can see the folder In created by you and the .txt files stored inside In. If this works without any error, Eclipse is configure to use with Hadoop.

Writng map-reuce application

The task at hand requires to read a processed rdf file, in which each line consist of 3 URIs: subject, predicate, object.

We have to create n files, one for each unique subject URI. That means, if there are 2 subject URIs that have different predicate and object, they will go to same file. Name of file will be hashcoded subject URI.

CLick menu item File->New ->Other, from the list select map-reduce project. Enter the project name and click Finish. This will create a Map-Reduce project.  I choose project name as IndividualClassExampe. This will now appear on the left in the Project Explorer.

Next, right click on IndividualClassExample project in the Project Explorer, from the context menu select new->other, and click on Map-Reduce driver from the list.

A dialog box will arrive, write the package name as bike.hadoop and class name as Experiment.java.

THis will open Experiment. java file, copy and past the code below inside Experiment.java

package bike.hadoop;
import org.apache.hadoop.fs.Path;
//import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
//import org.apache.hadoop.mapred.Mapper;
//import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.TextInputFormat;
//import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.fs.*;
public class Experiment {
public static void main(String[] args) throws Exception {
JobConf conf = new JobConf(Experiment.class);
conf.setJobName(“SplitFile”);
conf.setMapOutputKeyClass(Text.class);
conf.setMapOutputValueClass(Duo.class);
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(Text.class);
conf.setMapperClass(MyMap.class);
//conf.setCombinerClass(Reduce.class);
conf.setReducerClass(MyReduce.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(KeyBasedMultipleOutputFormat.class);
FileSystem fs = FileSystem.get(conf);
if (fs.exists(new Path(“Outer”)))
fs.delete(new Path(“Outer”),true);
FileInputFormat.setInputPaths(conf, new Path(“In”));
FileOutputFormat.setOutputPath(conf, new Path(“Outer”));
JobClient.runJob(conf);
}
}

package bike.hadoop;

import org.apache.hadoop.fs.Path;

//import org.apache.hadoop.io.IntWritable;

import org.apache.hadoop.io.Text;

import org.apache.hadoop.mapred.FileInputFormat;

import org.apache.hadoop.mapred.FileOutputFormat;

import org.apache.hadoop.mapred.JobClient;

import org.apache.hadoop.mapred.JobConf;

//import org.apache.hadoop.mapred.Mapper;

//import org.apache.hadoop.mapred.Reducer;

import org.apache.hadoop.mapred.TextInputFormat;

//import org.apache.hadoop.mapred.TextOutputFormat;

import org.apache.hadoop.fs.*;

public class Experiment {

public static void main(String[] args) throws Exception {

JobConf conf = new JobConf(Experiment.class);

conf.setJobName(“SplitFile”);

conf.setMapOutputKeyClass(Text.class);

conf.setMapOutputValueClass(Duo.class);

conf.setOutputKeyClass(Text.class);

conf.setOutputValueClass(Text.class);

conf.setMapperClass(MyMap.class);

//conf.setCombinerClass(Reduce.class);

conf.setReducerClass(MyReduce.class);

conf.setInputFormat(TextInputFormat.class);

conf.setOutputFormat(KeyBasedMultipleOutputFormat.class);

FileSystem fs = FileSystem.get(conf);

if (fs.exists(new Path(“Outer”)))

fs.delete(new Path(“Outer”),true);

FileInputFormat.setInputPaths(conf, new Path(“In”));

FileOutputFormat.setOutputPath(conf, new Path(“Outer”));

JobClient.runJob(conf);

}

}

This code is the main starting point, where we specify the name of class file, input format, output format, mapper class name, reduce class name, MapOutputKeyClass name , MapOutputValueClass name, keyclass name, keyvalueclassname.
Create another class MyMap.java, copy and paste the code below.
package bike.hadoop;
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
public class MyMap extends MapReduceBase implements Mapper<LongWritable, Text, Text, Duo> {
public void map(LongWritable key, Text value, OutputCollector<Text, Duo> o, Reporter reporter) throws IOException {
String property=new String();
String object = new String();
String line = value.toString();
System.out.println(line);
StringTokenizer st = new StringTokenizer(line);
System.out.println(“Number of Tokens” + st.countTokens());
Text subject = new Text(st.nextToken());
property = st.nextToken();
object = st.nextToken();
Duo d = new Duo(property,object);
o.collect(subject,d);
}
}
Create anothe class MyReduce.java, copy and paste the code below
package bike.hadoop;
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
public class MyReduce extends MapReduceBase implements Reducer<Text, Duo, Text, Text> {
public void reduce(Text key, Iterator<Duo> values, OutputCollector<Text, Text> o, Reporter reporter) throws IOException {
String finalString=new String();
Duo d = new Duo();
while (values.hasNext()) {
d = values.next();
finalString = d.getProperty()+ ” ” + d.getObject();
o.collect(key, new Text(finalString));
}
}
}
We have a java class for custom Input Format, because we are passing an object of type Duo as input value from map to reduce. Therefore, we specify the format of custom InputFormat. The name of the class is Duo. java. Create another class Duo.java, copy and paste the code below.
package bike.hadoop;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;
public class Duo implements Writable {
private String property;
private String object;
public Duo() {
set(new String(” “), new String(” “));
}
public Duo(String p, String o) {
set(p, o);
}
public void set(String p, String o) {
this.property = p;
this.object = o;
}
public void setProperty(String p){
this.property = p;
}
public void setObject(String o){
this.object = o;
}
public String getProperty() {
return property;
}
public String getObject() {
return object;
}
@Override
public void write(DataOutput out) throws IOException {
out.writeUTF(property);
out.writeUTF(object);
}
@Override
public void readFields(DataInput in) throws IOException {
property = new String(in.readUTF());
object = new String(in.readUTF());
}
/*
@Override
public int hashCode() {
return first.hashCode() * 163 + second.hashCode();
}
@Override
public boolean equals(Object o) {
if (o instanceof Duo) {
Duo tp = (Duo) o;
return first.equals(tp.first) && second.equals(tp.second);
}
return false;
}
*/
@Override
public String toString() {
return property + “\t” + object;
}
/*
@Override
public int compareTo(Duo tp) {
int cmp = first.compareTo(tp.first);
if (cmp != 0) {
return cmp;
}
return second.compareTo(tp.second);
}
*/
}
SInce, we want reducer to create a new file for each unique subject URI, i.e. we have a custom Output Format, therefore we must specify a class for custom Output Format, which is keyBasedMultipleOutputFormat.java. Create a new class KeyBasedMultipleOutptuFormat.java, copy and paste the code below
package bike.hadoop;
import org.apache.hadoop.mapred.lib.*;
import org.apache.hadoop.io.*;
public class KeyBasedMultipleOutputFormat extends MultipleTextOutputFormat<Text,Text>{
protected String generateFileNameForKeyValue(Text key, Text value, String name) {
String fileName = “fileName” + key.hashCode();
return fileName;
}
}
we are all set now, to execute the code. But before that, we must create an input file. Create a text file, sop.txt which should be in s o p format. My sop.txt file looks like this
Harshit Kumar Arora
Harshit Rustagi Arora
Harshit Rik Kumar
Gangis khan coming
Gangis khan killer
Gangis khan leaving
This is for testing, actual file should have URIs for subject, predicate and object.
copy to the file to hadoop-0.19.2 folder, open cygwin, change directory to hadoop-0.19.2. Type the followign command to file copy spo.txt to HDFS
bin/hadoop fs -put spo.txt In
Remove other text files that we copied earlier. To remvoe files, type the following command
bin/hadoop fs -rm In/<name of text file>.txt
Now we have a folder In on HDFS, the folder In contains one file spo.txt.
There are 2 ways to execute map-reduce program
1. execute from eclipse – I dont prefer this method
2. make a jar file, copy to hadoop-0.19.2, and execute jar file – preffered approach
To execute from eclipse, just click the run button, and this will create an output folder Outer in HDFS. You can check this by clicking on the DFS locations in Project Explorer.
Alternatively, you can create jar file in Eclipse – CLick on FIle->Export->jar, then enter the name of jar file, I choose jar file name as experiment.jar, just click finish. This will generate a experiment.jar file.
Copy experiment.jar file on to hadoop-0.19.2.
Type the following command to execute jar file
bin/hadoop jar experimet.jar bike.hadoop.Experiment
This will create a Outer folder in HDFS, which you can view using DFS Locations in Project Explorer.
Feel free to comment, and suggest improvements. I know there are many, been rushing thru it.  However, if you want to collaborate, suggest some text, which I will incorporate in it. It will be more useful for reader. I might myself improve it later.

Entry filed under: Uncategorized. Tags: .

Cloud Computing – hadoop An Interesting protocol to communicate train delays using missed calls :)

2 Comments Add your own

  • 1. Narayanan  |  February 19, 2010 at 4:44 pm

    Hi,
    I have an RDF file in XML format where each entry is in b/w rdf:description and /rdf:description.
    How do I perform map-reduce on such RDF files to extract resources, properties and values from entries based on certain keywords?
    Also, are there methods to convert RDF from XML to triples format?

    Reply
    • 2. Harshit Kumar  |  February 22, 2010 at 6:50 am

      You need to make sure that file is not splittable while being processed by mapper.

      You can convert RDF to triple format using JENA API.

      Hope this helps

      Reply

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Connecting to %s

Trackback this post  |  Subscribe to the comments via RSS Feed


Recent Posts

Blog Stats

  • 9,462 hits

Top Clicks

  • None

Follow

Get every new post delivered to your Inbox.