Posts

Showing posts from 2015

TFIDF code

TFIDF is computed as three pairs of map-reduce jobs; TF and DF can be obtained from simple map-reduce tasks. Posted here is the third map-reduce job, which produces the final TFIDF deliverables.

Custom Key

public class Key implements WritableComparable<Key> {

    Text word;            // the term
    IntWritable type;     // tag distinguishing the record type

    public Key() {
        word = new Text();
        type = new IntWritable();
    }

    public Text getWord() {
        return word;
    }

    public void setWord(Text word) {
        this.word = word;
    }

    public IntWritable getType() {
        return type;
    }

    public void setType(IntWritable type) {
        this.type = type;
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        // deserialize the fields in the same order they are written
        this.word.readFields(in);
        this.type.readFields(in);
    }

    @Override
    public void write(DataOutput out) throws IOException {
        // serialize the word first, then the type tag
        this.word.write(out);
        this.type.write(out);
    }

    @...
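For illustration, here is a minimal sketch of the kind of reducer such a final job might use, assuming the earlier jobs have already produced per-document term frequencies and that the total document count is passed in through the configuration. TfIdfReducer, the total.docs property, and the "docId=tf" value layout are assumptions made for the sketch, not taken from the original post.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Hypothetical final reducer: each value carries a "docId=tf" pair produced
    // by the earlier jobs; df is simply the number of documents seen for a word.
    public class TfIdfReducer extends Reducer<Text, Text, Text, DoubleWritable> {

        private long totalDocs;   // N: total number of documents, from the config

        @Override
        protected void setup(Context context) {
            totalDocs = context.getConfiguration().getLong("total.docs", 1);
        }

        @Override
        protected void reduce(Text word, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // First pass: copy values to strings (Hadoop reuses the Text objects)
            // and count how many documents contain this word.
            List<String> docTfPairs = new ArrayList<>();
            for (Text v : values) {
                docTfPairs.add(v.toString());          // e.g. "doc1=3"
            }
            int df = docTfPairs.size();
            double idf = Math.log((double) totalDocs / df);

            // Second pass: emit word@doc -> tf * idf
            for (String pair : docTfPairs) {
                String[] parts = pair.split("=");
                double tf = Double.parseDouble(parts[1]);
                context.write(new Text(word + "@" + parts[0]),
                              new DoubleWritable(tf * idf));
            }
        }
    }

With this particular value layout the document frequency falls out of the number of values seen per word, so no separate DF input is needed.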

Reduce Side Join (Secondary Sorting) program code.

Custom Key

public class EmployeeKey implements WritableComparable<EmployeeKey> {

    Text StateID;         // natural join key
    IntWritable type;     // tags which dataset the record came from

    public EmployeeKey() {
        StateID = new Text();
        type = new IntWritable();
    }

    public Text getStateID() {
        return StateID;
    }

    public void setStateID(Text stateID) {
        StateID = stateID;
    }

    public IntWritable getType() {
        return type;
    }

    public void setType(IntWritable type) {
        this.type = type;
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        // deserialize the fields in the same order they are written
        this.StateID.readFields(in);
        this.type.readFields(in);
    }

    @Override
    public void write(DataOutput out) throws IOException {
        this.StateID.write(out);
        this.type.write(out);
    }

    @Override
    public int compareTo(EmployeeKey o) {
        // sort by StateID first, then by type (the secondary sort)
        int cmp = 0;
        cmp = this.StateID.compareTo(o.getStateID());
        if (cmp =...
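For the secondary sort to work, a composite key like this also needs a partitioner and a grouping comparator, so that all records for a StateID reach the same reducer and are grouped into one reduce() call regardless of their type. A minimal sketch under those assumptions follows; the class names are illustrative and each class would live in its own .java file (the original post's versions are not shown in this excerpt).

    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Partition only on StateID so every record for a state goes to the same reducer.
    public class StatePartitioner extends Partitioner<EmployeeKey, Writable> {
        @Override
        public int getPartition(EmployeeKey key, Writable value, int numPartitions) {
            return (key.getStateID().hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }

    // Group reducer input only on StateID, so both sides of the join
    // (different 'type' values) arrive together, already sorted by type.
    public class StateGroupingComparator extends WritableComparator {
        protected StateGroupingComparator() {
            super(EmployeeKey.class, true);
        }
        @Override
        public int compare(WritableComparable a, WritableComparable b) {
            EmployeeKey left = (EmployeeKey) a;
            EmployeeKey right = (EmployeeKey) b;
            return left.getStateID().compareTo(right.getStateID());
        }
    }

In the driver these would be wired in with job.setPartitionerClass(StatePartitioner.class) and job.setGroupingComparatorClass(StateGroupingComparator.class), while the key's compareTo above keeps records sorted by type within each state.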

Failover and fencing

1. The transition from the active namenode to the standby is managed by a new entity in the system called the failover controller.
2. Failover controllers are pluggable, but the first implementation uses ZooKeeper to ensure that only one namenode is active.
3. Each namenode runs a lightweight failover controller process whose job is to monitor its namenode for failures (using a simple heartbeating mechanism) and trigger a failover should the namenode fail.
4. Failover may also be initiated manually by an administrator, for example for routine maintenance. This is known as a graceful failover, since the failover controller arranges an orderly transition for both namenodes to switch roles.
5. In the case of an ungraceful failover, however, it is impossible to be sure that the failed namenode has stopped running. For example, a slow network or a network partition can trigger a failover transition, even though the previously active namenode is still runn...
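As a rough illustration of how this is wired up, automatic failover and fencing are configured along the following lines. The property names are the standard HDFS HA ones; the hostnames and key path are placeholders, not values from the post.

    <!-- hdfs-site.xml: enable automatic failover and define how the previously
         active namenode is fenced before the standby takes over -->
    <property>
        <name>dfs.ha.automatic-failover.enabled</name>
        <value>true</value>
    </property>
    <property>
        <name>dfs.ha.fencing.methods</name>
        <!-- try to kill the old process over SSH; placeholder policy -->
        <value>sshfence</value>
    </property>
    <property>
        <name>dfs.ha.fencing.ssh.private-key-files</name>
        <value>/home/hadoop/.ssh/id_rsa</value>
    </property>

    <!-- core-site.xml: the ZooKeeper quorum used by the failover controllers -->
    <property>
        <name>ha.zookeeper.quorum</name>
        <value>zk1:2181,zk2:2181,zk3:2181</value>
    </property>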

Secondary NameNode check-pointing process

Image
The secondary namenode's purpose is to produce checkpoints of the primary's in-memory filesystem metadata. The checkpointing process proceeds as follows:
1. The secondary asks the primary to roll its edits file, so new edits go to a new file.
2. The secondary retrieves fsimage and edits from the primary (using HTTP GET).
3. The secondary loads fsimage into memory, applies each operation from edits, then creates a new consolidated fsimage file.
4. The secondary sends the new fsimage back to the primary (using HTTP POST).
5. The primary replaces the old fsimage with the new one from the secondary, and the old edits file with the new one it started in step 1. It also updates the fstime file to record the time that the checkpoint was taken.
At the end of the process, the primary has an up-to-date fsimage file and a shorter edits file (it is not necessarily empty, as it may have received some edits while the checkpoint was being taken). It is possible for an administrator...
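The checkpoint schedule itself is driven by two settings (Hadoop 1 property names; the values shown are the commonly cited defaults, included only as an example): fs.trash-style period-based checkpointing is not involved here, only fs.checkpoint.period, which controls how often a checkpoint is taken, and fs.checkpoint.size, which forces one when the edits file grows too large.

    <!-- core-site.xml: secondary namenode checkpoint schedule -->
    <property>
        <name>fs.checkpoint.period</name>
        <value>3600</value>       <!-- take a checkpoint every hour (seconds) -->
    </property>
    <property>
        <name>fs.checkpoint.size</name>
        <value>67108864</value>   <!-- or sooner, once edits reaches 64 MB (bytes) -->
    </property>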

MapReduce in HADOOP 1

Image
Submission
Step 1 in the figure: The submit() method on Job creates an internal JobSubmitter instance and calls submitJobInternal() on it. The job submission process implemented by JobSubmitter does the following.
Step 2: Asks the jobtracker for a new job ID (by calling getNewJobId() on JobTracker).
Step 3...
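From the client's side, all of this is kicked off by a few lines in the driver; a minimal sketch is shown below. WordCountDriver, WordCountMapper, WordCountReducer, and the argument paths are placeholder names, not taken from the post.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "word count");        // Hadoop 1 style constructor
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(WordCountMapper.class);    // hypothetical mapper
            job.setReducerClass(WordCountReducer.class);  // hypothetical reducer
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            // waitForCompletion() calls submit() internally, which is where the
            // JobSubmitter steps described above begin.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }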

MapReduce in HADOOP 2

Image
Submission
Step 1 in the figure: The submit() method on Job creates an internal JobSubmitter instance and calls submitJobInternal() on it. The job submission process implemented by JobSubmitter does the following.
Step 2: Asks the resource manager for a new application ID (used as the MapReduce job ID).
Step 3: Checks the output specification of the job, computes input splits, and copies job resources (job JAR, configuration, and split information) to HDFS.
Step 4: Finally, the job is submitted by calling submitApplication() on the resource manager.
Job Initialization
Steps 5a and 5b: When the resource manager receives a call to its submitApplication(), it hands off the request to the scheduler. The scheduler allocates a container, and the resource manager then launches the application master's process there, under the node manager's management.
Step 6: The application master initializes the job by creating a number of bookkeeping objects to keep track of the job's progress, as it will receive progr...
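What routes the submission through the resource manager rather than a jobtracker is the framework setting in mapred-site.xml. The property names below are the standard Hadoop 2 ones; the hostname is a placeholder.

    <!-- mapred-site.xml: run MapReduce jobs on YARN -->
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>

    <!-- yarn-site.xml: where submitApplication() calls are sent -->
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>resourcemanager.example.com</value>
    </property>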

Hadoop 1 Vs Hadoop 2

Limitations of Hadoop 1
- No horizontal scalability of the NameNode: as the cluster size grows, the NameNode's metadata becomes a bottleneck.
- No NameNode high availability: the NameNode is a single point of failure; if it fails, the whole HDFS is down.
- Overburdened JobTracker: scheduling jobs, monitoring jobs, rescheduling tasks when a TaskTracker fails, and managing resources are all done by this one process.
- No multi-tenancy: different kinds of jobs (MapReduce, streaming jobs, interactive jobs) cannot share the same resources at the same time.
Hadoop 2
1. HDFS Federation
- We can have more than one NameNode; each NameNode has a namespace associated with it.
- ...
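As a rough sketch of the HDFS Federation point above, two NameNodes with separate namespaces are declared along these lines (standard Hadoop 2 property names; the nameservice IDs and hostnames are placeholders):

    <!-- hdfs-site.xml: two independent namespaces served by two namenodes -->
    <property>
        <name>dfs.nameservices</name>
        <value>ns1,ns2</value>
    </property>
    <property>
        <name>dfs.namenode.rpc-address.ns1</name>
        <value>namenode1.example.com:8020</value>
    </property>
    <property>
        <name>dfs.namenode.rpc-address.ns2</name>
        <value>namenode2.example.com:8020</value>
    </property>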

Setup Trash in hadoop

To set up trash in Hadoop, all you need to do is set fs.trash.interval and fs.trash.checkpoint.interval in the /usr/local/hadoop/conf/core-site.xml file.

First, upload a file to HDFS (e.g. hadoop fs -put shakespeare shakee).
Delete the file from HDFS (e.g. hadoop fs -rmr shakee).
Reload the HDFS page in the browser and you will see the .Trash directory.

Below is an example configuration:

<property>
    <name>fs.trash.interval</name>
    <value>3</value>
</property>
<property>
    <name>fs.trash.checkpoint.interval</name>
    <value>1</value>
</property>

Here fs.trash.interval is 3, which means deleted files sitting in the .Trash folder are removed permanently after 3 minutes. fs.trash.checkpoint.interval is 1, which means a check is performed every minute and deletes all files that ...
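A small follow-up that may help while testing: before the interval expires, the deleted file can be recovered simply by moving it back out of trash. A sketch, assuming the default trash location under the deleting user's HDFS home directory (the user name hadoop is a placeholder):

    # deleted files keep their original path under /user/<username>/.Trash/Current
    hadoop fs -mv /user/hadoop/.Trash/Current/user/hadoop/shakee shakee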