MapReduce in Hadoop 2





Submission


  • Step 1 in figure: The submit() method on Job creates an internal JobSubmitter instance and calls submitJobInternal() on it. The job submission process implemented by JobSubmitter does the following.
  • Step 2: Asks the resource manager for a new application ID, used for the MapReduce job ID.
  • Step 3: Checks the output specification of the job, computes the input splits, and copies the job resources (the job JAR, configuration, and split information) to HDFS.
  • Step 4: Finally, submits the job by calling submitApplication() on the resource manager.
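A minimal driver sketch showing where these steps kick in (the WordCountMapper/WordCountReducer classes and the input/output paths are hypothetical placeholders; this needs the Hadoop libraries on the classpath):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);         // job JAR copied to HDFS in step 3
        job.setMapperClass(WordCountMapper.class);        // hypothetical mapper class
        job.setReducerClass(WordCountReducer.class);      // hypothetical reducer class
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input splits computed from this path
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output spec checked in step 3
        // waitForCompletion() calls submit() internally, which drives steps 1-4
        // via JobSubmitter, then polls the job's progress until it finishes.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```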

Job Initialization


  • Step 5a and 5b: When the resource manager receives the call to submitApplication(), it hands off the request to the scheduler. The scheduler allocates a container, and the resource manager then launches the application master's process there, under the node manager's management.
  • Step 6: The application master initializes the job by creating a number of bookkeeping objects to keep track of the job's progress, as it will receive progress and completion reports from the tasks.
  • Step 7: It then retrieves the input splits computed in the client from the shared filesystem.



Task Assignment


  • Step 8: If the job does not qualify for running as an uber task (i.e., in the same JVM as the application master), the application master requests containers for all the map and reduce tasks in the job from the resource manager.
Note: Each request includes information about the map task's data locality, in particular the hosts and corresponding racks that its input split resides on. The scheduler uses this information to make scheduling decisions. How? It attempts to place tasks on data-local nodes (the ideal case), but if this is not possible it prefers rack-local placement to non-local placement.
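Whether a job qualifies as an uber task is controlled by configuration: by default, uber tasks are disabled, and a qualifying job must have fewer than 10 mappers and at most one reducer. A sketch of the relevant properties (values shown are the defaults):

```xml
<!-- mapred-site.xml: uber task thresholds -->
<property>
  <name>mapreduce.job.ubertask.enable</name>
  <value>false</value> <!-- set to true to allow small jobs to run as uber tasks -->
</property>
<property>
  <name>mapreduce.job.ubertask.maxmaps</name>
  <value>9</value>
</property>
<property>
  <name>mapreduce.job.ubertask.maxreduces</name>
  <value>1</value>
</property>
```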


Task Execution


  • Step 9a and 9b: Once the resource manager's scheduler has assigned a container to the task, the application master starts the container by contacting the node manager.
  • Step 10: The task is executed by a Java application whose main class is YarnChild. Before it runs the task, it localizes the resources the task needs, including the job configuration, the JAR file, and any files from the distributed cache.
  • Step 11: Finally, it runs the map or reduce task.
Note: Unlike MapReduce 1, YARN does not support JVM reuse, so each task runs in a new JVM. Streaming and Pipes programs work in the same way as in MapReduce 1: YarnChild launches the Streaming or Pipes process and communicates with it using standard input/output or a socket, respectively.


Progress and status updates


  • When running under YARN, the task reports its progress and status back to its application master every three seconds over the umbilical interface, giving the application master an aggregate view of the job. The client polls the application master every second to receive progress updates, which are usually displayed to the user.
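The client-side progress polling interval is configurable (the value shown is the default, in milliseconds):

```xml
<property>
  <name>mapreduce.client.progressmonitor.pollinterval</name>
  <value>1000</value> <!-- how often (ms) the client polls for progress -->
</property>
```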



Job Completion


  • Every five seconds the client checks whether the job has completed while it waits inside the waitForCompletion() method on Job. The polling interval can be set via the mapreduce.client.completion.pollinterval configuration property. On job completion, the application master and the task containers clean up their working state, and the OutputCommitter's job cleanup method is called. Job information is archived by the job history server to enable later interrogation by users if desired.
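The completion polling interval mentioned above can be changed in the job configuration (the value shown is the default, in milliseconds):

```xml
<property>
  <name>mapreduce.client.completion.pollinterval</name>
  <value>5000</value> <!-- how often (ms) waitForCompletion() checks whether the job is done -->
</property>
```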
