MapReduce in Hadoop 1
Job Submission
- Step 1 in the figure: The submit() method on Job creates an internal JobSubmitter instance and calls submitJobInternal() on it. The job submission process implemented by JobSubmitter does the following.
- Step 2: Asks the jobtracker for a new job ID (by calling getNewJobId() on JobTracker).
- Step 3: Checks the output specification of the job, computes the input splits, and copies the job resources (the job JAR, configuration, and split information) to HDFS.
- Step 4: Finally, submits the job by telling the jobtracker that it is ready for execution (by calling submitJob() on JobTracker). A minimal driver whose waitForCompletion() call kicks off steps 1-4 is sketched after this list.
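Below is a minimal driver sketch, assuming the org.apache.hadoop.mapreduce API as shipped with Hadoop 1. The class name MyJobDriver and the use of the identity Mapper/Reducer are illustrative choices to keep the sketch self-contained; a real job supplies its own mapper and reducer classes.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MyJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "my job");             // Hadoop 1 style constructor
        job.setJarByClass(MyJobDriver.class);          // job JAR copied to HDFS in step 3
        job.setMapperClass(Mapper.class);              // identity mapper, just for this sketch
        job.setReducerClass(Reducer.class);            // identity reducer, just for this sketch
        job.setOutputKeyClass(LongWritable.class);     // matches TextInputFormat's key type
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // splits computed from this input
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output spec checked in step 3
        // waitForCompletion() calls submit(), which drives steps 1-4 above,
        // then polls the jobtracker until the job finishes.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```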
Job Initialization
- Step 5: When the JobTracker receives a call to its submitJob() method, it places the job in an internal queue from which the job scheduler will pick it up and initialize it. Initialization involves creating an object to represent the running job, which encapsulates its tasks and the bookkeeping information needed to track the tasks’ status and progress.
- Step 6: To create the list of tasks to run, the job scheduler first retrieves the input splits computed by the client from the shared filesystem. It then creates one map task for each split. The number of reduce tasks to create is determined by the mapred.reduce.tasks property in the Job, which is set by the setNumReduceTasks() method, and the scheduler simply creates this number of reduce tasks to be run. Tasks are given IDs at this point. In addition to the map and reduce tasks, two further tasks are created: a job setup task and a job cleanup task. These are run by tasktrackers and are used to run code to set up the job before any map tasks run, and to clean up after all the reduce tasks are complete.
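For illustration, the reduce task count could be set in the driver sketched earlier; the value 4 here is arbitrary.

```java
// Fragment, continuing the driver sketch above: the scheduler creates one map
// task per input split and exactly as many reduce tasks as are configured here.
job.setNumReduceTasks(4);                 // backed by the mapred.reduce.tasks property
// or, equivalently, on the Configuration before the Job object is created:
// conf.setInt("mapred.reduce.tasks", 4);
```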
Task Assignment
- Step 7: Tasktrackers run a simple loop that periodically sends heartbeat method calls to the jobtracker. Heartbeats tell the jobtracker that a tasktracker is alive, but they also double as a channel for messages. As part of the heartbeat, a tasktracker indicates whether it is ready to run a new task, and if it is, the jobtracker will allocate it a task, which it communicates to the tasktracker using the heartbeat return value.
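The heartbeat exchange can be pictured roughly as the loop below. This is a conceptual sketch only; the names (JobTrackerEndpoint, TaskTrackerLoop) are made up and do not correspond to Hadoop's real internal classes.

```java
import java.util.Optional;

// Conceptual sketch of the tasktracker heartbeat loop; all names are illustrative.
interface JobTrackerEndpoint {
    // The return value doubles as the channel for new task assignments.
    Optional<String> heartbeat(String taskTrackerId, boolean readyForNewTask);
}

class TaskTrackerLoop {
    private final JobTrackerEndpoint jobTracker;
    private final String id;

    TaskTrackerLoop(JobTrackerEndpoint jobTracker, String id) {
        this.jobTracker = jobTracker;
        this.id = id;
    }

    void run() throws InterruptedException {
        while (true) {
            boolean haveFreeSlot = true;  // in reality: running tasks vs. configured slots
            // "I am alive; do you have a task for me?"
            Optional<String> assignedTask = jobTracker.heartbeat(id, haveFreeSlot);
            assignedTask.ifPresent(task -> System.out.println("Launching " + task));
            Thread.sleep(3000);           // heartbeats are sent every few seconds
        }
    }
}
```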
Task Execution
- Step 8: First, it localizes the job JAR by copying it from the shared filesystem to the tasktracker’s filesystem. It also copies any files the application needs from the distributed cache to the local disk. Second, it creates a local working directory for the task and un-jars the contents of the JAR into this directory. Third, it creates an instance of TaskRunner to run the task.
- Step 9: TaskRunner launches a new Java Virtual Machine to run each task in, so that any bugs in the user-defined map and reduce functions don’t affect the tasktracker (see the sketch after this list).
- Step 10: Finally, it runs the map or reduce task.
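The per-task child JVM can be pictured with the simplified sketch below; it only illustrates why a separate process isolates user code from the tasktracker. TaskRunner's real implementation is different, and ChildTask, the working directory, and the task attempt ID are all made-up placeholders.

```java
import java.io.File;
import java.io.IOException;

// Simplified illustration of launching a child JVM per task attempt.
public class LaunchChildJvm {
    public static void main(String[] args) throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(
                "java",
                "-cp", "job-workdir/classes:job-workdir/job.jar", // un-jarred job resources
                "ChildTask",                                      // placeholder task entry point
                "attempt_201301010000_0001_m_000000_0");          // placeholder task attempt ID
        pb.directory(new File("job-workdir"));  // local working directory from step 8
        pb.inheritIO();
        Process child = pb.start();
        // A crash or OutOfMemoryError here kills only the child JVM, not the tasktracker.
        int exit = child.waitFor();
        System.out.println("Task exited with status " + exit);
    }
}
```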
Progress and Status Updates
- When a task is running, it keeps track of its progress, that is, the proportion of the task completed. For map tasks, this is the proportion of the input that has been processed. For reduce tasks, it’s a little more complex, but the system can still estimate the proportion of the reduce input processed. It does this by dividing the total progress into three parts, corresponding to the three phases of the shuffle. For example, if the task has run the reducer on half its input, then the task’s progress is ⅚, since it has completed the copy and sort phases (⅓ each) and is halfway through the reduce phase (⅙).
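The weighting can be made concrete with a small worked example; the method name reduceProgress is ours, not Hadoop's.

```java
// Worked example of the reduce-side progress estimate described above.
public class ReduceProgress {
    // Each of the three phases (copy, sort, reduce) contributes one third.
    static double reduceProgress(double copyDone, double sortDone, double reduceDone) {
        return (copyDone + sortDone + reduceDone) / 3.0;
    }

    public static void main(String[] args) {
        // Copy and sort complete, reducer halfway through its input:
        System.out.println(reduceProgress(1.0, 1.0, 0.5)); // 0.8333... = 5/6
    }
}
```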
Job Completion
- When the jobtracker receives a notification that the last task for a job is complete (this will be the special job cleanup task), it changes the status for the job to “successful.” Then, when the Job polls for status, it learns that the job has completed successfully, so it prints a message to tell the user and then returns from the waitForCompletion() method.
- The jobtracker also sends an HTTP job notification if it is configured to do so. Clients wishing to receive callbacks can configure this via the job.end.notification.url property (see the sketch after this list).
- Last, the jobtracker cleans up its working state for the job and instructs tasktrackers to do the same (so intermediate output is deleted, for example).
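A sketch of enabling that notification, assuming a made-up callback endpoint; the $jobId and $jobStatus placeholders in the URL are replaced by the framework when the notification is sent.

```java
import org.apache.hadoop.conf.Configuration;

// Sketch: configure the HTTP job-end notification before submitting the job.
// The endpoint URL below is a made-up example.
public class NotificationConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.set("job.end.notification.url",
                 "http://example.com/notify?jobId=$jobId&status=$jobStatus");
        // Pass this Configuration to the Job that is then submitted.
    }
}
```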