Multi-Node Hadoop Setup



  1. Assign a host name to the master, slave and client machines
  2. Install ssh
        -"sudo yum install openssh-server"
  3. Generate a passwordless SSH key
        -"ssh-keygen"
        -press Enter at each prompt to accept the defaults (empty passphrase)
  4. Now add the IP address and host name of the master, slave and client machines to the /etc/hosts file on each node, for example:
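        -(hypothetical IP addresses and host names, replace them with your own)
              192.168.1.10    master
              192.168.1.11    slave1
              192.168.1.12    slave2
              192.168.1.13    client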
  5. Share the generated public key across all slave nodes
    •   "ssh-copy-id -i $HOME/.ssh/id_rsa.pub <user>@<hostname>"
  6. Set up Java on the master node
    • Download the JDK from Oracle, export JAVA_HOME and PATH in the .bashrc file and then run "source ~/.bashrc", for example:
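        -(a minimal sketch, assuming the JDK is extracted to /usr/local/java/jdk1.7.0_75, the same path used later in hadoop-env.sh)
              export JAVA_HOME=/usr/local/java/jdk1.7.0_75
              export PATH=$PATH:$JAVA_HOME/bin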
  7. Set up Hadoop on the master node; let's say /usr/local/hadoop is the HADOOP_HOME
    • Download Hadoop from Apache, export HADOOP_HOME and PATH in the .bashrc file and then run "source ~/.bashrc", for example:
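        -(a minimal sketch, assuming Hadoop is extracted to /usr/local/hadoop)
              export HADOOP_HOME=/usr/local/hadoop
              export PATH=$PATH:$HADOOP_HOME/bin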
  8. Now we need to configure six files in the $HADOOP_HOME/conf/ directory
    • -"sudo nano /usr/local/hadoop/conf/hadoop-env.sh"
       -add JAVA_HOME (export JAVA_HOME=/usr/local/java/jdk1.7.0_75)
    • "sudo nano /usr/local/hadoop/conf/core-site.xml"
       -add the entries below. Note that NameNode is the host name of the machine where the NameNode process of HDFS runs; in my case it is the master node's host name.
                     <configuration>
                         <property> 
                              <name>hadoop.tmp.dir</name>
                              <value>/usr/local/hadoop/tmp</value>
                       </property> 

                       <property> 
                              <name>fs.default.name</name> 
                              <value>hdfs://NameNode:54310</value>
                       </property> 
                   </configuration>
             
    • "sudo mkdir -p /usr/local/hadoop/tmp"
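    • Since the directory is created with sudo it will be owned by root; assuming Hadoop runs as a regular user, change the ownership so that user can write to it, e.g. "sudo chown -R <user> /usr/local/hadoop/tmp"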
    • "sudo nano /usr/local/hadoop/conf/mapred-site.xml"
       -add the entries below. Note that NameNode is the host name of the machine where the JobTracker process of MapReduce runs; here it is the same master node.
       <configuration>
              <property> 
                      <name>mapred.job.tracker</name> 
                            <value>NameNode:54311</value>
              </property> 
       </configuration>
    • "sudo nano /usr/local/hadoop/conf/hdfs-site.xml"
       <configuration>
                <property>
                    <name>dfs.replication</name>
                    <value>1</value>
                </property>
                <property>
                    <name>dfs.block.size</name>
                    <value>67108864</value>
                </property>   
        </configuration>
    • Now we need to configure the masters file; all we need to do is add the host name of the node where the NameNode and JobTracker processes will run
    • Similarly, add the host names of the slave machines to the slaves file, so that the DataNode and TaskTracker processes run on those machines, for example:
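        -(hypothetical host names, matching the /etc/hosts example above)
              masters:
                    master
              slaves:
                    slave1
                    slave2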
  9. Now the master machine is ready, so I will use the rsync command to sync Hadoop to the slave nodes and the client node
    •   "rsync -avg /usr/local/hadoop/ <user>@<hostname>:/usr/local/hadoop/"
        -Note: I followed the same directory structure on both the master and slave machines
       
  10. Similarly, we can also sync Java from the master machine to the slave machines
    •   "rsync -avg /usr/local/java/jdk1.7.0_75/ <user>@<hostname>:/usr/local/java/jdk1.7.0_75/"
        -Note: I maintained the same directory layout, so there is no need to change JAVA_HOME in /usr/local/hadoop/conf/hadoop-env.sh
    •   But JAVA_HOME and HADOOP_HOME should still be set properly on the slave machines, or we can use rsync for the .bashrc file as well, for example:
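        -(the remote path is relative to that user's home directory)
              "rsync -avg $HOME/.bashrc <user>@<hostname>:.bashrc"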

  11. Now delete /usr/local/hadoop/tmp from all the nodes, and from the master node run
        "hadoop namenode -format"
      
  12. Now we can start our Hadoop multi-node setup by running "start-all.sh" from the master machine
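        -To verify that the daemons started, you can run "jps" on each node; on the master you should typically see the NameNode, SecondaryNameNode and JobTracker processes, and on the slaves the DataNode and TaskTracker processes.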