Multi-Node Hadoop Set-up
- Assign host names to the master, slave, and client machines
- Install ssh
-"sudo yum install openssh-server" - Generate .ssh password less key
-"ssh-keygen"
-press Enter - press Enter - Now we should add IP address and host name of the master,slave and client machine in /etc/hosts file
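- For example, the /etc/hosts entries could look like the following (the IP addresses are placeholders; NameNode is the master's host name as used later in the configuration, and slave1, slave2, client are example names):
192.168.1.10 NameNode
192.168.1.11 slave1
192.168.1.12 slave2
192.168.1.13 client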
- Share the generated key across all slave nodes
- "ssh-copy-id -i $HOME/.ssh/id.rsa.pub <user>@<hostname>"
- Set up Java on the master node
- Download Java from Oracle and export JAVA_HOME and PATH in the .bashrc file, then run "source .bashrc" (example entries below)
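- For example, the .bashrc entries could look like this (the JDK path matches the one used later in hadoop-env.sh; adjust it to your actual install location):
export JAVA_HOME=/usr/local/java/jdk1.7.0_75
export PATH=$PATH:$JAVA_HOME/bin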
- Set up Hadoop on the master node; let's say /usr/local/hadoop is the HADOOP_HOME
- Download Hadoop from Apache and export HADOOP_HOME and PATH in the .bashrc file, then run "source .bashrc" (example entries below)
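- For example, the .bashrc entries could look like this:
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin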
- Now we need to configure 6 files in the $HADOOP_HOME/conf/ directory: hadoop-env.sh, core-site.xml, mapred-site.xml, hdfs-site.xml, masters, and slaves
- -"sudo nano /usr/local/hadoop/conf/hadoop-env.sh"
- add JAVA_HOME (export JAVA_HOME=/usr/local/java/jdk1.7.0_75)
- "sudo nano /usr/local/hadoop/conf/core-site.xml"
- add the entries below. Note that NameNode is the host name of the machine where the NameNode process of Hadoop HDFS is running; in my case NameNode is the master node's host name.
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/usr/local/hadoop/tmp</value>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://NameNode:54310</value>
</property>
</configuration>
- "sudo mkdir -p /usr/local/hadoop/tmp"
- "sudo nano /usr/local/hadoop/conf/mapred-site.xml"
- add the entries below. Note that NameNode is the host name of the machine where the JobTracker process of Hadoop MapReduce is running.
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>NameNode:54311</value>
</property>
</configuration>
- "sudo nano /usr/local/hadoop/conf/hdfs-site.xml"
- add the entries below. dfs.replication is the number of copies HDFS keeps of each block, and dfs.block.size is the block size in bytes (67108864 bytes = 64 MB).
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.block.size</name>
<value>67108864</value>
</property>
</configuration>
- Now we need to configure the masters file; all we need to do here is add the host name of the node where the NameNode and JobTracker processes will run
- Similarly, add the host names to the slaves file, so that the DataNode and TaskTracker processes will run on those slave machines (example contents of both files below)
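- For example, with the host names from the /etc/hosts example above, /usr/local/hadoop/conf/masters would contain just:
NameNode
- and /usr/local/hadoop/conf/slaves would contain:
slave1
slave2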
o "rsync -avg /usr/local/hadoop/ <user>@<hosatname>/usr/local/hadoop/"
- Note: here I followed the same directory structure on both the master and slave machines
10. Similarly, we can also sync Java from the master machine to the slave machines
o"rsync -avg /usr/local/java/jdk1.7.0_75/ <user>@<hosatname>/usr/local/java/jdk1.7.0_75/"
- Note: here I maintained the same directory, so there is no need to change JAVA_HOME in /usr/local/hadoop/conf/hadoop-env.sh
o But JAVA_HOME and HADOOP_HOME should also be set properly on each slave machine, or we can use rsync for the .bashrc file as well (example below).
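- For example, assuming the same user name and home directory layout on every node, and a slave's host name in place of <hostname>:
"rsync -avg $HOME/.bashrc <user>@<hostname>:$HOME/.bashrc"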
11. Now delete /usr/local/hadoop/tmp on all the nodes, and from the master node run
"hadoop namenode -format"
12. Now we can start our Hadoop multi-node set-up by running "start-all.sh" from the master machine
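- To verify that the cluster came up (assuming the set-up above), run "jps" on each node; roughly, the expected processes are:
on the master: NameNode, SecondaryNameNode, JobTracker
on each slave: DataNode, TaskTracker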