Posts

Showing posts from February, 2015

Hadoop 1 Vs Hadoop 2

Limitations of Hadoop 1
- No horizontal scalability of the NameNode: as the cluster size grows, the NameNode metadata becomes a bottleneck.
- No NameNode high availability: the NameNode is a single point of contact, and if it fails the whole HDFS is down.
- Overburdened JobTracker: scheduling jobs, monitoring jobs, rescheduling tasks when a TaskTracker fails, and managing resources are all done by this one process.
- No multi-tenancy: different kinds of jobs (MapReduce, streaming jobs, interactive jobs) cannot run on the same resources at the same time.

Hadoop 2
1. HDFS Federation: we can have more than one NameNode, and each NameNode has a namespace associated with it. ...
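To illustrate HDFS Federation, here is a minimal hdfs-site.xml sketch with two NameNodes; the nameservice IDs (ns1, ns2) and host names (namenode1, namenode2) are placeholder assumptions, not values from the post:

    <property>
        <name>dfs.nameservices</name>
        <value>ns1,ns2</value>
    </property>
    <property>
        <name>dfs.namenode.rpc-address.ns1</name>
        <value>namenode1:8020</value>
    </property>
    <property>
        <name>dfs.namenode.rpc-address.ns2</name>
        <value>namenode2:8020</value>
    </property>

Each NameNode serves its own namespace independently, which is what removes the single-NameNode metadata bottleneck described above.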

Setup Trash in hadoop

To set up trash in Hadoop, all you need to do is set fs.trash.interval and fs.trash.checkpoint.interval in the /usr/local/hadoop/conf/core-site.xml file.
- First upload a file to your HDFS (ex. hadoop fs -put shakespeare shakee)
- Delete the file from HDFS (ex. hadoop fs -rmr shakee)
- Reload the HDFS page in the browser and you will see the .Trash directory

Below is the example configuration:

<property>
    <name>fs.trash.interval</name>
    <value>3</value>
</property>
<property>
    <name>fs.trash.checkpoint.interval</name>
    <value>1</value>
</property>

Here the fs.trash.interval value is 3, which means deleted files sitting in the .Trash folder will get removed permanently in 3 minutes. The fs.trash.checkpoint.interval value is 1, which means a check is performed every minute and deletes all files that ...
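A deleted file can also be recovered from trash before the interval expires by moving it back out; a rough sketch, where the user name hduser and the resulting trash path are assumptions rather than values from the post:

    hadoop fs -ls /user/hduser/.Trash/Current                              # see what is still recoverable
    hadoop fs -mv /user/hduser/.Trash/Current/user/hduser/shakee shakee    # restore the deleted file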

Problems/exceptions faced in Hadoop setup

1. DataNode is not starting: namespace ID mismatch.
Reason: every time the NameNode is reformatted it generates a new namespace ID, and some of your machines didn't get the memo.
Solution: the easiest way to fix this problem is by reformatting the distributed filesystem. Delete the Hadoop temp folder (/usr/local/hadoop/hadoop-tmp) on each server (the directory path may be different depending on your configuration). Then run hadoop namenode -format from your master server. After that, start-all.sh should do the trick.
2. The Hadoop master and slave versions must be the same.
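A rough command sketch of the namespace ID fix from point 1 above, assuming the temp directory mentioned in the post and a stock Hadoop 1 install (note that reformatting destroys all data in HDFS):

    # run on every node: remove the stale Hadoop temp directory
    rm -rf /usr/local/hadoop/hadoop-tmp
    # run on the master only: reformat the distributed filesystem
    hadoop namenode -format
    # then restart all the daemons
    start-all.sh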

Multi Node hadoop set-up

- Assign host names to the master, slaves, and client.
- Install ssh: "sudo yum install openssh-server"
- Generate a passwordless .ssh key: "ssh-keygen" (press Enter, press Enter)
- Now we should add the IP address and host name of the master, slave, and client machines to the /etc/hosts file (see the sketch after this list).
- Share the generated key across all slave nodes: "ssh-copy-id -i $HOME/.ssh/id_rsa.pub <user>@<hostname>"
- Set up Java on the master node: download Java from Oracle, export JAVA_HOME and PATH in the .bashrc file, then "source .bashrc"
- Set up Hadoop on the master node: let's say /usr/local/hadoop is the HADOOP_HOME; download Hadoop from Apache, export HADOOP_HOME and PATH in the .bashrc file, then "source .bashrc"
- Now we need to configure 6 files in the $HADOOP_HOME/conf/ directory:
  - "sudo nano /usr/local/hadoop/conf/hadoop-env.sh": add JAVA_HOME (export JAVA_HOME=/usr/local/java/jdk1.7.0_75)
  - "sudo nano ...
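As an illustration of the /etc/hosts and key-sharing steps above, a minimal sketch; the IP addresses, host names (hadoop-master, hadoop-slave1, hadoop-slave2), and user name (hduser) are placeholders, not values from the post:

    # /etc/hosts (same entries on master, slaves, and client)
    192.168.1.10    hadoop-master
    192.168.1.11    hadoop-slave1
    192.168.1.12    hadoop-slave2

    # copy the public key to each slave so ssh becomes passwordless
    ssh-copy-id -i $HOME/.ssh/id_rsa.pub hduser@hadoop-slave1
    ssh-copy-id -i $HOME/.ssh/id_rsa.pub hduser@hadoop-slave2
    ssh hadoop-slave1    # should now log in without asking for a password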

Setting up a single node Hadoop cluster on Ubuntu

1. Install ssh
   - "sudo apt-get install openssh-server"
2. Generate a passwordless .ssh key
   - "ssh-keygen" (press Enter, press Enter)
   - (You can try "ssh localhost", but this will ask you for a password)
3. Copy the generated key to authorized_keys, to avoid the password prompt
   - "cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys"
   - (Now you can try "ssh localhost"; this won't ask you for a password any more)
4. Download and install Oracle Java
   - Download Java from Oracle; in my case it is "jdk-7u75-linux-x64.tar.gz"
   - Extract it with "tar -zxvf jdk-7u75-linux-x64.tar.gz"
   - Make the directory: "sudo mkdir -p /usr/local/java"
   - Move it to the java directory: "sudo mv jdk1.7.0_75 /usr/local/java"
5. Download Apache Hadoop from "http://hadoop.apache.org/releas...
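Following on from the Java install above, a rough sketch of the .bashrc exports that make Java and Hadoop usable from the shell; the Hadoop path /usr/local/hadoop is an assumption carried over from the multi-node post:

    # add to ~/.bashrc, then run: source ~/.bashrc
    export JAVA_HOME=/usr/local/java/jdk1.7.0_75
    export HADOOP_HOME=/usr/local/hadoop
    export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin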