This guide walks step by step through installing and configuring a multi-node Hadoop 1.x cluster.
Cluster & User
Suppose we want to set up a cluster of 5 machines, 192.168.0.1 through 192.168.0.5.
We want the machine 192.168.0.1 to be the master node and all five machines, including the master, to be slave nodes. The master node is the one on which the JobTracker and NameNode daemons run. The DataNode and TaskTracker daemons run on all the slaves.
We will use a dedicated user for the Hadoop cluster. Create a user, e.g. ‘hadoopUser’, on all the machines.
The Hadoop version we will install and configure is – hadoop-1.2.1
Download from – http://www.eu.apache.org/dist/hadoop/common/hadoop-1.2.1/hadoop-1.2.1.tar.gz
Extract the content of hadoop-1.2.1.tar.gz.
tar -xf hadoop-1.2.1.tar.gz
All the contents of the tar file will be extracted into the ‘hadoop-1.2.1’ directory.
Setting the Password-less SSH
We need to set up password-less SSH from the master node to all the slave nodes. The master node needs this to start the daemons on the slave nodes.
- Make sure that you are logged in to the master node as ‘hadoopUser’. Go to the home directory of hadoopUser.
- Generate the public/private RSA key pair by running the following command. You can keep the passphrase empty: just hit Enter when prompted for it.
ssh-keygen -t rsa
- The keys were generated in the previous step. Now add the public key to the authorized_keys file on your own machine to set up password-less SSH to localhost.
cat .ssh/id_rsa.pub >> .ssh/authorized_keys
- Try a password-less login to localhost (ssh localhost). The login should not prompt for a password.
- Now distribute the public key to all the other slave machines. Append the public key to the authorized_keys file on each slave machine that the master must reach over password-less SSH.
cat .ssh/id_rsa.pub | ssh hadoopUser@192.168.0.2 'cat >> /home/hadoopUser/.ssh/authorized_keys'
cat .ssh/id_rsa.pub | ssh hadoopUser@192.168.0.3 'cat >> /home/hadoopUser/.ssh/authorized_keys'
cat .ssh/id_rsa.pub | ssh hadoopUser@192.168.0.4 'cat >> /home/hadoopUser/.ssh/authorized_keys'
cat .ssh/id_rsa.pub | ssh hadoopUser@192.168.0.5 'cat >> /home/hadoopUser/.ssh/authorized_keys'
- Try a password-less login to each slave machine. The login should not prompt for a password.
ssh 192.168.0.2
(login successful)
exit
ssh 192.168.0.3
(login successful)
exit
ssh 192.168.0.4
(login successful)
exit
ssh 192.168.0.5
(login successful)
exit
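The per-host commands above can also be written as a loop. The following is a minimal sketch, not the original author's method: it swaps the `cat | ssh` pipeline for OpenSSH's `ssh-copy-id`, and the `DRY_RUN` switch and the `run` and `distribute_key` helper names are assumptions introduced here so that the script only prints the commands by default.

```shell
#!/bin/sh
# Sketch: distribute the master's public key to every slave and verify the login.
# Slave IPs and user name are taken from the steps above. DRY_RUN, run and
# distribute_key are hypothetical names used only for this illustration.

SLAVES="192.168.0.2 192.168.0.3 192.168.0.4 192.168.0.5"
HADOOP_USER="hadoopUser"
DRY_RUN="${DRY_RUN:-1}"

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

distribute_key() {
    for host in $SLAVES; do
        # ssh-copy-id appends the key to ~/.ssh/authorized_keys on the remote
        # host and creates ~/.ssh there if it does not exist yet.
        run ssh-copy-id -i "$HOME/.ssh/id_rsa.pub" "$HADOOP_USER@$host"
        # BatchMode=yes makes ssh fail instead of prompting for a password,
        # so a non-zero exit status means the key was not accepted.
        run ssh -o BatchMode=yes "$HADOOP_USER@$host" true
    done
}

distribute_key
```

Run it once as is to review the printed commands, then again with `DRY_RUN=0` to execute them. The first real run still asks for the password of each slave once, since the key is not installed yet.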
Add IP Address and Hostname Entries to the /etc/hosts File
We need to add the IP address and hostname mappings to the /etc/hosts file on all the machines in the cluster, so that the machines can identify each other by hostname. To get the hostname of a machine, run the ‘hostname’ command in the shell. Collect the hostnames of all the machines in the cluster and add a mapping for each IP address. For example:
192.168.0.1 machine1
192.168.0.2 machine2
192.168.0.3 machine3
192.168.0.4 machine4
192.168.0.5 machine5
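Because the same mappings have to be added on every machine, it can help to generate them with a script that is safe to run more than once. The sketch below is one possible approach under stated assumptions: the `add_host_entry` helper and the scratch file name are hypothetical, and the script deliberately writes to a scratch file rather than to /etc/hosts.

```shell
#!/bin/sh
# Sketch: append cluster host mappings to a hosts file, skipping IPs that are
# already present. HOSTS_FILE defaults to a scratch file (hypothetical name),
# not /etc/hosts, so the result can be reviewed first.

HOSTS_FILE="${HOSTS_FILE:-${TMPDIR:-/tmp}/hosts.cluster}"

# add_host_entry FILE IP HOSTNAME
add_host_entry() {
    # awk compares the first field literally, so the dots in the IP address
    # are not treated as regex wildcards. A missing file counts as "not found".
    if ! awk -v ip="$2" '$1 == ip { found = 1 } END { exit !found }' "$1" 2>/dev/null; then
        printf '%s %s\n' "$2" "$3" >> "$1"
    fi
}

# machine1..machine5 are the example hostnames used above.
i=1
while [ "$i" -le 5 ]; do
    add_host_entry "$HOSTS_FILE" "192.168.0.$i" "machine$i"
    i=$((i + 1))
done

cat "$HOSTS_FILE"
```

After reviewing the scratch file, its lines can be appended to /etc/hosts on each machine with root privileges.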
Now we are ready to configure our hadoop cluster.
- On all the nodes create a directory to hold hadoop data. This directory will hold the HDFS filesystem. Give 755 permission to this directory on all the machines.
mkdir /home/hadoopUser/hadoop_data
chmod -R 755 /home/hadoopUser/hadoop_data
- Edit the hadoop-1.2.1/conf/core-site.xml file. Add the following inside the <configuration> tag:
<property>
  <name>fs.default.name</name>
  <value>hdfs://192.168.0.1:54310</value>
  <description>Default URI for all filesystem requests in Hadoop. A filesystem path in Hadoop has two main components: a URI that identifies the filesystem, and a path specifying the location of a file/directory in the filesystem.</description>
</property>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/hadoopUser/hadoop_data</value>
  <description>The base directory under which hadoop stores all its data like namenode metadata, HDFS datablocks etc</description>
</property>
- Edit the hadoop-1.2.1/conf/hdfs-site.xml file. Add the following inside the <configuration> tag:
<property>
  <name>dfs.replication</name>
  <value>3</value>
  <description>Default number of block replications for HDFS blocks.</description>
</property>
- Edit the hadoop-1.2.1/conf/mapred-site.xml file. Add the following inside the <configuration> tag:
<property>
  <name>mapred.job.tracker</name>
  <value>192.168.0.1:54311</value>
  <description>The host and port that the MapReduce job tracker runs at.</description>
</property>
- Edit the hadoop-1.2.1/conf/hadoop-env.sh file and add an entry for the JAVA_HOME variable. Point JAVA_HOME to the directory where Java is installed.
- Edit the hadoop-1.2.1/conf/masters file and add the IP address of our master machine (192.168.0.1) to it. In Hadoop 1.x this file determines where the SecondaryNameNode is started.
- Edit the hadoop-1.2.1/conf/slaves file and add the IP addresses of all our slave machines to it, one per line.
192.168.0.1
192.168.0.2
192.168.0.3
192.168.0.4
192.168.0.5
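For the hadoop-env.sh step above, the right JAVA_HOME value depends on where Java is installed on each machine, so no fixed path is given here. The following sketch shows one way to derive a candidate from the `java` binary found on the PATH. The `java_home_from_binary` helper is a hypothetical name, and `readlink -f` assumes GNU coreutils.

```shell
#!/bin/sh
# Sketch: derive a JAVA_HOME candidate from the path of the java binary.
# java_home_from_binary is a hypothetical helper used only for illustration.

# java_home_from_binary /path/to/bin/java -> /path/to
java_home_from_binary() {
    home="${1%/bin/java}"
    # Some JDK layouts place the binary under jre/bin; strip that level too.
    home="${home%/jre}"
    printf '%s\n' "$home"
}

if command -v java >/dev/null 2>&1; then
    # readlink -f resolves symlinks such as /usr/bin/java (GNU coreutils).
    java_bin="$(readlink -f "$(command -v java)")"
    # Print the line that would go into hadoop-1.2.1/conf/hadoop-env.sh.
    echo "export JAVA_HOME=$(java_home_from_binary "$java_bin")"
else
    echo "java not found on PATH; set JAVA_HOME manually" >&2
fi
```

Check the printed path before copying the `export` line into hadoop-env.sh, since the detected location may differ from machine to machine.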
We are done with the configurations.
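The steps above do not say on which machines the files under conf are edited. Each Hadoop daemon reads the configuration from its own local installation, so every node needs the same files. Assuming the edits were made on the master and Hadoop is extracted at the same path on every node, a sketch like the following could copy the conf directory to the other machines. The `HADOOP_DIR` location, the `DRY_RUN` switch, and the `run` and `sync_conf` helper names are assumptions, and `rsync` must be installed on all nodes.

```shell
#!/bin/sh
# Sketch: copy the edited conf directory from the master to the other nodes.
# HADOOP_DIR is an assumed install location; DRY_RUN, run and sync_conf are
# hypothetical names used only for this illustration.

HADOOP_DIR="${HADOOP_DIR:-$HOME/hadoop-1.2.1}"
HADOOP_USER="hadoopUser"
OTHER_NODES="192.168.0.2 192.168.0.3 192.168.0.4 192.168.0.5"
DRY_RUN="${DRY_RUN:-1}"

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

sync_conf() {
    for host in $OTHER_NODES; do
        # -a preserves permissions and timestamps; the trailing slashes copy
        # the contents of conf/ into the same directory on the remote node.
        run rsync -a "$HADOOP_DIR/conf/" "$HADOOP_USER@$host:$HADOOP_DIR/conf/"
    done
}

sync_conf
```

With the default `DRY_RUN=1` the script only prints the rsync commands. Set `DRY_RUN=0` to run them once password-less SSH is working.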
Format the NameNode
On the master node, go to the hadoop-1.2.1/bin directory and run the following command to format the NameNode.
./hadoop namenode -format
Start the Cluster
Now that everything is in place, we can start the Hadoop cluster by running the start-all.sh script from the hadoop-1.2.1/bin directory. This script starts the NameNode, SecondaryNameNode and JobTracker on the master node, and the DataNode and TaskTracker on all the slave nodes.
./start-all.sh
Once the script executes successfully, our Hadoop cluster is up and running.
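One way to confirm that the daemons actually started is to list the running Java processes with the JDK's `jps` tool. The sketch below assumes `jps` is on the PATH. The `missing_daemons` helper is a hypothetical name, and the list of five daemons applies to the master in this setup because it is also listed as a slave.

```shell
#!/bin/sh
# Sketch: report which expected Hadoop 1.x daemons are absent from jps output.
# missing_daemons is a hypothetical helper used only for illustration.

# Usage: jps | missing_daemons "NameNode JobTracker ..."
missing_daemons() {
    listing="$(cat)"
    for daemon in $1; do
        # jps prints "<pid> <ClassName>" per line; compare the second field
        # exactly so that NameNode does not match SecondaryNameNode.
        if ! printf '%s\n' "$listing" |
            awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }'; then
            printf '%s\n' "$daemon"
        fi
    done
}

# The master in this setup also acts as a slave, so all five should run there.
MASTER_DAEMONS="NameNode SecondaryNameNode JobTracker DataNode TaskTracker"

if command -v jps >/dev/null 2>&1; then
    missing="$(jps | missing_daemons "$MASTER_DAEMONS")"
    if [ -n "$missing" ]; then
        echo "not running: $missing"
    else
        echo "all expected daemons are running"
    fi
fi
```

On the other slave machines, the same check can be run with only `DataNode TaskTracker` as the expected list.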
The NameNode UI is available at http://192.168.0.1:50070/dfshealth.jsp. You can browse the filesystem from this UI. The URL should be accessible, and the Live Nodes count shown on the page should be 5.
The JobTracker UI is available at http://192.168.0.1:50030/jobtracker.jsp. This UI shows the status and history of submitted MapReduce jobs. The URL should be accessible, and the Nodes count shown on the page should be 5.