Setting up a Single Node Hadoop Cluster

Objective:

The objective of this Hadoop tutorial is to set up a single-node Hadoop cluster with a working NameNode, DataNode, JobTracker and TaskTracker on the same virtual machine.

We will go through each step required to set up this Hadoop cluster on a single virtual machine.

Audience:

This Hadoop tutorial is for those who wish to set up a Hadoop cluster on their local machine to learn the basics of the Hadoop ecosystem.

Prerequisites:

Basic familiarity with Linux commands.

Installation Modes:

There are typically three installation modes for Hadoop. To know more about them, follow the article on installation modes in Hadoop.

Stand Alone Hadoop

Pseudo Distributed Hadoop

Distributed Hadoop

----STAND ALONE, PSEUDO DISTRIBUTED HADOOP INSTALLATION STARTED ----

 

 

Installation of CentOS on a Virtual Machine

The Hadoop cluster will reside on a CentOS machine, and for this we will set up a virtual image of CentOS.

Virtualization Software

You will need to install a virtualization product such as VirtualBox. VirtualBox is a product by Oracle and can be downloaded from the following link

https://www.virtualbox.org/wiki/Downloads

CentOS Setup

Once VirtualBox is installed, you can download the CentOS ISO files from centos.org/download

For this Hadoop setup the minimal version is sufficient. For this tutorial I am using the following version

CentOS-6.6-i386-minimal.iso

Creating a new virtual image

Click on New to "Create Virtual Machine"

Choose a name for the virtual machine; for this tutorial I will use masterslave

Choose Type as Linux and Version as Red Hat

Select the memory size. You should use at least 512 MB.

Select the option "Create a Virtual hard drive now"

Select the VMDK option as the hard drive file type so that you can use this image with other virtualization software

Click Next

Then select the size of the hard disk. For this tutorial I will use 8 GB.

Network Settings of the VM

Once the image is created, go to Settings

Then go to Network and select Bridged Adapter if you want the virtual machine to be connected to the internet through the host machine

(You can read about the various network options on the VirtualBox website)

Installation of CentOS

Start the virtual machine and, when it asks you to select a start-up disk, choose the CentOS ISO file

You can skip the media test on the next screen

Select OK and then select the keyboard language (you can change the keyboard language later as well)

Select "Re-initialize all", as you are creating a fresh copy

Select the time zone and create a root password.

Select the default option of replacing the existing Linux system and, on the next screen, confirm writing the changes to disk

The installation will start

The virtual machine will restart after installation

Configuration of CentOS for Hadoop Installation

Login as root

Enter "root" (without quotes) as the login name and then the password for root that you created during the install

login as: root                                                          

Setting up the IP address for the CentOS machine

When you set up the virtual machine with CentOS, it will not have an IP address assigned by default. You may need to do the following to assign one.

First check the IP address of the CentOS machine with the following command

ifconfig -a

 

If you don't see an IP address (an inet addr line) under eth0, you can use the following command to bring up the eth0 interface

ifup eth0

Check the IP address of the machine again

[root@master ~]# ifconfig -a                                             

eth0      Link encap:Ethernet  HWaddr 08:00:27:86:76:50                  

          inet addr:192.168.0.21  Bcast:192.168.0.255  Mask:255.255.255.0

          inet6 addr: fe80::a00:27ff:fe86:7650/64 Scope:Link             

          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1             

          RX packets:449 errors:0 dropped:0 overruns:0 frame:0           

          TX packets:229 errors:0 dropped:0 overruns:0 carrier:0         

          collisions:0 txqueuelen:1000                                   

          RX bytes:50098 (48.9 KiB)  TX bytes:34474 (33.6 KiB)           

                                                                         

lo        Link encap:Local Loopback                                      

          inet addr:127.0.0.1  Mask:255.0.0.0                            

          inet6 addr: ::1/128 Scope:Host                                 

          UP LOOPBACK RUNNING  MTU:65536  Metric:1                       

          RX packets:0 errors:0 dropped:0 overruns:0 frame:0             

          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0           

          collisions:0 txqueuelen:0                                      

          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)                         

 

However, if the machine is restarted, the interface will not come up automatically and the IP address will be gone again.

To prevent this, set ONBOOT=yes in the file /etc/sysconfig/network-scripts/ifcfg-eth0
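
The ONBOOT change can also be made from the command line instead of an editor. The following is a minimal sketch, not part of the original setup: set_onboot_yes is a hypothetical helper name, and the sketch assumes the file uses plain KEY=value lines. It keeps a backup copy next to the file with a .bak suffix.

```shell
# Set ONBOOT=yes in an interface config file, keeping a .bak backup.
# set_onboot_yes is a hypothetical helper, not a CentOS command.
set_onboot_yes() {
    cfg="$1"
    cp "$cfg" "$cfg.bak" || return 1
    if grep -q '^ONBOOT=' "$cfg.bak"; then
        # Rewrite the existing ONBOOT line.
        sed 's/^ONBOOT=.*/ONBOOT=yes/' "$cfg.bak" > "$cfg"
    else
        # No ONBOOT line yet, so append one.
        echo 'ONBOOT=yes' >> "$cfg"
    fi
}

# Usage (as root):
#   set_onboot_yes /etc/sysconfig/network-scripts/ifcfg-eth0
```

After the change, grep ONBOOT on the file should show ONBOOT=yes.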

Hostname Entry in hosts file

We need to give a hostname to our Hadoop cluster so that the virtual machine can be identified by name. By default it will have localhost as the hostname. The hostname will be required for the subsequent steps in the setup of the Hadoop cluster.

Go through the following article to learn how to set up the hostname for the virtual machine created in the previous steps.

http://mainframewizard.com/content/assigning-hostname-linux-system
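
As a rough companion to the linked article, the sketch below maps the hostname to the VM's IP address in /etc/hosts. It is an assumption-laden example rather than the article's method: add_host_entry is a hypothetical helper, and the address 192.168.0.21 and name master.node.com are simply the values that appear elsewhere in this tutorial. On CentOS 6 the hostname itself is usually made persistent through the HOSTNAME line in /etc/sysconfig/network, which this sketch does not touch.

```shell
# Append "IP NAME" to a hosts file unless NAME is already mapped there.
# add_host_entry is a hypothetical helper, not a system command.
add_host_entry() {
    ip="$1"
    name="$2"
    hosts_file="${3:-/etc/hosts}"
    # Look for NAME as a whole field on any non-comment line.
    if awk -v n="$name" '
            /^[ \t]*#/ { next }
            { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
            END { exit !found }' "$hosts_file"; then
        return 0
    fi
    printf '%s %s\n' "$ip" "$name" >> "$hosts_file"
}

# Usage (as root), with the address and name used in this tutorial:
#   add_host_entry 192.168.0.21 master.node.com
```

Running the helper twice leaves a single entry, so it is safe to repeat.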

Connecting to the Linux VM from Windows using PuTTY

The console of the CentOS virtual machine in VirtualBox may not be easy to work with. I prefer to use a client like PuTTY to connect to the virtual machine.

You can download PuTTY from the following link

http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html

Then provide the IP address of the Linux machine (shown by ifconfig -a, as described earlier) and 22 as the port

Installation of Java on the CentOS machine

You will need Java on the virtual machine you have set up in order to install the Hadoop ecosystem.

You can use the following command to check whether Java is installed and where

[root@master ~]# which java

/usr/bin/java

You can use the following command to install Java

yum install java

Then try to run jps to check that the JVM process status tool works.

[root@master ~]# jps

1560 Jps

If you are unable to get jps running, look for a Java package that includes it; jps is a JDK tool, so a JRE-only package will not provide it.

In my case the following package worked

yum install java-1.7.0-openjdk-devel

 

Add a user for Hadoop setup

Issue the useradd command to add a new user

[root@master bin]# useradd hadoop

Issue the passwd command to set the password for user hadoop

[root@master ~]# passwd hadoop

 

Provide the password when prompted

DOWNLOAD HADOOP BINARIES                                                      

Where to get the binaries for Hadoop setup?

If you search the web for hadoop download, you should get the following link

http://hadoop.apache.org/releases.html

Follow the download links there and they should take you to the latest releases page

http://www.apache.org/dyn/closer.cgi/hadoop/common/

In case you have to install a previous release which does not appear there, you may want to look in the archives

http://archive.apache.org/dist/hadoop/core/

For this demo I will use Hadoop 1.x, so I will take the latest stable release of the 1.x series, which is 1.2.1

I will download the tar.gz file, transfer it to the virtual machine and untar it on the CentOS VM.

I am using the hadoop-1.2.1-bin.tar.gz file for this demo

COPY and UNTAR (EXTRACT) THE HADOOP BINARIES TO THE CENTOS Virtual Machine

Once you have downloaded the Hadoop installation file from the Apache website, you will need to put it on the CentOS virtual machine on which you want to set up the Hadoop cluster

For this you can copy the file using WinSCP or some other client.

Once we have the Hadoop installation file on the VM, we will untar it using the following command

tar -xzvf /root/hadoop-1.2.1-bin.tar.gz -C /home/hadoop/

The above command will extract the contents into a folder named hadoop-1.2.1 under /home/hadoop

We can create a shortcut to this hadoop-1.2.1 folder so that it is easier to access.

Run from /home/hadoop, the following command creates a symbolic link named hadoop that points to the folder hadoop-1.2.1

ln -s hadoop-1.2.1/ hadoop

Granting permission on the hadoop folder

We need to give the hadoop user ownership of the folder where Hadoop has been installed.

Execute the following command as the root user.

chown -R hadoop:hadoop /home/hadoop/hadoop-1.2.1/

 

Changing the Configuration Files

1.       hadoop-env.sh (/home/hadoop/hadoop/conf/hadoop-env.sh)

Change JAVA_HOME to point to the directory under which Java is installed. You can issue which java to find the path of the java binary; JAVA_HOME is the directory that contains bin/java, so /usr/bin/java gives /usr.

export JAVA_HOME=/usr
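
On many systems /usr/bin/java is only a symbolic link to the real JVM directory. The sketch below follows such links and prints the directory that contains bin/java, which could be used as JAVA_HOME instead of /usr. It is an optional alternative, not the setting used in this tutorial; java_home_for is a hypothetical helper name, and readlink -f is assumed to be available (it is part of GNU coreutils).

```shell
# Print the directory that contains bin/java for a given java path,
# following symlinks such as /usr/bin/java -> /etc/alternatives/java -> ...
# java_home_for is a hypothetical helper name.
java_home_for() {
    real=$(readlink -f "$1") || return 1
    # Strip the trailing /bin/java from the resolved path.
    dirname "$(dirname "$real")"
}

# Usage:
#   java_home_for "$(which java)"
```

Either value should work for Hadoop as long as $JAVA_HOME/bin/java exists.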

Also change hadoop-env.sh to include the path of the Hadoop installation.

At the end of the file, add the HADOOP_HOME variable with the path where Hadoop is installed:

export HADOOP_HOME=/home/hadoop/hadoop

Change the heap size to a suitable value (in MB); for my demo I am changing it to 512

export HADOOP_HEAPSIZE=512

 

If, even after making the above changes, you cannot run the hadoop command without giving its full path (/home/hadoop/hadoop/bin/hadoop), add the Hadoop bin directory to PATH in the bash profile of the hadoop user (vi ~/.bash_profile); the second line below is the addition

PATH=$PATH:$HOME/bin

PATH=/home/hadoop/hadoop/bin:$PATH

export PATH

------           STAND ALONE HADOOP INSTALLATION COMPLETED  ------

 

2.       core-site.xml (/home/hadoop/hadoop/conf/core-site.xml)

Add the following property between the configuration tags

<configuration>

<property>

 <name>fs.default.name</name>

 <value>hdfs://master.node.com:8020</value>

</property>

</configuration>

 

3.       hdfs-site.xml (/home/hadoop/hadoop/conf/hdfs-site.xml)

Add the following properties between the configuration tags

dfs.name.dir is the location where the NameNode metadata is stored

dfs.data.dir is the location where the DataNode data is stored

 

<configuration>

<property>

 <name>dfs.name.dir</name>

 <value>/data/namenode</value>

</property>

 

<property>

 <name>dfs.data.dir</name>

 <value>/data/datanode</value>

</property>

 

<property>

 <name>dfs.replication</name>

 <value>1</value>

</property>

</configuration>

 

4.       mapred-site.xml (/home/hadoop/hadoop/conf/mapred-site.xml)

Add the following property to set the JobTracker address. Use a port other than 50030, which is the default port of the JobTracker web interface; 8021 is used here.

 

<configuration>

<property>

 <name>mapred.job.tracker</name>

 <value>master.node.com:8021</value>

</property>

</configuration>

5.       masters (/home/hadoop/hadoop/conf/masters)

Since this is a single-node cluster, all the daemons run on the same server, so add the hostname of this machine to masters (in Hadoop 1.x this file names the host that runs the SecondaryNameNode)

 

master.node.com

 

6.       slaves (/home/hadoop/hadoop/conf/slaves)

Since this is a single-node cluster, the NameNode and DataNode are on the same server, so this machine is also its own slave; add the hostname of this machine to slaves

 

master.node.com

 

 

SOME MORE CONFIGURATION CHANGES

 

The NameNode, JobTracker, etc. can be viewed in a browser once they are started. Making the following configuration changes beforehand will help when you verify these services from a web browser.

 

 

On the VM, ensure that the firewall does not block the ports used to communicate with your host machine

You can issue the following command to stop the firewall

 

[root@master ~]# service iptables stop

 

On the Windows host machine, add the hostname of the NameNode (the hostname of the CentOS virtual machine) to the hosts file

 

C:\Windows\System32\drivers\etc\hosts

 

At the end of the file enter the following (where xxx.xxx.x.xx is the ip address for the virtual machine)

 

xxx.xxx.x.xx master.node.com

 

 

Creating some Directories

 

Ensure these directories are created using mkdir and that the hadoop user has access to them.

[root@master /]# mkdir /data

[root@master /]# mkdir /data/namenode/

[root@master /]# mkdir /data/datanode/

Use the following commands to grant the permissions to the hadoop user

[root@master /]# chown -R hadoop:hadoop /data/namenode/

[root@master /]# chown -R hadoop:hadoop /data/datanode/
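
The directory creation above can be condensed with mkdir -p, which also creates the parent directory. This is a small sketch under stated assumptions: make_hdfs_dirs is a hypothetical helper name, and the chmod 755 on the DataNode directory is an assumption based on the default value of dfs.datanode.data.dir.perm, not a step from the original setup.

```shell
# Create the NameNode and DataNode directories under a base directory.
# make_hdfs_dirs is a hypothetical helper; the base defaults to /data.
make_hdfs_dirs() {
    base="${1:-/data}"
    mkdir -p "$base/namenode" "$base/datanode" || return 1
    # Assumption: keep the DataNode directory at mode 755.
    chmod 755 "$base/datanode"
}

# Usage (as root), followed by the same chown commands as above:
#   make_hdfs_dirs /data
#   chown -R hadoop:hadoop /data/namenode/ /data/datanode/
```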

 

Format the File System

Once the above settings are in place, we will format the file system.

The following command is used to format HDFS

[hadoop@master ~]$ hadoop namenode -format

You should see messages like the ones below, confirming that the format has run fine

15/02/14 07:54:56 INFO common.Storage: Storage directory /data/namenode has been successfully formatted.

15/02/14 07:54:56 INFO namenode.NameNode: SHUTDOWN_MSG:

/************************************************************

SHUTDOWN_MSG: Shutting down NameNode at master.cluster.com/192.168.0.21

************************************************************/
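
Besides reading the log message, the result of the format can be checked on disk. The sketch below assumes that a formatted Hadoop 1.x NameNode directory contains a current/VERSION file; namenode_is_formatted is a hypothetical helper name, and the default path is the dfs.name.dir value used in this tutorial.

```shell
# Succeed if the NameNode directory looks formatted.
# namenode_is_formatted is a hypothetical helper; the check for
# current/VERSION is an assumption about the Hadoop 1.x layout.
namenode_is_formatted() {
    name_dir="${1:-/data/namenode}"
    [ -f "$name_dir/current/VERSION" ]
}

# Usage:
#   namenode_is_formatted /data/namenode && echo "formatted"
```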

 

STARTING OPTIONS

Now the setup is complete and we can start the Hadoop cluster. There are two options: either start all the services in one go, or start each service one by one.

 

Since this is the first time we are setting up this cluster, it is better to start the services one by one and verify each service as we start it.
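
Once each service has been verified individually, the same hadoop-daemon.sh commands used in the sections that follow can be run from a small loop. This is a minimal sketch and not one of Hadoop's own scripts: start_daemons is a hypothetical helper, and HADOOP_DAEMON_CMD is a hypothetical variable (not a Hadoop setting) that allows a dry run.

```shell
# Start the Hadoop 1.x daemons in the order used in this tutorial.
# HADOOP_DAEMON_CMD is a hypothetical override, e.g. echo for a dry run.
start_daemons() {
    cmd="${HADOOP_DAEMON_CMD:-hadoop-daemon.sh}"
    for daemon in namenode datanode jobtracker tasktracker secondarynamenode; do
        # A zero exit status only means the launch was issued;
        # confirm with jps that the daemon is actually running.
        "$cmd" start "$daemon" || { echo "failed to start $daemon" >&2; return 1; }
    done
}

# Usage (as the hadoop user):
#   start_daemons
# Dry run that only prints the arguments:
#   HADOOP_DAEMON_CMD=echo start_daemons
```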

 

Starting the NameNode & Verification of the NameNode

 

Issue the following command to start only the NameNode

 

[hadoop@master ~]$ hadoop-daemon.sh start namenode

 

Run jps to check whether the NameNode has come up

 

[hadoop@master ~]$ jps

2527 NameNode

2589 Jps

 

Start a web browser session and open the following link

 

                http://master.node.com:50070/dfshealth.jsp

Starting the DataNode & Verification of the DataNode

 

Issue the following command to start only the DataNode

 

[hadoop@master ~]$ hadoop-daemon.sh start datanode

 

Run jps to check whether the DataNode has come up

 

[hadoop@master ~]$ jps

2527 NameNode

2714 Jps

2651 DataNode

 

Start a web browser session and open the following link

                http://master.node.com:50070/dfshealth.jsp

Starting JobTracker, TaskTracker and SecondaryNameNode and verification

Issue the following commands to start the JobTracker, TaskTracker and SecondaryNameNode respectively

 

[hadoop@master ~]$ hadoop-daemon.sh start jobtracker

[hadoop@master ~]$ hadoop-daemon.sh start tasktracker

[hadoop@master ~]$ hadoop-daemon.sh start secondarynamenode

 

Run jps to check whether the above services came up

 

 

[hadoop@master ~]$ jps

3097 Jps

2527 NameNode

2748 JobTracker

3026 SecondaryNameNode

2651 DataNode

2862 TaskTracker

[hadoop@master ~]$
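
Reading the jps listing by eye works for five daemons, but the check can also be automated. The sketch below reads jps output on standard input and prints the name of every expected daemon that is missing; missing_daemons is a hypothetical helper name, and the list of expected names is taken from the jps output shown above.

```shell
# Read jps output on stdin and print each expected daemon that is absent.
# missing_daemons is a hypothetical helper name.
missing_daemons() {
    jps_out=$(cat)
    for name in NameNode DataNode JobTracker TaskTracker SecondaryNameNode; do
        # Compare the second column exactly, so that SecondaryNameNode
        # is not mistaken for NameNode.
        printf '%s\n' "$jps_out" \
            | awk -v n="$name" '$2 == n { found = 1 } END { exit !found }' \
            || echo "$name"
    done
}

# Usage:
#   jps | missing_daemons
```

No output means that all five daemons appear in the listing.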

 

Go to the following link in a browser to see the JobTracker

 

                http://master.node.com:50030/jobtracker.jsp

----  PSEUDO DISTRIBUTED HADOOP INSTALLATION COMPLETED----