Setting up a Single Node Hadoop Cluster
Objective:
The objective of this Hadoop tutorial is to set up a single-node Hadoop cluster with a working NameNode, DataNode, JobTracker and TaskTracker on the same virtual machine.
We will walk through the activities needed to set up this Hadoop cluster on a single virtual machine, step by step.
Audience:
This Hadoop tutorial is for those who wish to set up a Hadoop cluster on their local machine to learn the basics of the Hadoop ecosystem.
Prerequisites:
Basic awareness of Linux commands.
Installation Modes:
There are typically three installation modes for Hadoop. To know more about them, follow the article on installation modes in Hadoop.
Stand-alone Hadoop
Pseudo-distributed Hadoop
Distributed Hadoop
---- STAND-ALONE, PSEUDO-DISTRIBUTED HADOOP INSTALLATION STARTED ----
Installation of CentOS on a Virtual Machine
The Hadoop cluster will reside on a CentOS machine, so we will set up a virtual image of CentOS.
Virtualization Software
You will need to install a virtualization product like VirtualBox. VirtualBox is a product by Oracle and can be downloaded from the following link
https://www.virtualbox.org/wiki/Downloads
CentOS Setup
Once VirtualBox is installed you can download the CentOS binary files from centos.org/download
For this Hadoop setup we can use the minimal version. For this tutorial I am using the following version
CentOS-6.6-i386-minimal.iso
Creating a new Virtual image
Click on New to "Create Virtual Machine"
Choose a name for the virtual machine; for this tutorial I will use masterslave
Choose Type as Linux and
Version as Red Hat
Select the memory size. You should use at least 512 MB.
Select the option "Create a virtual hard drive now"
Select the VMDK option as the hard drive file type so that you can use this image with other virtualization software
Click Next
And then select the size of the hard disk. For this tutorial I will use 8 GB.
Network Settings of VM
Once the image is created, go to Settings,
then go to Network and select Bridged Adapter if you want the virtual machine to reach the internet through the host machine
(You can go through the various network options available on the VirtualBox website)
Installation of CentOS
Start the virtual machine and, when it asks you to select a start-up disk, choose the CentOS ISO setup file
You can skip the media test on the next screen
Select OK and then select the keyboard language (you can change the keyboard language later as well)
Select "Re-initialize all" as you are creating a fresh copy
Select the time zone and create a root password.
Select the default option of replacing the existing Linux system and then, on the next screen, select OK to write the changes to disk
Installation will start
The virtual machine will restart after the installation completes
Configuration of CentOS for Hadoop Installation
Login as Root
Provide "root" without quotes for the login name and then the root password that you created during the install
login as: root
Setting up the IP address for the CentOS machine
When you set up the virtual machine with CentOS, it will not have an IP address assigned to it by default. You may need to do the following in order to set the IP address for the Linux machine
Check the IP address of the CentOS machine by issuing the following command
ifconfig -a
If you don't see any IP address on eth0,
then you can use the following command to bring up the eth0 interface
ifup eth0
Check the IP address of the machine again
[ ~]# ifconfig -a
eth0 Link encap:Ethernet HWaddr 08:00:27:86:76:50
inet addr:192.168.0.21 Bcast:192.168.0.255 Mask:255.255.255.0
inet6 addr: fe80::a00:27ff:fe86:7650/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:449 errors:0 dropped:0 overruns:0 frame:0
TX packets:229 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:50098 (48.9 KiB) TX bytes:34474 (33.6 KiB)
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:65536 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
But if the machine is restarted, the IP address will be gone again.
To prevent this, set ONBOOT=yes in the file /etc/sysconfig/network-scripts/ifcfg-eth0
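The ONBOOT change is a one-line edit. The sketch below runs against a sample copy of the file with illustrative contents; on the VM, apply the same sed to the real /etc/sysconfig/network-scripts/ifcfg-eth0 as root:

```shell
# Work on a sample copy of the interface config (contents are illustrative)
cfg=ifcfg-eth0.sample
printf 'DEVICE=eth0\nBOOTPROTO=dhcp\nONBOOT=no\n' > "$cfg"

# Flip ONBOOT so eth0 comes up automatically at every boot
sed -i 's/^ONBOOT=no$/ONBOOT=yes/' "$cfg"

grep '^ONBOOT=' "$cfg"   # ONBOOT=yes
```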
Hostname Entry in host file
We will need to give a hostname to our Hadoop cluster so that the virtual machine can be identified by name. By default it will have localhost as the hostname. The hostname will be required for the subsequent steps in the setup of the Hadoop cluster.
Go through the following article to learn how to set up the hostname for the virtual machine created in the previous steps.
https://mainframewizard.com/content/assigning-hostname-linux-system
Connecting to the Linux VM from Windows using PuTTY
The virtual image of CentOS created using VirtualBox may not be easy to work with directly. I prefer to use a client like PuTTY to connect to the virtual machine.
You can download PuTTY from the following link
http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html
Then provide the IP address of the Linux machine (checked earlier with ifconfig -a) and port 22
Installation of Java on the CentOS Machine
You will need Java on the virtual machine you have set up in order to install the Hadoop ecosystem.
You can use the following command to check if JAVA is installed and the path of installation
[ ~]# which java
/usr/bin/java
You can use the following command to install Java
yum install java
Try to execute jps to see if the JVM Process Status tool runs fine.
[ ~]# jps
1560 Jps
But if you are unable to get jps running, then you may need to look for the correct version of Java.
In my case the following version worked
yum install java-1.7.0-openjdk-devel
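Later, hadoop-env.sh will need a JAVA_HOME value, and `which java` usually points at a symlink rather than the real installation. The sketch below resolves the real path behind a command on the PATH; `sh` stands in for `java` here only so the sketch runs anywhere (on the VM, use `java`):

```shell
# `command -v` finds the entry on the PATH; readlink -f follows any
# symlinks to the real binary. On the CentOS VM, replace `sh` with `java`
# and derive JAVA_HOME from the enclosing tree (e.g. /usr for /usr/bin/java).
real=$(readlink -f "$(command -v sh)")
echo "$real"
```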
Add a user for the Hadoop setup
Issue the useradd command to add a new user
[ bin]# useradd hadoop
Issue the passwd command to set the password for the hadoop user
[ ~]# passwd hadoop
Provide the password when prompted
DOWNLOAD HADOOP BINARIES
Where to get the binaries for the Hadoop setup?
If you search for "hadoop download" you should get the following link
http://hadoop.apache.org/releases.html
Now follow the download links and they should take you to the latest releases page
http://www.apache.org/dyn/closer.cgi/hadoop/common/
But in case you have to install a previous release which does not appear there, you may want to look into the archives
http://archive.apache.org/dist/hadoop/core/
For this demo I will use Hadoop 1.x, so I will take the latest stable release of the 1.x series, which is 1.2.1
I will download the tar.gz file, transfer it to the virtual machine, and untar it on the CentOS VM.
I am using the hadoop-1.2.1-bin.tar.gz file for this demo
COPY AND UNTAR (EXTRACT) THE HADOOP BINARIES TO THE CENTOS VIRTUAL MACHINE
Once you have downloaded the Hadoop installation files from the Apache website, you will need to put them on the CentOS virtual machine on which you want to set up the Hadoop cluster
For this you may need to copy the file using WinSCP or some other client.
Once we have the Hadoop installation file on the VM, we will untar it using the following command
tar -xzvf /root/hadoop-1.2.1-bin.tar.gz -C /home/hadoop/
The above command will extract the contents into a folder named hadoop-1.2.1 under /home/hadoop
We can assign a shortcut to this hadoop-1.2.1 folder so that we can access it easily.
Using the following commands you can create a pointer to the folder hadoop-1.2.1 named hadoop
cd /home/hadoop
ln -s hadoop-1.2.1/ hadoop
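The -C flag in the tar command above controls where the archive is extracted, and the symlink gives a version-independent path. A minimal round-trip on a dummy archive (all paths under /tmp are illustrative, not part of the real setup):

```shell
# Build a dummy archive mimicking hadoop-1.2.1-bin.tar.gz (illustrative)
mkdir -p /tmp/demo-src/hadoop-1.2.1
touch /tmp/demo-src/hadoop-1.2.1/README.txt
tar -czf /tmp/hadoop-demo.tar.gz -C /tmp/demo-src hadoop-1.2.1

# Extract into a target directory, as done for /home/hadoop above
mkdir -p /tmp/demo-home
tar -xzf /tmp/hadoop-demo.tar.gz -C /tmp/demo-home

# A versioned folder appears under the target, plus a stable symlink to it
ln -sfn hadoop-1.2.1 /tmp/demo-home/hadoop
ls /tmp/demo-home/hadoop/   # README.txt
```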
Granting permissions on the hadoop folder
We need to grant the hadoop user access to the folder where Hadoop has been installed.
Execute the following command as the root user.
chown -R hadoop:hadoop /home/hadoop/hadoop-1.2.1/
Changing the Configuration Files
1. HADOOP-ENV.SH (/home/hadoop/hadoop/conf/hadoop-env.sh)
Change JAVA_HOME to point to the path where Java is installed. You can issue which java to find this path.
export JAVA_HOME=/usr
Change hadoop-env.sh to include the path of the Hadoop installation
At the end of the file add a HADOOP_HOME variable with the path where Hadoop is installed, i.e. the following addition at the end of the file.
export HADOOP_HOME=/home/hadoop/hadoop
Change the heap size to a suitable value; for my demo I am changing it to 512 MB
export HADOOP_HEAPSIZE=512
If even after making the above changes you are not able to execute the hadoop command directly without giving the full path like /home/hadoop/hadoop/bin/hadoop,
then try to add the following to the bash profile of the hadoop user (vi ~/.bash_profile)
PATH=$PATH:$HOME/bin
PATH=/home/hadoop/hadoop/bin:$PATH
export PATH
------ STAND-ALONE HADOOP INSTALLATION COMPLETED ------
2. CORE-SITE.XML(/home/hadoop/hadoop/conf/core-site.xml)
Add the following highlighted text between the configuration tags
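The highlighted snippet did not survive the formatting of this page. A typical Hadoop 1.x core-site.xml entry, assuming the master.node.com hostname used throughout this tutorial and the conventional NameNode port 9000, would be:

```xml
<!-- Illustrative: fs.default.name tells clients and daemons where the
     NameNode runs. The hostname and port here are assumptions based on
     the rest of this tutorial. -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://master.node.com:9000</value>
</property>
```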
3. HDFS-SITE.XML(/home/hadoop/hadoop/conf/hdfs-site.xml)
Add the following highlighted text between the configuration tags
dfs.name.dir is the location where the NameNode metadata is stored
dfs.data.dir is the location where the DataNode data is stored
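A sketch of the corresponding hdfs-site.xml entries, assuming the /data/namenode and /data/datanode directories created in the "Creating some Directories" step of this tutorial, plus a replication factor of 1 (a single-node cluster has nowhere else to replicate):

```xml
<!-- Illustrative values; the directory paths come from the mkdir step
     later in this tutorial, and replication 1 suits a single node. -->
<property>
  <name>dfs.name.dir</name>
  <value>/data/namenode</value>
</property>
<property>
  <name>dfs.data.dir</name>
  <value>/data/datanode</value>
</property>
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
```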
4. MAPRED-SITE.XML(/home/hadoop/hadoop/conf/mapred-site.xml)
Add the following values to add the job tracker.
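The JobTracker entry for mapred-site.xml might look like the following, again assuming the master.node.com hostname and the conventional JobTracker port 9001:

```xml
<!-- Illustrative: mapred.job.tracker points TaskTrackers and job clients
     at the JobTracker. Host and port are assumptions. -->
<property>
  <name>mapred.job.tracker</name>
  <value>master.node.com:9001</value>
</property>
```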
5. MASTERS (/home/hadoop/hadoop/conf/masters)
Since this is a single-node cluster, the NameNode and DataNode are on the same server, so the master for this node will be itself; add the hostname of this machine to the masters file
master.node.com
6. SLAVES (/home/hadoop/hadoop/conf/slaves)
Since this is a single-node cluster, the NameNode and DataNode are on the same server, so the slave for this node will be itself; add the hostname of this machine to the slaves file
master.node.com
SOME MORE CONFIGURATION CHANGES
The NameNode, JobTracker etc. can be viewed in a browser once they are started. The following configuration changes, made beforehand, will help when you verify these services from a web browser.
On the VM, ensure that the ports are not blocked from communicating with your host machine
You can issue the following command to stop the firewall and unblock the ports (as root; this does not persist across a reboot)
[ ~]# service iptables stop
On the Windows host machine, ensure that the hostname of the NameNode (the hostname of the CentOS virtual machine) is added to the hosts file
C:\Windows\System32\drivers\etc\hosts
At the end of the file enter the following line (where xxx.xxx.x.xx is the IP address of the virtual machine)
xxx.xxx.x.xx master.node.com
Creating some Directories
Ensure the following directories are created using mkdir commands; the user for the Hadoop installation must have access to these directories.
[ /]# mkdir /data
[ /]# mkdir /data/namenode/
[ /]# mkdir /data/datanode/
Use the following commands to grant the permissions to the hadoop user
[ /]# chown -R hadoop:hadoop /data/namenode/
[ /]# chown -R hadoop:hadoop /data/datanode/
Format the File System
Once the above settings are done properly, we will format the file system.
The following command is used to format HDFS
[ ~]$ hadoop namenode -format
You should see a message like the one below confirming that the format ran fine
15/02/14 07:54:56 INFO common.Storage: Storage directory /data/namenode has been successfully formatted.
15/02/14 07:54:56 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at master.cluster.com/192.168.0.21
************************************************************/
STARTING OPTIONS
Now the setup is complete and we can start the Hadoop cluster. There are two options: either you can start all the services in one go (the start-all.sh script in the bin folder), or each service one by one.
Since this is the first time we are setting up this cluster, it's better to start the services one by one, verifying each service as we start it.
Starting the NameNode & Verification of the NameNode
Issue the following command to start only the NameNode
[ ~]$ hadoop-daemon.sh start namenode
Issue a jps to see if the NameNode has come up
[ ~]$ jps
2527 NameNode
2589 Jps
Start a web browser session and open the following link
http://master.node.com:50070/dfshealth.jsp
Starting the DataNode & Verification of the DataNode
Issue the following command to start only the DataNode
[ ~]$ hadoop-daemon.sh start datanode
Issue a jps to see if the DataNode has come up
[ ~]$ jps
2527 NameNode
2714 Jps
2651 DataNode
Start a web browser session and open the following link
http://master.node.com:50070/dfshealth.jsp
Starting the JobTracker, TaskTracker and SecondaryNameNode and Verification
Issue the following commands to start the JobTracker, TaskTracker and Secondary NameNode respectively
[ ~]$ hadoop-daemon.sh start jobtracker
[ ~]$ hadoop-daemon.sh start tasktracker
[ ~]$ hadoop-daemon.sh start secondarynamenode
Issue a JPS to see if the above services came up
[ ~]$ jps
3097 Jps
2527 NameNode
2748 JobTracker
3026 SecondaryNameNode
2651 DataNode
2862 TaskTracker
[ ~]$
Go to the following link in a browser to see the JobTracker
http://master.node.com:50030/jobtracker.jsp
---- PSEUDO-DISTRIBUTED HADOOP INSTALLATION COMPLETED ----