Hadoop Installation Modes
The Apache Hadoop framework is a big buzz in the IT world. It provides a solution to the challenges that big data is posing to the digital world. The framework allows analysis of large datasets distributed across clusters of computers, using a simple programming model.
To start learning about Hadoop you will want to set up a Hadoop environment. There are basically three modes in which a Hadoop cluster can be installed:
Hadoop stand-alone mode
Hadoop pseudo-distributed mode
Hadoop distributed mode
This book will guide you through the practical aspects of Hadoop.
Hadoop Stand-Alone Mode:-
To understand the basics of Hadoop and to use it as a playground for exercises, stand-alone mode is sufficient. In this mode you install only the bare minimum components on a single system.
The following high-level steps are required for a stand-alone Hadoop setup (a command sketch follows the list):
1. Set up a virtual machine with any Linux environment (CentOS or Ubuntu).
2. Install Java on the virtual machine.
3. Create a hadoop user.
4. Download the Hadoop installation files and extract them on your virtual machine.
5. Grant permissions to the hadoop user on the folder where Hadoop is extracted.
6. Edit the /home/hadoop/hadoop/conf/hadoop-env.sh file to set the HADOOP_HOME and JAVA_HOME variables.
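The commands below are a minimal sketch of steps 3 to 6, assuming a Hadoop 1.x release named hadoop-1.2.1.tar.gz and the /home/hadoop/hadoop path used in this article; adjust the version, paths, and JDK location to your environment.

# Step 3: create a dedicated hadoop user (run as root)
$ useradd hadoop
$ passwd hadoop

# Step 4: extract the downloaded tarball to /home/hadoop/hadoop
# (hadoop-1.2.1.tar.gz is an assumed file name; use the version you downloaded)
$ tar -xzf hadoop-1.2.1.tar.gz -C /home/hadoop
$ mv /home/hadoop/hadoop-1.2.1 /home/hadoop/hadoop

# Step 5: grant the hadoop user ownership of the extracted folder
$ chown -R hadoop:hadoop /home/hadoop/hadoop

# Step 6: set the environment variables in conf/hadoop-env.sh
# (the JAVA_HOME path below is an assumption; point it at your JDK)
$ echo 'export JAVA_HOME=/usr/lib/jvm/java-7-openjdk' >> /home/hadoop/hadoop/conf/hadoop-env.sh
$ echo 'export HADOOP_HOME=/home/hadoop/hadoop' >> /home/hadoop/hadoop/conf/hadoop-env.sh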
For details on these steps, refer to the article on setting up a Hadoop cluster on a single node and follow the steps between the following tags on that page:
---- STAND ALONE, PSEUDO DISTRIBUTED HADOOP INSTALLATION STARTED ----
---- STAND ALONE HADOOP INSTALLATION COMPLETED ----
After completing the above steps you should have a stand-alone Hadoop installation. You can verify it by executing the hadoop command from the Hadoop home location, where Hadoop is installed, and you should see the following output:
[hadoop@nn1 ~]$ hadoop
Usage: hadoop [--config confdir] COMMAND
where COMMAND is one of:
  namenode -format     format the DFS filesystem
  secondarynamenode    run the DFS secondary namenode
  namenode             run the DFS namenode
  datanode             run a DFS datanode
  dfsadmin             run a DFS admin client
  mradmin              run a Map-Reduce admin client
  fsck                 run a DFS filesystem checking utility
  fs                   run a generic filesystem user client
  balancer             run a cluster balancing utility
  fetchdt              fetch a delegation token from the NameNode
  jobtracker           run the MapReduce job Tracker node
  pipes                run a Pipes job
  tasktracker          run a MapReduce task Tracker node
  historyserver        run job history servers as a standalone daemon
  job                  manipulate MapReduce jobs
  queue                get information regarding JobQueues
  version              print the version
  jar <jar>            run a jar file
  distcp <srcurl> <desturl> copy file or directories recursively
  archive -archiveName NAME -p <parent path> <src>* <dest> create a hadoop archive
  classpath            prints the class path needed to get the
                       Hadoop jar and the required libraries
  daemonlog            get/set the log level for each daemon
 or
  CLASSNAME            run the class named CLASSNAME
Most commands print help when invoked w/o parameters.
[hadoop@nn1 ~]$
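As a quick smoke test, you can run one of the example jobs bundled with the release; in stand-alone mode it reads from and writes to the local filesystem, no daemons required. The jar name below assumes a 1.2.1 release; match it to the examples jar shipped with your version.

$ cd /home/hadoop/hadoop
# use the bundled config files as sample input
$ mkdir input
$ cp conf/*.xml input
# run the grep example: find occurrences of the given pattern in the input
$ bin/hadoop jar hadoop-examples-1.2.1.jar grep input output 'dfs[a-z.]+'
# the matches are written to the local output directory
$ cat output/*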
Hadoop Pseudo-Distributed Mode:-
As the name suggests, pseudo-distributed mode is not in reality a distributed Hadoop installation; it simulates one. Distributed mode requires that you set up Hadoop on a multi-node cluster (minimum two nodes), but with pseudo-distributed mode you can get a feel of a distributed Hadoop environment on a single node, because each Hadoop daemon runs as a separate process on the same machine.
In addition to the six steps listed above, you will need to do the following:
7. Change the following configuration files; for details on the changes, refer to the article (a sketch of typical settings follows the list):
core-site.xml (/home/hadoop/hadoop/conf/core-site.xml)
hdfs-site.xml (/home/hadoop/hadoop/conf/hdfs-site.xml)
mapred-site.xml (/home/hadoop/hadoop/conf/mapred-site.xml)
masters (/home/hadoop/hadoop/conf/masters) - not mandatory
slaves (/home/hadoop/hadoop/conf/slaves) - not mandatory
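The snippet below is a minimal sketch of typical pseudo-distributed settings for Hadoop 1.x. The property values (localhost addresses, ports 9000 and 9001, replication factor 1) are common conventions, not values taken from the referenced article; consult it for the exact changes.

# core-site.xml: point the default filesystem at a local HDFS instance
$ cat > /home/hadoop/hadoop/conf/core-site.xml <<'EOF'
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
EOF

# hdfs-site.xml: single node, so keep only one replica of each block
$ cat > /home/hadoop/hadoop/conf/hdfs-site.xml <<'EOF'
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF

# mapred-site.xml: run the JobTracker locally
$ cat > /home/hadoop/hadoop/conf/mapred-site.xml <<'EOF'
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
EOF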
If you follow all the steps between the following tags, shown in the article for a single-node Hadoop cluster, you will get a fully functional pseudo-distributed Hadoop installation:
---- STAND ALONE, PSEUDO DISTRIBUTED HADOOP INSTALLATION STARTED ----
---- PSEUDO DISTRIBUTED HADOOP INSTALLATION COMPLETED ----
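Once the configuration files are in place, a typical first run looks like the following sketch; it assumes the Hadoop 1.x scripts under bin/ and passwordless SSH to localhost.

$ cd /home/hadoop/hadoop
# one-time formatting of the HDFS filesystem
$ bin/hadoop namenode -format
# start all five daemons: NameNode, SecondaryNameNode, DataNode, JobTracker, TaskTracker
$ bin/start-all.sh
# list the running Java processes to verify that the daemons came up
$ jps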
Hadoop Distributed Mode:-
A Hadoop installation in a production environment is actually a distributed installation with one name node and several data nodes. For setting up a distributed Hadoop cluster you need at least two machines: one acting as both name node and data node, and the other acting as a data node.
To start with, you will need to set up two machines in pseudo-distributed mode as explained above. The following are the high-level steps required for a distributed Hadoop installation.
Generate an SSH key for passwordless logon on the master node (the name node machine) using the following command:
$ ssh-keygen -t dsa -P "" -f ~/.ssh/id_dsa
Append the generated key to the authorized keys on the master node:
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
Now copy the generated key to the authorized keys on all slave nodes (in this case, the other virtual machine with the pseudo-distributed installation), as shown in the sketch below.
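A sketch of the copy, assuming the slave is reachable as datanode1 (a hypothetical hostname) and the hadoop user exists on it; where available, ssh-copy-id does the same in one step.

# append the master's public key to the slave's authorized_keys
# (datanode1 is an assumed hostname; substitute your slave's address)
$ cat ~/.ssh/id_dsa.pub | ssh hadoop@datanode1 'mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys'
# verify that passwordless logon now works
$ ssh hadoop@datanode1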
I will be writing a detailed write-up on distributed Hadoop setup. Come back to the site for the article on distributed installation.