Hadoop Installation Modes

The Apache Hadoop framework is creating a big buzz in the IT world. It provides a solution to the challenges that big data poses to the digital world. The framework allows analysis of large datasets distributed across clusters of computers using a simple programming model.

 
To start learning about Hadoop you will want to set up a Hadoop environment. There are basically three modes in which a Hadoop cluster can be installed:
 
Hadoop Stand Alone Mode
Hadoop Pseudo Distributed Mode
Hadoop Distributed Mode
 
This article will guide you through the practical aspects of Hadoop.
Hadoop Stand Alone Mode:-
 
To understand the basics of Hadoop and to use it as a playground for running some exercises, the stand-alone mode of Hadoop is sufficient. In this mode you install the bare minimum of components on a single system.
 
The following high-level steps are required for a stand-alone Hadoop setup (a consolidated shell sketch follows the list):
 
1. Set up a virtual machine with any Linux environment (CentOS or Ubuntu).
2. Install Java on the virtual machine.
3. Create a Hadoop user.
4. Download the Hadoop installation files and extract them on your virtual machine.
5. Grant permissions to the Hadoop user on the folder where Hadoop is extracted.
6. Edit the /home/hadoop/hadoop/conf/hadoop-env.sh file to set the HADOOP_HOME and JAVA_HOME variables.
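 
The list above maps to a short shell session. The sketch below assumes Ubuntu, OpenJDK and a Hadoop 1.x tarball; the mirror URL, version number and JAVA_HOME path are placeholders that you must adjust for your own distribution and release.
 
# 2. Install Java (OpenJDK shown; package name varies by distribution)
sudo apt-get install openjdk-7-jdk

# 3. Create a dedicated Hadoop user
sudo adduser hadoop

# 4. Download and extract Hadoop (version and mirror are placeholders)
wget http://archive.apache.org/dist/hadoop/core/hadoop-1.2.1/hadoop-1.2.1.tar.gz
tar -xzf hadoop-1.2.1.tar.gz
sudo mv hadoop-1.2.1 /home/hadoop/hadoop

# 5. Grant the Hadoop user ownership of the extracted folder
sudo chown -R hadoop:hadoop /home/hadoop/hadoop

# 6. Point hadoop-env.sh at your Java and Hadoop locations
#    (run these as the hadoop user, since it now owns the folder)
echo 'export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64' >> /home/hadoop/hadoop/conf/hadoop-env.sh
echo 'export HADOOP_HOME=/home/hadoop/hadoop' >> /home/hadoop/hadoop/conf/hadoop-env.sh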
 
For details on these steps, kindly refer to the article on setting up a Hadoop cluster on a single node and follow the steps between the following tags on that page:
 
---- STAND ALONE, PSEUDO DISTRIBUTED HADOOP INSTALLATION STARTED ----
---- STAND ALONE HADOOP INSTALLATION COMPLETED ----
 
After completing the above steps you should have a stand-alone Hadoop installation. You can verify the installation by executing the hadoop command from the Hadoop home location, where Hadoop is installed, and you should see the following output:
 
[hadoop@nn1 ~]$ hadoop                                                     
Usage: hadoop [--config confdir] COMMAND                                   
where COMMAND is one of:                                                   
  namenode -format     format the DFS filesystem                           
  secondarynamenode    run the DFS secondary namenode                      
  namenode             run the DFS namenode                                
  datanode             run a DFS datanode                                  
  dfsadmin             run a DFS admin client                              
  mradmin              run a Map-Reduce admin client                       
  fsck                 run a DFS filesystem checking utility               
  fs                   run a generic filesystem user client                
  balancer             run a cluster balancing utility                     
  fetchdt              fetch a delegation token from the NameNode          
  jobtracker           run the MapReduce job Tracker node                  
  pipes                run a Pipes job                                     
  tasktracker          run a MapReduce task Tracker node                   
  historyserver        run job history servers as a standalone daemon      
  job                  manipulate MapReduce jobs                           
  queue                get information regarding JobQueues                 
  version              print the version                                   
  jar <jar>            run a jar file
  distcp <srcurl> <desturl> copy file or directories recursively
  archive -archiveName NAME -p <parent path> <src>* <dest> create a hadoop archive
  classpath            prints the class path needed to get the
                       Hadoop jar and the required libraries
  daemonlog            get/set the log level for each daemon               
 or                                                                        
  CLASSNAME            run the class named CLASSNAME                       
Most commands print help when invoked w/o parameters.                      
[hadoop@nn1 ~]$                                                            
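 
As a quick smoke test of the stand-alone installation you can run one of the example jobs that ship with Hadoop against the local filesystem. The sketch below follows the standard single-node quick start; the examples jar name varies by release, so adjust it to match your download.
 
$ mkdir input
$ cp /home/hadoop/hadoop/conf/*.xml input
$ hadoop jar /home/hadoop/hadoop/hadoop-examples-*.jar grep input output 'dfs[a-z.]+'
$ cat output/*
 
In stand-alone mode this job reads and writes ordinary local directories; no HDFS or MapReduce daemons are involved.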
 
 
Hadoop Pseudo Distributed Mode:-
 
As the name suggests, pseudo-distributed mode is not in reality a distributed Hadoop installation; it simulates one. Distributed mode requires that you set up Hadoop on a multi-node cluster (minimum two nodes), but with pseudo-distributed mode you can get a feel for a distributed Hadoop environment on a single node: each Hadoop daemon runs as a separate Java process on the same machine.
 
Apart from the six steps listed above, you will additionally need to do the following:
 
7. Change the following configuration files (for details on the changes in the configuration files, refer to the article; a minimal sketch follows the list below):
 
CORE-SITE.XML(/home/hadoop/hadoop/conf/core-site.xml)
HDFS-SITE.XML(/home/hadoop/hadoop/conf/hdfs-site.xml)
MAPRED-SITE.XML(/home/hadoop/hadoop/conf/mapred-site.xml)
MASTERS(/home/hadoop/hadoop/conf/masters) - Not mandatory
SLAVES(/home/hadoop/hadoop/conf/slaves) - Not mandatory
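 
As a minimal sketch, the typical pseudo-distributed values for a Hadoop 1.x installation look like the following. The localhost host name and port numbers are conventional defaults rather than requirements, so check the referenced article for the exact values it uses.
 
<!-- core-site.xml: where HDFS lives -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- hdfs-site.xml: single node, so keep one replica per block -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

<!-- mapred-site.xml: where the JobTracker listens -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>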
 
If you follow all the steps between the following tags, shown in the article for a single-node Hadoop cluster, you will have a fully functional pseudo-distributed Hadoop installation:
 
---- STAND ALONE, PSEUDO DISTRIBUTED HADOOP INSTALLATION STARTED ----
---- PSEUDO DISTRIBUTED HADOOP INSTALLATION COMPLETED ----
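 
Once the configuration files are in place, the first run of a pseudo-distributed installation typically looks like the sketch below (assuming the Hadoop 1.x bin directory is on your PATH):
 
# One-time only: format the HDFS filesystem
$ hadoop namenode -format

# Start all five daemons (NameNode, SecondaryNameNode, DataNode, JobTracker, TaskTracker)
$ start-all.sh

# Verify that the daemons are running as separate Java processes
$ jps
 
If everything is healthy, jps should list the five Hadoop daemons alongside itself.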
 
 
Hadoop Distributed Mode:-
 
A Hadoop installation in a production environment is actually a distributed installation with one name node and several data nodes. To set up a distributed Hadoop cluster you need at least two machines: one acting as both name node and data node, and the other acting as a data node.
 
To start with, you will need to set up two machines in pseudo-distributed mode as explained above. Following are the high-level steps required for a distributed Hadoop installation.
 
Generate an SSH key for passwordless logon on the master node (the name node machine) using the following command:
 
$ ssh-keygen -t dsa -P "" -f ~/.ssh/id_dsa
 
Append the generated public key to the authorized keys file on the master node:
 
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
 
Now copy the generated key to the authorized keys on all slave nodes (in this case, the other virtual machine with the pseudo-distributed installation).
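 
One common way to do this is shown below. The host name slave1 is a placeholder for your second machine, and this sketch assumes the hadoop user already exists there with a ~/.ssh directory.
 
$ cat ~/.ssh/id_dsa.pub | ssh hadoop@slave1 'cat >> ~/.ssh/authorized_keys'
 
Afterwards, running ssh hadoop@slave1 from the master should log you in without prompting for a password; Hadoop's start scripts rely on this to launch the daemons on the slave nodes.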
 
I will be writing a detailed write-up on distributed Hadoop setup. Come back to our site for the article on the distributed installation.