The Apache Hadoop framework has generated a great deal of buzz in the IT world. It addresses the challenges that big data poses to the digital world: the framework enables analysis of large datasets distributed across clusters of computers using a simple programming model.
To start learning about Hadoop you will want to set up a Hadoop environment. There are basically three modes in which a Hadoop cluster can be installed: stand-alone mode, pseudo-distributed mode, and fully distributed mode. This guide walks you through the practical aspects of all three.

Hadoop Stand Alone Mode:-

To understand the basics of Hadoop and to use it as a playground for running exercises, the stand-alone mode is sufficient. In this mode you install the bare minimum components on a single system. The following high-level steps are required for a stand-alone Hadoop setup:

1. Set up a virtual machine with any Linux environment (CentOS or Ubuntu).
2. Install Java on the virtual machine.
3. Create a dedicated hadoop user on the machine.
4. Download the Hadoop installation files and extract them on your virtual machine.
5. Grant permissions to the hadoop user on the folder where Hadoop is extracted.
6. Edit the /home/hadoop/hadoop/conf/hadoop-env.sh file to set the HADOOP_HOME and JAVA_HOME variables (a sketch of this step appears at the end of this section).

After completing the steps above you should have a stand-alone Hadoop installation. You can check the installation by executing the hadoop command from the Hadoop home location, where Hadoop is installed; you should see the following output:

Usage: hadoop [--config confdir] COMMAND
  namenode -format     format the DFS filesystem
  secondarynamenode    run the DFS secondary namenode
  namenode             run the DFS namenode
  datanode             run a DFS datanode
  dfsadmin             run a DFS admin client
  mradmin              run a Map-Reduce admin client
  fsck                 run a DFS filesystem checking utility
  fs                   run a generic filesystem user client
  balancer             run a cluster balancing utility
  fetchdt              fetch a delegation token from the NameNode
  jobtracker           run the MapReduce job Tracker node
  tasktracker          run a MapReduce task Tracker node
  historyserver        run job history servers as a standalone daemon
  job                  manipulate MapReduce jobs
  queue                get information regarding JobQueues
  version              print the version
  distcp <srcurl> <desturl>   copy file or directories recursively
  archive -archiveName NAME -p <parent path> <src>* <dest>   create a hadoop archive
  classpath            prints the class path needed to get the Hadoop jar and the required libraries
  daemonlog            get/set the log level for each daemon
  CLASSNAME            run the class named CLASSNAME
Most commands print help when invoked w/o parameters.
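A minimal sketch of step 6, assuming Hadoop was extracted to /home/hadoop/hadoop (the path used throughout this article) and the JDK lives under /usr/lib/jvm/java-6-openjdk (an assumed path; adjust it to your own system), is to append the two exports to hadoop-env.sh:

$ # Point Hadoop at the JDK and at its own installation directory
$ echo 'export JAVA_HOME=/usr/lib/jvm/java-6-openjdk' >> /home/hadoop/hadoop/conf/hadoop-env.sh
$ echo 'export HADOOP_HOME=/home/hadoop/hadoop' >> /home/hadoop/hadoop/conf/hadoop-env.sh

The launcher scripts in bin/ source conf/hadoop-env.sh on every invocation, so the variables take effect the next time you run the hadoop command.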
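To verify the installation beyond the usage message, you can run one of the example jobs that ships with Hadoop. The sketch below follows the classic grep example from the Hadoop single-node quick start; the exact name of the examples jar (hadoop-examples-*.jar here) varies between releases:

$ cd /home/hadoop/hadoop
$ mkdir input
$ cp conf/*.xml input
$ # Count the occurrences of every string matching the regex in the input files
$ bin/hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+'
$ cat output/*

In stand-alone mode the job runs in a single JVM against the local filesystem, which is exactly what makes this mode a convenient playground.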
Hadoop Pseudo Distributed Mode:-

As the name suggests, pseudo-distributed mode is not in reality a distributed Hadoop installation, but it simulates one. Distributed mode requires that you set up the Hadoop installation on a multi-node cluster (minimum two nodes), but with pseudo-distributed mode you can get a feel for a distributed Hadoop environment with Hadoop on a single node. Apart from the six steps listed above, you will additionally need to do the following:

7. Change the following configuration files. For details on the changes refer to the article at http://mainframewizard.com/content/setup-single-node-hadoop-cluster; a minimal sketch of these files also appears at the end of this article.

CORE-SITE.XML (/home/hadoop/hadoop/conf/core-site.xml)
HDFS-SITE.XML (/home/hadoop/hadoop/conf/hdfs-site.xml)
MAPRED-SITE.XML (/home/hadoop/hadoop/conf/mapred-site.xml)
MASTERS (/home/hadoop/hadoop/conf/masters) - not mandatory
SLAVES (/home/hadoop/hadoop/conf/slaves) - not mandatory

With these changes in place you get a fully functional pseudo-distributed Hadoop installation.

Hadoop Distributed Mode:-

A Hadoop installation in a production environment is actually a distributed installation with one name node and several data nodes. For setting up a distributed Hadoop cluster you need at least two machines, one acting as both name node and data node and the other acting as a data node. To start with, you will need to set up the two machines in pseudo-distributed mode as explained above. The following high-level steps are required for a distributed Hadoop installation.

Generate an SSH key for passwordless logon on the master node (the name node machine) using the following command:

$ ssh-keygen -t dsa -P "" -f ~/.ssh/id_dsa

Copy the generated key to the authorized keys on the master node:

$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

Now copy the generated key to the authorized keys on all slave nodes (in this case the other virtual machine with the pseudo-distributed installation); a sketch of this step also appears below.

I will be writing a detailed write-up on the distributed Hadoop setup; come back to our site for the article on the distributed installation.
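As promised in step 7 above, here is a minimal sketch of the three mandatory configuration files for a pseudo-distributed setup, followed by formatting the name node and starting the daemons. The property names and the localhost ports (9000 for HDFS, 9001 for the job tracker) follow the conventional Hadoop 1.x single-node examples; treat the values as assumptions and adapt them to your version and environment:

$ cd /home/hadoop/hadoop
$ # core-site.xml: tell Hadoop where the default filesystem (HDFS) lives
$ cat > conf/core-site.xml <<'EOF'
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
EOF
$ # hdfs-site.xml: a single node can hold only one replica of each block
$ cat > conf/hdfs-site.xml <<'EOF'
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF
$ # mapred-site.xml: tell MapReduce where the job tracker runs
$ cat > conf/mapred-site.xml <<'EOF'
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
EOF
$ bin/hadoop namenode -format   # one-time step: initialise the HDFS filesystem
$ bin/start-all.sh              # start the HDFS and MapReduce daemons
$ jps                           # should list NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker

The namenode -format subcommand and the start scripts are standard parts of a Hadoop 1.x distribution; namenode -format is also visible in the usage output shown earlier.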
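For the copy of the key to the slave nodes, one minimal approach, assuming the slave is reachable as slavenode and runs a hadoop user (both names are assumptions; substitute your own host name and user), is to append the master's public key over SSH and then test the passwordless logon:

$ # Append the master's public key to the slave's authorized keys
$ cat ~/.ssh/id_dsa.pub | ssh hadoop@slavenode 'mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys'
$ # Verify: this should now log in without prompting for a password
$ ssh hadoop@slavenode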