In this tutorial I want to share my experience of how to set up and start a local Hadoop environment. The goal is to create a local environment in which you will be able to run MapReduce jobs and sample YARN applications.
This tutorial is tailored towards a UNIX-like environment (Linux or Mac OS X). Windows users will have to adapt the setup steps to their needs.
Before we start, we first want to take a closer look at the basic components which are part of a Hadoop distribution. This knowledge will give you a better understanding when it comes to configuring the system. So what are the main components in a minimal Hadoop setup? Mainly you have two: HDFS for storage and YARN for resource scheduling and processing. You also get the MapReduce runtime delivered with the standard distribution, but this one won't be covered in this article.
The Hadoop Distributed File System (HDFS) is the workhorse in a Hadoop system. It is responsible for storing and managing files in a distributed, redundant and fault-tolerant way. Replication is the key concept here to ensure reliability.
The image below shows an oversimplified view of the components involved.
NameNode: Manages the location of each block of a given file. The NameNode is the single point of entry to your HDFS system.
Secondary NameNode: Every change to the file system is recorded by the NameNode in an edit log. The main task of the Secondary NameNode is to periodically merge this log into the main index (the fsimage), a process known as checkpointing. The idea behind this is performance: by delegating the merge work to the Secondary NameNode, the actual NameNode can serve client requests faster.
DataNode: The DataNodes hold the actual data blocks in the corresponding HDFS file on their local disk.
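To make the division of labour concrete, here is a toy sketch of the kind of metadata the NameNode keeps: which blocks make up a file, and which DataNodes hold a replica of each block. All node names, sizes and the round-robin placement policy below are invented for illustration; the real NameNode also takes rack topology and free disk space into account.

```python
import math

# Defaults in Hadoop 2.x (configurable via dfs.blocksize / dfs.replication)
BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB per block
REPLICATION = 3                  # three replicas per block

datanodes = ["dn1", "dn2", "dn3", "dn4"]   # hypothetical DataNodes

def split_into_blocks(file_size):
    """Number of blocks a file of the given size (bytes) occupies."""
    return max(1, math.ceil(file_size / BLOCK_SIZE))

def place_replicas(num_blocks):
    """Assign each block to REPLICATION distinct DataNodes (round-robin).
    Returns a dict: block index -> list of DataNode names."""
    return {
        b: [datanodes[(b + r) % len(datanodes)] for r in range(REPLICATION)]
        for b in range(num_blocks)
    }

# A 300 MB file is split into 3 blocks (128 + 128 + 44 MB),
# and each block ends up on 3 different DataNodes.
blocks = split_into_blocks(300 * 1024 * 1024)
print(blocks)                    # 3
print(place_replicas(blocks)[0])  # e.g. ['dn1', 'dn2', 'dn3']
```

If one DataNode dies, every block it held still exists on two other nodes; the NameNode then re-replicates those blocks to restore the target replication factor.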
This is far from the complete picture. The detailed read/write interaction between the client and the NameNode/DataNodes is out of scope for this document and will be covered in a separate article.
In the Hadoop 2.x distribution, the NameNode can be run in an HA (High Availability) mode. Here the passive NameNode (which is not serving client requests) handles the checkpointing of the fsimage file. In a failover scenario, where the main NameNode is no longer reachable, it changes its status from passive to active and starts serving client requests. We won't cover this setup in this tutorial, because it is a more advanced topic and is only relevant in a production cluster environment.
YARN stands for Yet Another Resource Negotiator and is responsible for allocating resources on the actual compute nodes (NodeManager).
ResourceManager: The ResourceManager is the main component in YARN. It is responsible for allocating resources on the NodeManagers and assigning them to the application which requested them.
NodeManager: This is the place where the actual containers will live. The NodeManagers provide compute power to the application by exposing available resources (mainly RAM, aka slots) to the ResourceManager. The negotiation is done between the client and the ResourceManager; the actual containers (JVM processes) run on the NodeManagers.
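The negotiation described above can be sketched with a deliberately oversimplified model, in which a container request is just an amount of RAM. The class and method names are made up for illustration; the real protocol involves an ApplicationMaster, heartbeats and a much richer resource model.

```python
# Hypothetical, minimal model of YARN resource negotiation.
class NodeManager:
    def __init__(self, name, ram_mb):
        self.name = name
        self.free_ram = ram_mb   # resources advertised to the ResourceManager

class ResourceManager:
    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, ram_mb):
        """Place a container on the first node with enough free RAM.
        Returns the node name, or None if no node can satisfy the request."""
        for node in self.nodes:
            if node.free_ram >= ram_mb:
                node.free_ram -= ram_mb
                return node.name
        return None

rm = ResourceManager([NodeManager("nm1", 4096), NodeManager("nm2", 2048)])
print(rm.allocate(3072))  # container lands on nm1
print(rm.allocate(2048))  # nm1 has only 1024 MB left, so nm2 takes it
print(rm.allocate(4096))  # no node can satisfy this request -> None
```

The real scheduler (Capacity or Fair Scheduler) is far more sophisticated, but the core idea is the same: the ResourceManager hands out containers, and the NodeManagers run them.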
Now it's time to choose the Hadoop version which we want to use in our setup. There are currently three version tracks available:
1.2.X - current stable version, 1.2 release
2.6.X - stable 2.x version
0.23.X - similar to 2.x.x but missing NameNode HA
Since the 1.2.x based version is considered to be legacy, we will cover the 2.x version of Hadoop in this tutorial.
Download & Install
Ok, now lets get started. The main entry point for getting started with Hadoop is the Apache Hadoop Project website. Here you will find all the necessary binaries to install Hadoop.
$ mkdir hadoop
$ cd hadoop
$ wget http://mirror.netcologne.de/apache.org/hadoop/common/hadoop-2.6.0/hadoop-2.6.0.tar.gz
Now unzip the archive
$ tar xvfz hadoop-2.6.0.tar.gz
Now you should have the basic Hadoop distribution on your machine. The next step is to configure and start the HDFS and YARN daemons.
Configure & Startup
First you have to make sure that you can ssh into your localhost without using a password (passwordless access).
If the following command goes through without prompting for a password, you are all set.
$ ssh localhost
If that is not the case, you need to create a private/public key pair and add the public key to the authorized_keys file in your ~/.ssh directory:
$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
HDFS on a single Node
Now it is time to configure the HDFS part of your system. The configuration files can be found in the etc/hadoop/ directory (assuming you are inside the unzipped Hadoop distribution, in our case hadoop-2.6.0).
$ cd etc/hadoop
$ ls -al
This should give you a long list of configuration files. For our setup we will only be editing a small subset.
First we need to specify the endpoint of our NameNode in the core-site.xml file:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
Since we are only running a single-node Hadoop, we will set the global replication factor to 1 in hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
That was it for the configuration part so far. Before we can run any Hadoop commands, we need to set the JAVA_HOME environment variable:
$ export JAVA_HOME=/usr/lib/jvm/PATH_TO_JAVA_RUNTIME
You could also export the JAVA_HOME environment variable in yarn-env.sh in the configuration directory. This will prevent "JAVA_HOME is not set and could not be found." errors when you start the daemons.
Now it is time to format (create) your HDFS.
$ ./bin/hdfs namenode -format
This will create your HDFS under the /tmp directory. To change this directory (in case you don't want to lose your HDFS after a reboot) you need to add the following property to your hdfs-site.xml:
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/PATH_TO_MY_HDFS</value>
</property>
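Note that the NameNode's metadata also lives under /tmp by default. If you relocate the DataNode directory, you will likely want to move the NameNode directory as well. A sketch of a fuller hdfs-site.xml, using the standard Hadoop 2.x property names dfs.namenode.name.dir and dfs.datanode.data.dir; the paths are placeholders you would replace with your own:

```xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/PATH_TO_MY_HDFS/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/PATH_TO_MY_HDFS/datanode</value>
  </property>
</configuration>
```

If you change these paths after formatting, you will need to re-run the format step.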
Now let's start the NameNode and DataNode daemons:

$ ./sbin/start-dfs.sh
Now let's run some commands on HDFS:
$ ./bin/hadoop fs -mkdir /tmp
This will create the directory /tmp. You can list it via
$ ./bin/hadoop fs -ls /
The NameNode comes with a built-in web interface. It is reachable via http://localhost:50070/.
To stop the NameNode and DataNode daemons, run

$ ./sbin/stop-dfs.sh
YARN on a single Node
In the same way you can now configure YARN and the MapReduce component.
Tell MapReduce to run on YARN in mapred-site.xml (you may need to copy mapred-site.xml.template to mapred-site.xml first):

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
And enable the MapReduce shuffle service in yarn-site.xml:

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
Run the YARN startup script to start the YARN daemons:

$ ./sbin/start-yarn.sh
YARN also comes with its own web interface. It can be reached at http://localhost:8088/.
Finally, we shut down the YARN daemons via

$ ./sbin/stop-yarn.sh
You should now have a (hopefully successful) local single-node Hadoop setup. The next step would be to write a simple MapReduce job or YARN application. I'm planning to cover those topics in a future tutorial.