Installing Hadoop in CentOS 7

Posted by Jose Estudillo on December 11, 2014

In this article I’m going to focus on how to install Hadoop on CentOS 7. At the time of this writing there are two maintained versions of Hadoop that can be useful for different purposes, so we will install both: for Hadoop 1 we will use hadoop-1.2.1, and for Hadoop 2, hadoop-2.6.0. For other versions check the Hadoop Download Page. It is also worth noting that the hostname of the machine where I’m doing the installation is centos-vm, and I will use this name instead of localhost because it makes it possible to work with this installation remotely.

Installing Hadoop

There are different installation choices; I prefer to stay with the binaries because it gives me full control over what is configured and where. Depending on the Linux distribution chosen, using the RPMs could be another good choice.

Because every version of Hadoop can introduce small changes, these steps may not work for other versions; the best way to install Hadoop is to follow the instructions in the documentation included in the tar file.

For each Hadoop version we will create a folder /opt/apache-hadoop/hadoop-x.y.z and a symbolic link /opt/apache-hadoop/hadoopx that will make updates easier (a sketch of these steps follows the list). In our case this translates into:

  • Hadoop 1
    • Directory: /opt/apache-hadoop/hadoop-1.2.1
    • Symbolic link: /opt/apache-hadoop/hadoop1
  • Hadoop 2
    • Directory: /opt/apache-hadoop/hadoop-2.6.0
    • Symbolic link: /opt/apache-hadoop/hadoop2
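As a minimal sketch, assuming the two release tarballs have already been downloaded into the current directory, the layout above could be created like this:

# executed as super user
mkdir -p /opt/apache-hadoop
tar -xzf hadoop-1.2.1.tar.gz -C /opt/apache-hadoop
tar -xzf hadoop-2.6.0.tar.gz -C /opt/apache-hadoop
# symbolic links so configuration can point at a stable path
ln -s /opt/apache-hadoop/hadoop-1.2.1 /opt/apache-hadoop/hadoop1
ln -s /opt/apache-hadoop/hadoop-2.6.0 /opt/apache-hadoop/hadoop2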

Before the installation we also need to create a user and a group, and set the permissions:

# executed as super user
groupadd hadoop
useradd -g hadoop -c "Hadoop user" hadoop
mkdir -p /var/log/hadoop
chown -R hadoop:hadoop /var/log/hadoop
chmod -R u+rwx,g+rwx,o+rx /var/log/hadoop
mkdir -p /var/data/hadoop
chown -R hadoop:hadoop /var/data/hadoop
chmod -R u+rwx,g+rwx,o+rx /var/data/hadoop
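A quick way to check that the account and directories are in place:

id hadoop
ls -ld /var/log/hadoop /var/data/hadoop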

After this we must allow the new user to access the host without authenticating. This can be checked with the command:

ssh centos-vm

If you are asked for a password, do the following:

# no parameters required (use the default parameters)
ssh-keygen

# copy the generated key (this would also work for a remote server)
ssh-copy-id -i ~/.ssh/id_rsa.pub centos-vm

# stop the ssh session (exit) and try again
ssh centos-vm

With these steps you should be able to access the server without typing a password, which makes it easier to use Hadoop’s start/stop scripts.
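If key-based login still asks for a password, the usual cause is over-permissive file modes, since sshd ignores keys stored with loose permissions. The expected modes are:

# run as the hadoop user
chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys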

Hadoop 1

Tar file structure

The uncompressed tar file has the following structure for Hadoop 1:

hadoop-1.2.1/
├── bin
├── c++
├── conf
├── contrib
├── docs
├── ivy
├── lib
├── libexec
├── logs
├── sbin
├── share
├── src
└── webapps

The documentation for this version is in the docs directory. The details for the single node installation can be found in docs/single_node_setup.html. Apart from the installation process, authentication/access problems and their solutions are also described in this document.

Editing the configuration files for a Pseudo-distributed configuration

The first step is to set the JAVA_HOME variable in /opt/apache-hadoop/hadoop1/conf/hadoop-env.sh. Different Hadoop versions may need different Java SDK versions; the supported versions table can be found here. I’m currently using Java 8 (1.8.0_25) and Hadoop is running without any issues. If at this point you don’t have Java on your system, go to Installing Java in CentOS 7.

Supposing that the right version of Java is installed in /usr/java/latest, the following command will change the configuration in the file conf/hadoop-env.sh. Notice that this may not be required if you already have the right JDK installed and the JAVA_HOME variable defined (check using echo "$JAVA_HOME").

sed -i -e "s-^.*JAVA_HOME.*-export JAVA_HOME=/usr/java/latest-g"  /opt/apache-hadoop/hadoop1/conf/hadoop-env.sh

In order to be a good Linux citizen we will also configure the log path (/var/log/hadoop):

sed -i -e "s-^.*HADOOP_LOG_DIR.*-export HADOOP_LOG_DIR=/var/log/hadoop-g"  /opt/apache-hadoop/hadoop1/conf/hadoop-env.sh
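To confirm that both substitutions took effect, a quick check:

grep -E "JAVA_HOME|HADOOP_LOG_DIR" /opt/apache-hadoop/hadoop1/conf/hadoop-env.sh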

For a Pseudo-Distributed configuration these files must also be modified:

  • conf/core-site.xml:
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://centos-vm:9000</value>
  </property>
</configuration>
  • conf/hdfs-site.xml:

    In the original Hadoop instructions, the properties below are not configured:

    • dfs.name.dir
    • dfs.data.dir
    • dfs.client.buffer.dir
    • mapred.local.dir

Without these, the required file structures will be created by default in the /tmp/ folder. This is not good practice, as some systems are configured to delete the contents of this folder when the system restarts. To avoid it, just configure any writable directory that will guarantee the persistence of the data. In the example below, we will assume this folder is /var/data/hadoop:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>     
  <property>
    <name>dfs.name.dir</name>
    <value>/var/data/hadoop/dfs</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/var/data/hadoop/data</value>
  </property>
  <property>
    <name>dfs.client.buffer.dir</name>
    <value>/var/data/hadoop/buffer</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/var/data/hadoop/mapred</value>
  </property>
</configuration>
  • conf/mapred-site.xml:
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>centos-vm:9001</value>
  </property>
</configuration>

Initializing Hadoop

Once all the configuration has been set, you need to format the filesystem with the following command:

cd /opt/apache-hadoop/hadoop1/bin
./hadoop namenode -format

The output should be similar to:

14/12/20 01:35:12 INFO namenode.NameNode: STARTUP_MSG: 
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = centos-vm/127.0.0.1
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 1.2.1
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.2 -r 1503152; compiled by 'mattf' on Mon Jul 22 15:23:09 PDT 2013
STARTUP_MSG:   java = 1.8.0_25
************************************************************/
14/12/20 01:35:12 INFO util.GSet: Computing capacity for map BlocksMap
14/12/20 01:35:12 INFO util.GSet: VM type       = 64-bit
14/12/20 01:35:12 INFO util.GSet: 2.0% max memory = 932184064
14/12/20 01:35:12 INFO util.GSet: capacity      = 2^21 = 2097152 entries
14/12/20 01:35:12 INFO util.GSet: recommended=2097152, actual=2097152
14/12/20 01:35:13 INFO namenode.FSNamesystem: fsOwner=hadoop
14/12/20 01:35:13 INFO namenode.FSNamesystem: supergroup=supergroup
14/12/20 01:35:13 INFO namenode.FSNamesystem: isPermissionEnabled=true
14/12/20 01:35:13 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100
14/12/20 01:35:13 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
14/12/20 01:35:13 INFO namenode.FSEditLog: dfs.namenode.edits.toleration.length = 0
14/12/20 01:35:13 INFO namenode.NameNode: Caching file names occuring more than 10 times 
14/12/20 01:35:13 INFO common.Storage: Image file /home/hadoop/data/1/dfs/current/fsimage of size 112 bytes saved in 0 seconds.
14/12/20 01:35:13 INFO namenode.FSEditLog: closing edit log: position=4, editlog=/home/hadoop/data/1/dfs/current/edits
14/12/20 01:35:13 INFO namenode.FSEditLog: close success: truncate to 4, editlog=/home/hadoop/data/1/dfs/current/edits
14/12/20 01:35:13 INFO common.Storage: Storage directory /home/hadoop/data/1/dfs has been successfully formatted.
14/12/20 01:35:13 INFO namenode.NameNode: SHUTDOWN_MSG: 
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at centos-vm/127.0.0.1
************************************************************/

This will use the folder structure configured previously. When this step is done, we are ready to start the system.

Starting/stopping the node

All the scripts to manage Hadoop can also be found in the bin folder (/opt/apache-hadoop/hadoop1/bin):

  • Start the Hadoop daemons:
start-all.sh
  • To stop the node use:
stop-all.sh
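Once the daemons are up, a quick smoke test confirms that HDFS and MapReduce work end to end. This is only a sketch, run as the hadoop user and assuming /opt/apache-hadoop/hadoop1/bin is on the PATH; /tmp/smoke-test is an arbitrary example path:

# list the running Java daemons (jps ships with the JDK)
jps
# create a test directory in HDFS and list the root
hadoop fs -mkdir /tmp/smoke-test
hadoop fs -ls /
# run the bundled pi estimator as a small MapReduce job
hadoop jar /opt/apache-hadoop/hadoop1/hadoop-examples-1.2.1.jar pi 2 100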

When the service is started, Hadoop offers web interfaces to check the status of the system:

NameNode - http://centos-vm:50070/
DataNode - http://centos-vm:50075/
JobTracker - http://centos-vm:50030/

Make the Hadoop web interfaces available for remote access

firewall-cmd --permanent --zone=public --add-port=50070/tcp
firewall-cmd --permanent --zone=public --add-port=50075/tcp
firewall-cmd --permanent --zone=public --add-port=50030/tcp
firewall-cmd --reload
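You can verify that the ports are now open with:

firewall-cmd --zone=public --list-ports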

Hadoop 2

For Hadoop 2 some details are slightly different, so we will go through them. The introduction of YARN, the new resource management layer on which MapReduce now runs, also requires some additional configuration.

Tar file structure

hadoop-2.6.0/
├── bin
├── etc
├── include
├── lib
├── libexec
├── sbin
└── share

For this version the documentation is in share/doc, and the single node installation guide is in share/doc/hadoop/hadoop-project-dist/hadoop-common/SingleCluster.html.

Editing the configuration files for a Pseudo-distributed configuration

In the same way as we did for Hadoop 1, we may need to redefine JAVA_HOME; the hadoop-env.sh file has been moved to another directory: /opt/apache-hadoop/hadoop2/etc/hadoop/hadoop-env.sh. Because of the number of environment variables required to have a proper configuration, I opted for defining them as a script in the /etc/profile.d directory:

  • /etc/profile.d/hadoop.sh
export HADOOP_HOME=/opt/apache-hadoop/hadoop2
export HADOOP=$HADOOP_HOME/bin
export PATH=$HADOOP:$PATH

export HADOOP_PREFIX=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_PREFIX/etc/hadoop
export HADOOP_YARN_HOME=$HADOOP_HOME

export NN_DATA_DIR=/var/data/hadoop/hdfs/nn
export SNN_DATA_DIR=/var/data/hadoop/hdfs/snn
export DN_DATA_DIR=/var/data/hadoop/hdfs/dn
export YARN_LOG_DIR=/var/log/hadoop/yarn
export HADOOP_LOG_DIR=/var/log/hadoop/hdfs
export HADOOP_MAPRED_LOG_DIR=/var/log/hadoop/mapred
export YARN_PID_DIR=/var/run/hadoop/yarn
export HADOOP_PID_DIR=/var/run/hadoop/hdfs
export HADOOP_MAPRED_PID_DIR=/var/run/hadoop/mapred

These variables can also be defined in the .bashrc file in the hadoop user’s home directory.
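The data, log and PID directories referenced in the script above do not exist yet; one possible way to create them, as super user, matching the paths defined there:

# create the HDFS data, log and PID directories for Hadoop 2
mkdir -p /var/data/hadoop/hdfs/{nn,snn,dn}
mkdir -p /var/log/hadoop/{yarn,hdfs,mapred}
mkdir -p /var/run/hadoop/{yarn,hdfs,mapred}
chown -R hadoop:hadoop /var/data/hadoop /var/log/hadoop /var/run/hadoop

Note that on CentOS 7 /var/run lives on a tmpfs that is cleared on reboot, which is why the start script shown later recreates /var/run/hadoop.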

  • etc/hadoop/core-site.xml:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://centos-vm:9000</value>
  </property>
</configuration>
  • etc/hadoop/hdfs-site.xml:

Notice that in this version the file paths must be given as URIs.

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>file:///var/data/hadoop/dfs</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>file:///var/data/hadoop/dfs/data</value>
  </property>
  <property>
    <name>dfs.client.buffer.dir</name>
    <value>file:///var/data/hadoop/dfs/buffer</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
<value>file:///var/data/hadoop/mapred</value>
  </property>
</configuration>
  • etc/hadoop/mapred-site.xml (must be created or copied from mapred-site.xml.template, as shown after this list):
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
  • etc/hadoop/yarn-site.xml:
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
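For example, mapred-site.xml can be created from the bundled template:

cd /opt/apache-hadoop/hadoop2/etc/hadoop
cp mapred-site.xml.template mapred-site.xml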

Initializing Hadoop

The command to format the filesystem has changed in Hadoop 2:

cd /opt/apache-hadoop/hadoop2/bin
./hdfs namenode -format

Starting/stopping the node

The directory containing the scripts to start/stop the system has also changed; they are now located in the sbin folder (/opt/apache-hadoop/hadoop2/sbin). Notice that in this version start-all.sh and stop-all.sh have been deprecated. We will also need certain environment variables declared to make the configuration work; because this installation is only for development purposes, we declare them in the startup script.

  • Start Hadoop2:
mkdir -p /var/run/hadoop
chown -R hadoop:hadoop /var/run/hadoop
echo "Starting Hadoop2 NameNode and DataNode"
su - hadoop -c '/opt/apache-hadoop/hadoop2/sbin/start-dfs.sh'
echo "Starting Hadoop2 YARN"
su - hadoop -c '/opt/apache-hadoop/hadoop2/sbin/start-yarn.sh'
echo "Starting Job History server"
su - hadoop -c '/opt/apache-hadoop/hadoop2/sbin/mr-jobhistory-daemon.sh start historyserver'
  • Stop Hadoop2:
echo "Stopping Hadoop2 NameNode and DataNode"
su - hadoop -c '/opt/apache-hadoop/hadoop2/sbin/stop-dfs.sh'
echo "Stopping Hadoop2 YARN"
su - hadoop -c '/opt/apache-hadoop/hadoop2/sbin/stop-yarn.sh'
echo "Stopping Job History Server"
su - hadoop -c '/opt/apache-hadoop/hadoop2/sbin/mr-jobhistory-daemon.sh stop historyserver'

Once the service is started, Hadoop offers web interfaces to check the status of the system:

NameNode - http://centos-vm:50070/
DataNode - http://centos-vm:50075/
Cluster Metrics - http://centos-vm:8088/
Job History - http://centos-vm:19888/
NodeManager Information - http://centos-vm:8042/
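As with Hadoop 1, a quick smoke test can confirm that HDFS and YARN work end to end. This is only a sketch, run as the hadoop user and assuming /opt/apache-hadoop/hadoop2/bin is on the PATH:

# list the running Java daemons (jps ships with the JDK)
jps
# create the user's home directory in HDFS
hdfs dfs -mkdir -p /user/hadoop
# run the bundled pi estimator as a small YARN job
yarn jar /opt/apache-hadoop/hadoop2/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar pi 2 100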

Make the Hadoop web interfaces available for remote access

firewall-cmd --permanent --zone=public --add-port=50070/tcp
firewall-cmd --permanent --zone=public --add-port=50075/tcp
firewall-cmd --permanent --zone=public --add-port=8088/tcp
firewall-cmd --permanent --zone=public --add-port=19888/tcp
firewall-cmd --permanent --zone=public --add-port=8042/tcp
firewall-cmd --reload

Updated 2015-03-16: Configuration improved to use Linux standards. Start/stop scripts improved. New ports to open added to the list.