Installing Hadoop in CentOS 7
In this article I’m going to focus on how to install Hadoop on CentOS 7. At the time of this writing there are two maintained versions of Hadoop that can be useful for different purposes, so we will install both: for Hadoop 1 we will use hadoop-1.2.1, and for Hadoop 2, hadoop-2.6.0. For other versions check the Hadoop Download Page. It is also worth noting that the hostname of the machine where I’m doing the installation is centos-vm, and I will use this name instead of localhost because it makes it possible to work with this installation remotely.
Installing Hadoop
There are different installation choices. I prefer to stay with the binaries because it gives me full control over what is configured and where. Depending on the Linux distribution chosen, using the RPMs could be another good choice.
Because every version of Hadoop can introduce small changes, these steps may not work for other versions; the safest way to install any version is to follow the instructions in the documentation included in the tar file.
For each Hadoop version we will create a folder /opt/apache-hadoop/hadoop-x.y.z and a symbolic link /opt/apache-hadoop/hadoopx that will make updates easier. In our case this translates into the following (a command sketch follows the list):
- Hadoop 1
  - Directory: /opt/apache-hadoop/hadoop-1.2.1
  - Symbolic link: /opt/apache-hadoop/hadoop1
- Hadoop 2
  - Directory: /opt/apache-hadoop/hadoop-2.6.0
  - Symbolic link: /opt/apache-hadoop/hadoop2
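A minimal sketch of the corresponding commands, assuming the release tar files have already been downloaded to the current directory:

# Unpack each release under /opt/apache-hadoop
sudo mkdir -p /opt/apache-hadoop
sudo tar -xzf hadoop-1.2.1.tar.gz -C /opt/apache-hadoop
sudo tar -xzf hadoop-2.6.0.tar.gz -C /opt/apache-hadoop
# Version-agnostic symbolic links make a later upgrade a one-line change
sudo ln -s /opt/apache-hadoop/hadoop-1.2.1 /opt/apache-hadoop/hadoop1
sudo ln -s /opt/apache-hadoop/hadoop-2.6.0 /opt/apache-hadoop/hadoop2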
Before the installation we also need to create a user and a group, and set the permissions:
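A sketch of these steps, assuming we call both the user and the group hadoop:

# Create the group and the user, with a home directory
sudo groupadd hadoop
sudo useradd -m -g hadoop hadoop
# Give the new user ownership of the installation tree
sudo chown -R hadoop:hadoop /opt/apache-hadoop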
After this we must set up the new user so it can access the host via SSH without authenticating. This can be checked with the command:
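For example, logged in as the hadoop user:

ssh centos-vm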
If you are asked for a password, do the following:
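This is the usual passwordless-SSH setup described in the Hadoop documentation (run as the hadoop user):

# Generate a passphrase-less key pair and authorize it for this host
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys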
With these steps you should be able to access the server without typing a password, which makes it easier to use Hadoop’s start/stop scripts.
Hadoop 1
Tar file structure
The uncompressed tar file has the following structure for Hadoop 1:
hadoop-1.2.1/
├── bin
├── c++
├── conf
├── contrib
├── docs
├── ivy
├── lib
├── libexec
├── logs
├── sbin
├── share
├── src
└── webapps
The documentation for this version is in the docs directory. The details for the single node installation can be found in docs/single_node_setup.html. Apart from the installation process, authentication/access problems and their solutions are also described in this document.
Editing the configuration files for a Pseudo-distributed configuration
The first step is to set the JAVA_HOME variable in /opt/apache-hadoop/hadoop1/conf/hadoop-env.sh. Different Hadoop versions may need different Java SDK versions; the compatibility table can be found here. I’m currently using Java 8 (1.8.0_25) and Hadoop is running without any issues. If at this point you don’t have Java in your system, go to Installing Java in CentOS 7.
Supposing that the right version of Java is installed in /usr/java/latest, the following command will change the configuration in the file conf/hadoop-env.sh. Notice that this may not be required if you already have the right JDK installed and the JAVA_HOME variable defined (check using echo "$JAVA_HOME").
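A sketch of such a command; it assumes hadoop-env.sh still contains the commented-out export JAVA_HOME line shipped with the release:

# Uncomment and point JAVA_HOME at the installed JDK
sed -i 's|^# export JAVA_HOME=.*|export JAVA_HOME=/usr/java/latest|' /opt/apache-hadoop/hadoop1/conf/hadoop-env.sh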
In order to be a good Linux citizen we will also configure the log path (/var/log/hadoop):
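A sketch of the steps, using the HADOOP_LOG_DIR variable that hadoop-env.sh already supports:

# Create the log directory and hand it to the hadoop user
sudo mkdir -p /var/log/hadoop
sudo chown hadoop:hadoop /var/log/hadoop
# Point Hadoop's logging at it
echo 'export HADOOP_LOG_DIR=/var/log/hadoop' >> /opt/apache-hadoop/hadoop1/conf/hadoop-env.sh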
For a Pseudo-Distributed configuration these files must also be modified:
- conf/core-site.xml:
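A minimal sketch; the fs.default.name port (9000) follows the single-node setup guide, with centos-vm in place of localhost:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://centos-vm:9000</value>
  </property>
</configuration>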
- conf/hdfs-site.xml (a sketch follows the note on data directories below):
In the original Hadoop instructions, the properties below are not configured:
- dfs.name.dir
- dfs.data.dir
- dfs.client.buffer.dir
- mapred.local.dir
Without these, the required file structures will be created by default under the /tmp/ folder. This is not a good practice, as some systems are configured to delete the contents of this folder when the system restarts. To avoid this, just configure any writable directory that will guarantee the persistence of the data. In the examples below, we will assume this folder is /var/data/hadoop:
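A sketch of conf/hdfs-site.xml under that assumption (the dfs/name and dfs/data subdirectory names are my own choice):

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/var/data/hadoop/dfs/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/var/data/hadoop/dfs/data</value>
  </property>
</configuration>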
- conf/mapred-site.xml:
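A sketch of conf/mapred-site.xml; the JobTracker port (9001) follows the single-node setup guide, and the mapred/local subdirectory is again my own choice:

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>centos-vm:9001</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/var/data/hadoop/mapred/local</value>
  </property>
</configuration>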
Initializing Hadoop
Once all the configuration has been set, the filesystem must be formatted with the following command:
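Run as the hadoop user:

/opt/apache-hadoop/hadoop1/bin/hadoop namenode -format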
The output should be similar to:
14/12/20 01:35:12 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = centos-vm/127.0.0.1
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 1.2.1
STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.2 -r 1503152; compiled by 'mattf' on Mon Jul 22 15:23:09 PDT 2013
STARTUP_MSG: java = 1.8.0_25
************************************************************/
14/12/20 01:35:12 INFO util.GSet: Computing capacity for map BlocksMap
14/12/20 01:35:12 INFO util.GSet: VM type = 64-bit
14/12/20 01:35:12 INFO util.GSet: 2.0% max memory = 932184064
14/12/20 01:35:12 INFO util.GSet: capacity = 2^21 = 2097152 entries
14/12/20 01:35:12 INFO util.GSet: recommended=2097152, actual=2097152
14/12/20 01:35:13 INFO namenode.FSNamesystem: fsOwner=hadoop
14/12/20 01:35:13 INFO namenode.FSNamesystem: supergroup=supergroup
14/12/20 01:35:13 INFO namenode.FSNamesystem: isPermissionEnabled=true
14/12/20 01:35:13 INFO namenode.FSNamesystem: dfs.block.invalidate.limit=100
14/12/20 01:35:13 INFO namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
14/12/20 01:35:13 INFO namenode.FSEditLog: dfs.namenode.edits.toleration.length = 0
14/12/20 01:35:13 INFO namenode.NameNode: Caching file names occuring more than 10 times
14/12/20 01:35:13 INFO common.Storage: Image file /home/hadoop/data/1/dfs/current/fsimage of size 112 bytes saved in 0 seconds.
14/12/20 01:35:13 INFO namenode.FSEditLog: closing edit log: position=4, editlog=/home/hadoop/data/1/dfs/current/edits
14/12/20 01:35:13 INFO namenode.FSEditLog: close success: truncate to 4, editlog=/home/hadoop/data/1/dfs/current/edits
14/12/20 01:35:13 INFO common.Storage: Storage directory /home/hadoop/data/1/dfs has been successfully formatted.
14/12/20 01:35:13 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at centos-vm/127.0.0.1
************************************************************/
This will use the folder structure configured previously. When this step is done, we are ready to start the system.
Starting/stopping the node
All the scripts to manage Hadoop can also be found in the bin folder (/opt/apache-hadoop/hadoop1/bin).
- Start the hadoop daemons:
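/opt/apache-hadoop/hadoop1/bin/start-all.sh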
- To stop the node use:
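/opt/apache-hadoop/hadoop1/bin/stop-all.sh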
When the service is started, Hadoop offers web interfaces to know the status of the system:
- NameNode - http://centos-vm:50070/
- DataNode - http://centos-vm:50075/
- JobTracker - http://centos-vm:50030/
Make the Hadoop web interfaces available for remote access
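By default the CentOS 7 firewall blocks remote access to these ports. A sketch using firewalld, opening the Hadoop 1 web UI ports listed above:

sudo firewall-cmd --permanent --add-port=50070/tcp
sudo firewall-cmd --permanent --add-port=50075/tcp
sudo firewall-cmd --permanent --add-port=50030/tcp
sudo firewall-cmd --reload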
Hadoop 2
For Hadoop 2 some details are slightly different, so we will go through them. Also, the introduction of the new MapReduce engine, YARN, will require some additional configuration.
Tar file structure
hadoop-2.6.0/
├── bin
├── etc
├── include
├── lib
├── libexec
├── sbin
└── share
For this version the documentation is in share/doc and the single node installation guide in share/doc/hadoop/hadoop-project-dist/hadoop-common/SingleCluster.html.
Editing the configuration files for a Pseudo-distributed configuration
In the same way as we did for Hadoop 1, we may need to redefine JAVA_HOME; the hadoop-env.sh file has also been moved to another directory: /opt/apache-hadoop/hadoop2/etc/hadoop/hadoop-env.sh. Because of the number of environment variables required for a proper configuration, I opted for defining them in a script in the /etc/profile.d directory, /etc/profile.d/hadoop.sh. This can also be defined in the .bashrc file in the hadoop user’s home directory.
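A sketch of such a script; the set of variables below is the one commonly used for a Hadoop 2 installation, so adjust the paths to your own layout:

# /etc/profile.d/hadoop.sh
export JAVA_HOME=/usr/java/latest
export HADOOP_PREFIX=/opt/apache-hadoop/hadoop2
export HADOOP_HOME=$HADOOP_PREFIX
export HADOOP_CONF_DIR=$HADOOP_PREFIX/etc/hadoop
export HADOOP_COMMON_HOME=$HADOOP_PREFIX
export HADOOP_HDFS_HOME=$HADOOP_PREFIX
export HADOOP_MAPRED_HOME=$HADOOP_PREFIX
export HADOOP_YARN_HOME=$HADOOP_PREFIX
export PATH=$PATH:$HADOOP_PREFIX/bin:$HADOOP_PREFIX/sbin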
- etc/hadoop/core-site.xml:
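A minimal sketch; note that the property is now called fs.defaultFS:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://centos-vm:9000</value>
  </property>
</configuration>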
- etc/hadoop/hdfs-site.xml:
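A sketch reusing /var/data/hadoop as the data root (the hdfs/namenode and hdfs/datanode subdirectory names are my own choice):

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///var/data/hadoop/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///var/data/hadoop/hdfs/datanode</value>
  </property>
</configuration>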
Notice that in this version the file paths must be given as URIs.
- etc/hadoop/mapred-site.xml (must be created or copied from mapred-site.xml.template):
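A minimal sketch that routes MapReduce jobs to YARN:

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>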
- etc/hadoop/yarn-site.xml:
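A minimal sketch enabling the shuffle service that MapReduce needs:

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>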
Initializing Hadoop
The command to format the filesystem has changed in Hadoop 2:
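/opt/apache-hadoop/hadoop2/bin/hdfs namenode -format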
Starting/stopping the node
The directory containing the scripts to start/stop the system has also changed; it is now the sbin folder (/opt/apache-hadoop/hadoop2/sbin). Notice that for this version start-all.sh and stop-all.sh have been deprecated. We will also need certain environment variables declared to make the configuration work; because this installation is only for development purposes, we will declare the environment variables in the startup script.
- Start Hadoop2:
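With start-all.sh deprecated, a sketch of starting each service separately (assuming the variables from /etc/profile.d/hadoop.sh are loaded; the history server backs the Job History UI listed below):

/opt/apache-hadoop/hadoop2/sbin/start-dfs.sh
/opt/apache-hadoop/hadoop2/sbin/start-yarn.sh
/opt/apache-hadoop/hadoop2/sbin/mr-jobhistory-daemon.sh start historyserver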
- Stop Hadoop2:
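/opt/apache-hadoop/hadoop2/sbin/mr-jobhistory-daemon.sh stop historyserver
/opt/apache-hadoop/hadoop2/sbin/stop-yarn.sh
/opt/apache-hadoop/hadoop2/sbin/stop-dfs.sh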
Once the service is started, Hadoop offers web interfaces to know the status of the system:
- NameNode - http://centos-vm:50070/
- DataNode - http://centos-vm:50075/
- Cluster Metrics - http://centos-vm:8088/
- Job History - http://centos-vm:19888/
- NodeManager Information - http://centos-vm:8042/
Make the Hadoop web interfaces available for remote access
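As with Hadoop 1, a sketch using firewalld to open the ports listed above:

sudo firewall-cmd --permanent --add-port=50070/tcp
sudo firewall-cmd --permanent --add-port=50075/tcp
sudo firewall-cmd --permanent --add-port=8088/tcp
sudo firewall-cmd --permanent --add-port=19888/tcp
sudo firewall-cmd --permanent --add-port=8042/tcp
sudo firewall-cmd --reload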
Updated 2015-03-16: Configuration improved to use Linux standards. Start/stop script improved. New ports to open added to the list.