Hadoop 101: Multi-node installation using AWS EC2

In this post, we will build a Hadoop cluster using three EC2 instances (one master, two slaves).

(I will assume that you know how to use AWS. If you don’t, please check this link.)

To run Map-Reduce tasks, you need enough memory. Therefore, we will use t2.medium instances.

(If you are a student and need some free credit, check this link.)

  • AWS EC2 t2.medium×3 (1 for a name node, 2 for data nodes)
  • Ubuntu 18.04
  • Hadoop 3.1.1

The plan is to do as much of the setup as possible on one instance first, then make an image of that instance, and then launch two more instances from that image.

!!!Don’t forget to stop/terminate the instances after your job if you don’t want to pay unnecessarily.

Step 1. Launch EC2 Instance.

Choose Ubuntu 18.04
Choose t2.medium
Put 15 GB for the Storage Size
[Important] Create a security group with a new name, and open only port 22.
It’s launching now.
If you are able to connect using SSH, step 1 is done.
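
If you prefer the command line, the same launch can be done with the AWS CLI. This is only a sketch; the AMI ID, key pair name, and group name below are placeholders you have to replace with your own values.

# create a security group that only allows SSH (port 22)
aws ec2 create-security-group --group-name hadoop --description "Hadoop cluster"
aws ec2 authorize-security-group-ingress --group-name hadoop --protocol tcp --port 22 --cidr 0.0.0.0/0

# launch one t2.medium instance with a 15 GB root volume
# (replace ami-xxxxxxxx with your region's Ubuntu 18.04 AMI and my-key with your key pair)
aws ec2 run-instances --image-id ami-xxxxxxxx --instance-type t2.medium --count 1 \
    --key-name my-key --security-groups hadoop \
    --block-device-mappings '[{"DeviceName":"/dev/sda1","Ebs":{"VolumeSize":15}}]'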

Remember that we will set up as much of the environment as possible on this instance first.

Step 2. Update the Packages and Download Hadoop

Before we get started with installing Hadoop, we need to prepare a basic working environment.

Type the following commands to install the relevant software and packages.

sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update && sudo apt-get install -y build-essential python oracle-java8-set-default

If you face any prompts, just answer yes or ok and accept them.

Now, let’s download Hadoop and set it up.

wget http://apache.claz.org/hadoop/common/hadoop-3.1.1/hadoop-3.1.1.tar.gz
tar -xzvf hadoop-3.1.1.tar.gz
sudo mv hadoop-3.1.1 /usr/local/hadoop

Then, we have to set the paths for Hadoop and Java.

sudo vi /etc/environment
PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/usr/local/hadoop/bin:/usr/local/hadoop/sbin"
JAVA_HOME="/usr/lib/jvm/java-8-oracle/jre"
source /etc/environment
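
A quick way to check that the paths took effect is shown below; note that /etc/environment is normally only applied at login, so if the commands are not found yet, log out and back in (or export PATH and JAVA_HOME in your current shell) first.

hadoop version
java -version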

Okay, now the basic installation is done. You can test a basic map-reduce program using the following commands.

/usr/local/hadoop/bin/hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.1.jar wordcount /usr/local/hadoop/LICENSE.txt ~/output
cat ~/output/part-r-*

Step 3. Set up the Hadoop Environment

Based on my experience, the most important thing in this step is double- and triple-checking every value. Edit the following configuration files (minimal example contents are sketched below):

sudo vi /usr/local/hadoop/etc/hadoop/hdfs-site.xml
sudo vi /usr/local/hadoop/etc/hadoop/core-site.xml
sudo vi /usr/local/hadoop/etc/hadoop/yarn-site.xml
sudo vi /usr/local/hadoop/etc/hadoop/mapred-site.xml
sudo vi /usr/local/hadoop/etc/hadoop/workers
sudo vi /usr/local/hadoop/etc/hadoop/masters
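
Here is a minimal sketch of what each file can contain, assuming the hostnames master, slave1, and slave2 that we assign in Step 9, the replication factor of 2 used in Step 10, and data directories under /usr/local/hadoop/data (my own choice); adapt the values to your setup.

core-site.xml:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000</value>
  </property>
</configuration>

hdfs-site.xml:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/usr/local/hadoop/data/nameNode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/usr/local/hadoop/data/dataNode</value>
  </property>
</configuration>

yarn-site.xml:

<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>master</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>

mapred-site.xml:

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapreduce.application.classpath</name>
    <value>/usr/local/hadoop/share/hadoop/mapreduce/*:/usr/local/hadoop/share/hadoop/mapreduce/lib/*</value>
  </property>
</configuration>

workers:

slave1
slave2

masters:

master

Depending on your setup, YARN may also need extra memory settings or environment whitelists, but the properties above are the usual minimum to make HDFS and YARN aware of each other.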

Step 5. Set up SSH

I think this is the most confusing part for people who are not familiar with Linux or system administration.

The purpose of this step is to give the master node passwordless access to the slave nodes. It doesn’t need to work the other way around.

Firstly, generate the key with the following command. (Just press enter when the instance asks you to type something.)

ssh-keygen -t rsa

And append it to ‘authorized_keys’.

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

If you can connect to the instance itself without typing a password using the following command, you are done with the SSH setting.

ssh localhost
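
If ssh localhost still asks for a password, a common fix is to tighten the permissions on the key files:

chmod 700 ~/.ssh
chmod 600 ~/.ssh/authorized_keys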

Step 6. Create AMI

From this point on, we will copy the instance.

If you right-click the instance, you can find the Create Image button.
Just name it as 'hadoop-node'
You can find the new image in the AMI section.
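
For reference, the same image can also be created with the AWS CLI; the instance ID below is a placeholder for your own:

aws ec2 create-image --instance-id i-xxxxxxxxxxxxxxxxx --name hadoop-node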

Step 7. Launch Two More Instances

Now, with the image that we made, we can launch exactly the same instances. Once the image is ready, launch the new instances with the following settings:

  • Choose t2.medium again
  • Put '2' for the number of instances
  • Select the same security group that you created

Okay, two more instances are launching. I named the three instances master, slave1, and slave2, respectively.

Step 8. Update the security group

It is really important to use the security group properly, opening only the ports you need and limiting access.

If you just open all the ports to the world, you will get DoS attacks from random people around the world. (It really happened to me, and your AWS cost can easily go over $100.)

Go to the Security Groups page and choose the group that we made for the Hadoop cluster.
Here is the important part: we open all the ports by choosing 'All traffic', but the source will be the nodes that use this same security group.
You just need to search 'hadoop' in the source field and it will auto-complete the full group name.
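
If you prefer the AWS CLI, a self-referencing rule like the sketch below does the same thing; sg-xxxxxxxx is a placeholder for the group's own ID:

aws ec2 authorize-security-group-ingress --group-id sg-xxxxxxxx \
    --ip-permissions '[{"IpProtocol":"-1","UserIdGroupPairs":[{"GroupId":"sg-xxxxxxxx"}]}]'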

Step 9. Make an SSH connection

Okay, when I was following other tutorials, I used to skip this part, and I always regretted it. Please use hostnames instead of IP addresses.

sudo vi /etc/hosts
This is really important! You must use the private IP of each instance.
Map each private IP to the hostnames master, slave1, and slave2, as in the sketch below.
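
Here is a sketch of what the file can look like; the private IPs are made up, and the hostnames are the names we gave the instances:

127.0.0.1 localhost
172.31.0.10 master
172.31.0.11 slave1
172.31.0.12 slave2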

Then, you should be able to connect to each instance with the following commands without typing a password.

ssh slave1
ssh slave2

To leave a slave and go back to the master, just type 'exit'.

Lastly, update each slave's hosts file as well with the following commands.

cat /etc/hosts | ssh slave1 "sudo sh -c 'cat >/etc/hosts'"
cat /etc/hosts | ssh slave2 "sudo sh -c 'cat >/etc/hosts'"

Step 10. Test Hadoop Cluster

First, you need to initialize (format) HDFS from the master node.

hdfs namenode -format
If you see a message saying the storage directory has been successfully formatted, it is working fine.

And start the HDFS related processes.

start-dfs.sh
If you type jps on the master, you should see the NameNode and SecondaryNameNode processes.
If you access one of the slaves and type jps, you will see a DataNode instead.

Make a directory and check if it is made.

hadoop fs -mkdir /test
hadoop fs -ls /
If the new directory shows up in the listing, it is working.

To see the datanodes, you can type the command below.

hdfs dfsadmin -report
Both datanodes hold exactly the same amount of data (24KB) since our replication factor is 2.

Now, let's start YARN with the following command.

start-yarn.sh
From the master node, jps will now also show a ResourceManager; from a slave node, you will see DataNode and NodeManager.

You should see two slave nodes if you type the following command.

yarn node -list

Okay, let's test whether YARN is working properly.

First, upload a file to HDFS with the following command.

hadoop fs -put /usr/local/hadoop/LICENSE.txt /test/

And run the wordcount example with the command below.

yarn jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.1.jar wordcount hdfs:///test/LICENSE.txt /test/output

If you read the output file with the command below, you should see the same word counts as in the local test from Step 2.

hadoop fs -text /test/output/*

Step 11(Optional). Test MRjob

If you are planning to work with MRjob, you can try the following commands.

https://github.com/Yelp/mrjob

git clone https://github.com/Yelp/mrjob.git
cd mrjob/
sudo python setup.py install
python mrjob/examples/mr_word_freq_count.py README.rst -r hadoop > counts
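
If mrjob cannot find your Hadoop installation, exporting HADOOP_HOME=/usr/local/hadoop is often enough; alternatively, you can point it at the binary and the streaming jar in ~/.mrjob.conf. This is only a sketch for the directory layout used in this post:

runners:
  hadoop:
    hadoop_bin: /usr/local/hadoop/bin/hadoop
    hadoop_streaming_jar: /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-3.1.1.jar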

Don't forget to stop/terminate the instances after your job if you don't want to pay unnecessarily.

Thank you!