Hadoop 101: Multi-node installation using AWS EC2

In this post, we will build a multi-node Hadoop cluster using three EC2 instances (one master, two slaves).

(I will assume that you know how to use AWS. If you don’t, please check this link.)



To run MapReduce tasks properly, you need enough memory, so we will use t2.medium instances.

(If you are a student and need some free credit, check this link.)

  • AWS EC2 t2.medium×3 (1 for a name node, 2 for data nodes)
  • Ubuntu 18.04
  • Hadoop 3.1.1

Here is the plan.
First, we will do all of the setup on one instance. Then we will create an image of that instance, and launch two more instances from it.

!!!Don’t forget to stop/terminate the instances when you are done, if you don’t want to pay unnecessarily.

Step 1. Launch EC2 Instance.

  • Choose Ubuntu 18.04.
  • Choose t2.medium.
  • Set the storage size to 15 GB.
  • [Important] Create a security group with a new name, and open only port 22.

Then click “Review and Launch”, and the instance will start.
If you can connect to the instance using SSH, step 1 is done.

Remember that we will set up the Hadoop environment as much as possible on this instance.

Step 2. Update the Packages and Download Hadoop

Before we start installing Hadoop, we need to prepare a basic working environment.

Type the following commands to install the relevant software and packages.

sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update && sudo apt-get install -y build-essential python oracle-java8-set-default

If you face any prompts, just answer yes or OK and agree to whatever it asks.
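Once the installation finishes, it doesn’t hurt to confirm that Java is available (this quick check isn’t part of the original steps):

java -version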

Now, let’s download Hadoop and set it up.

wget http://apache.claz.org/hadoop/common/hadoop-3.1.1/hadoop-3.1.1.tar.gz
tar -xzvf hadoop-3.1.1.tar.gz
sudo mv hadoop-3.1.1 /usr/local/hadoop

Then, we have to set the path for Hadoop and Java.

sudo vi /etc/environment

Replace the contents with the following PATH and JAVA_HOME.

PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/usr/local/hadoop/bin:/usr/local/hadoop/sbin"
JAVA_HOME="/usr/lib/jvm/java-8-oracle/jre"

Then, apply the new PATH using the following command.

source /etc/environment

Okay, now the basic installation is done. You can test a basic MapReduce job using the following commands.

hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.1.jar wordcount /usr/local/hadoop/LICENSE.txt ~/output
cat ~/output/part-r-*
Great!

!If you see an error saying that JAVA_HOME is not set, just export JAVA_HOME again:

export JAVA_HOME=/usr/lib/jvm/java-8-oracle/jre
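If the error keeps coming back in new shells, another common place to set it (my own suggestion, not part of the original post) is Hadoop’s own environment file:

echo 'export JAVA_HOME=/usr/lib/jvm/java-8-oracle/jre' | sudo tee -a /usr/local/hadoop/etc/hadoop/hadoop-env.sh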

Step 3. Set up the Hadoop Environment

Based on my experience, the most important thing in this step is double- and triple-checking your configuration. Edit the following files (a rough sketch of minimal contents follows the list).

sudo vi /usr/local/hadoop/etc/hadoop/hdfs-site.xml
sudo vi /usr/local/hadoop/etc/hadoop/core-site.xml
sudo vi /usr/local/hadoop/etc/hadoop/yarn-site.xml
sudo vi /usr/local/hadoop/etc/hadoop/mapred-site.xml
sudo vi /usr/local/hadoop/etc/hadoop/workers
sudo vi /usr/local/hadoop/etc/hadoop/masters
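The original post embedded the contents of these files as gists, which are not reproduced here. As a rough sketch only (the hostnames match the master/slave1/slave2 names used later in this post, the replication factor of 2 matches Step 10, and the data directory matches the path mentioned in the comments below, but the 9000 port and the exact property set are my assumptions), the key properties inside each file’s <configuration> block would look something like this:

core-site.xml:
  <property><name>fs.defaultFS</name><value>hdfs://master:9000</value></property>

hdfs-site.xml:
  <property><name>dfs.replication</name><value>2</value></property>
  <property><name>dfs.namenode.name.dir</name><value>/usr/local/hadoop/data/nameNode</value></property>
  <property><name>dfs.datanode.data.dir</name><value>/usr/local/hadoop/data/dataNode</value></property>

yarn-site.xml:
  <property><name>yarn.resourcemanager.hostname</name><value>master</value></property>
  <property><name>yarn.nodemanager.aux-services</name><value>mapreduce_shuffle</value></property>

mapred-site.xml:
  <property><name>mapreduce.framework.name</name><value>yarn</value></property>

workers:
  slave1
  slave2

masters:
  master

On Hadoop 3.x the example jobs may also need HADOOP_MAPRED_HOME set in mapred-site.xml (via yarn.app.mapreduce.am.env, mapreduce.map.env, and mapreduce.reduce.env, each set to HADOOP_MAPRED_HOME=/usr/local/hadoop); again, treat this as an assumption and verify against your own configuration.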

Step 5. Set SSH

I think this is the most confusing part for people who are not familiar with Linux or system stuff.

The purpose of this step is to give the master node SSH access to the slave nodes. (It doesn’t need to work the other way around.)

First, generate a key pair using the following command. (Just press Enter whenever it asks you to type something.)

ssh-keygen -t rsa

And append the public key to ‘authorized_keys’.

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

If you can access the instance itself with the following command without typing a password, the SSH setup is done.

ssh localhost

Step 6. Create AMI

From this point, we will copy the instance.

If you right-click the instance, you can find the Create Image option.
Just name it 'hadoop-node'.
You will find the new image in the AMI section.

Step 7. Launch Two more instances

Now, with the image that we made, we can launch exactly the same instances.

  • Select t2.medium
  • Set 2 instances
  • Select the same security group
If the image is ready, launch the new instances with the settings above. Okay, two more instances are launching; I named the three instances master, slave1, and slave2 respectively.
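(As a side note, the same launch can also be done from the AWS CLI. This is not how the original post does it, and the AMI ID, key pair name, and security group ID below are placeholders, but it would look roughly like this.)

# Placeholders: replace ami-xxxxxxxx, your-key-pair, and sg-xxxxxxxx with your own values.
aws ec2 run-instances --image-id ami-xxxxxxxx --count 2 \
    --instance-type t2.medium --key-name your-key-pair \
    --security-group-ids sg-xxxxxxxx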

Step 8. Update the security group

It is really important to use the security group properly: open the ports the cluster needs, and limit access from anywhere else.

If you just open all the ports to the world, you will get DoS attacks from random people on the internet. (It really happened to me, and your AWS bill can easily go over $100.)

Go to the Security Groups page and choose the group that we made for the Hadoop cluster.
Here is the important part: we open all the ports by choosing 'All traffic', but set the source to this same security group, so only the cluster nodes can reach them.
You just need to type 'hadoop' into the source field and it will auto-complete the full group name.

Step 9. Make an SSH connection

Okay, when I was following other tutorials, I used to skip this part, and I always regretted it. Please use hostnames instead of IP addresses.

Back to the first instance again.

sudo vi /etc/hosts
This is really important! You should use the private IP of each instance.

Add one line per node, mapping each private IP to its hostname (master, slave1, slave2), as in the sketch below.
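The IP addresses below are placeholders; substitute the actual private IPs of your master, slave1, and slave2 instances (you can find them in the EC2 console). The file would look roughly like this:

127.0.0.1 localhost
172.31.0.101 master
172.31.0.102 slave1
172.31.0.103 slave2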

Then you should be able to connect to each instance using the following commands without typing a password.

ssh slave1
ssh slave2

To escape from the slave and go back to master, just type 'exit'.

Lastly, update the slaves' hosts files as well with the following commands.

cat /etc/hosts | ssh slave1 "sudo sh -c 'cat >/etc/hosts'"
cat /etc/hosts | ssh slave2 "sudo sh -c 'cat >/etc/hosts'"

Step 10. Test Hadoop Cluster

First, you need to initialize HDFS.

hdfs namenode -format
If the format completes without errors, it is working fine.

And start the HDFS related processes.

start-dfs.sh
If you type jps on the master, you should see the NameNode and SecondaryNameNode processes.
If you log in to one of the slaves and type jps, you will see DataNode instead.
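For example, a quick check (with the expected process names as described above):

jps          # on the master: NameNode, SecondaryNameNode
ssh slave1
jps          # on a slave: DataNode
exit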

Make a directory and check if it is made.

hadoop fs -mkdir /test
hadoop fs -ls /
If the listing shows the /test directory, it is working.

To see the data nodes, you can use the command below.

hdfs dfsadmin -report
The two data nodes hold exactly the same amount of data (24 KB) since our replication factor is 2.

Now, let's start YARN with the following command.

start-yarn.sh
From the master node, jps will now also show the ResourceManager; from a slave node, you will see DataNode and NodeManager.

You should see two slave nodes if you type the following command.

yarn node -list

Okay, let's test whether YARN is working properly.

First, upload a file to HDFS with the following command.

hadoop fs -put /usr/local/hadoop/LICENSE.txt /test/

And run the wordcount example with the command below.

yarn jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.1.jar wordcount hdfs:///test/LICENSE.txt /test/output

Finally, read the output files:

hadoop fs -text /test/output/*
You should see the word counts, just as in the local test from Step 2.

Step 11 (Optional). Test MRjob

https://github.com/Yelp/mrjob
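The original post does not include the commands for this step. As a rough sketch only (assuming Python 2 with pip is available on the master and using the introductory word-count example from the mrjob documentation), it could look something like this:

sudo apt-get install -y python-pip
pip install --user mrjob

# mr_word_count.py -- the introductory example from the mrjob documentation
cat > mr_word_count.py <<'EOF'
from mrjob.job import MRJob

class MRWordFrequencyCount(MRJob):

    def mapper(self, _, line):
        yield "chars", len(line)
        yield "words", len(line.split())
        yield "lines", 1

    def reducer(self, key, values):
        yield key, sum(values)

if __name__ == '__main__':
    MRWordFrequencyCount.run()
EOF

# Run it on the cluster through mrjob's hadoop runner; it reads the file we
# uploaded to HDFS in Step 10 and prints the counts to stdout.
python mr_word_count.py -r hadoop hdfs:///test/LICENSE.txt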

Don't forget to stop/terminate the instances when you are done, if you don't want to pay unnecessarily.

Thank you!



19 thoughts on “Hadoop 101: Multi-node installation using AWS EC2”

  1. I was able to set up my cluster and run the wordcount example on it the first time. I followed your blog entirely. Then I stopped my EC2 instances and restarted them the next day. Then, I again started the hdfs and yarn … but now I see that only the NameNode and SecondaryNameNode start. The DataNodes do not start. What could be the reason? The private IPs of the EC2 instances have not changed after stopping / restarting. I reset the passwordless-SSH once again to check, but it did not help. Does something come to your mind? Thank you.

    1. Hi, I haven’t really checked whether it works after stopping and restarting the instances.
      However, I am guessing you may have stopped the instances without stopping the HDFS and YARN processes first, so there is some unsynchronized(?) data left over in the data nodes.

      I recommend deleting the data in the data nodes with the following commands.
      Stop the processes first.
      (from master node)
      stop-dfs.sh
      stop-yarn.sh

      And remove the data file.
      (from all nodes)
      rm -rf /usr/local/hadoop/data/dataNode

      Try to start from STEP 10 again.

      If it doesn’t work please let me know, I will test from my side.

    1. Normally the scripts find it automatically, since Hadoop is already executable after Step 2.
      Unless you put the rest of the Hadoop files in an unusual location, it will be fine.

      If you want to specify HADOOP_HOME explicitly, there are a few options.

      You can add it temporarily using this command:

      export HADOOP_HOME=/usr/local/hadoop

      or you can add it permanently by appending the command above to the bottom of “~/.bashrc”,

      or you can add the path to /etc/environment like JAVA_HOME.
      (You shouldn’t include the export keyword in that case.)

      Again, the path can be different depending on where you put the Hadoop files.

  2. I got an error when I ran “hdfs namenode -format”. The error message is shown below:

    /usr/local/hadoop/bin/hdfs: line 319: /usr/lib/jvm/java-8-oracle/jre/bin/java: No such file or directory

    How do I fix this issue? Thank you.

    1. Hi!
      Did you install Java using the following commands?

      sudo add-apt-repository ppa:webupd8team/java
      sudo apt-get update && sudo apt-get install -y build-essential python oracle-java8-set-default

      If you did, what do you see when you type

      java -version
