Hadoop 101: Multi-node installation using AWS EC2

In this post, we will build a multi-node Hadoop cluster using three EC2 instances (one master, two slaves).

(I will assume that you know how to use AWS. If you don't, please check this link.)



To run MapReduce tasks properly, you need enough memory, so we will use t2.medium instances.

(If you are a student and need some free credit, check this link.)

  • AWS EC2 t2.medium×3 (1 for a name node, 2 for data nodes)
  • Ubuntu 18.04
  • Hadoop 3.1.1

Here is the plan.
First, we will do as much of the setup as possible on a single instance. Then we will create an image of that instance, and finally launch two more instances from that image.

!!! Don't forget to stop/terminate the instances when you are done if you don't want to pay unnecessarily.

Step 1. Launch EC2 Instance.

Choose Ubuntu 18.04.
Choose t2.medium.
Set the storage size to 15 GB.
[Important] Create a security group with a new name, and open only port 22.
Then click "Review and Launch", and the instance will start.
If you can connect to the instance over SSH, step 1 is done.

Remember that we will set up the Hadoop environment as much as possible on this instance.
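
If you prefer the command line, the same launch can be sketched with the AWS CLI. This is only an illustration, not part of the original post: the AMI ID, key pair name, and security group ID below are placeholders you would replace with your own values.

# A sketch of an equivalent AWS CLI call (placeholder IDs; /dev/sda1 is the
# root device name used by the Ubuntu 18.04 AMIs)
aws ec2 run-instances \
  --image-id ami-xxxxxxxx \
  --instance-type t2.medium \
  --key-name your-key-pair \
  --security-group-ids sg-xxxxxxxx \
  --block-device-mappings '[{"DeviceName":"/dev/sda1","Ebs":{"VolumeSize":15}}]'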

Step 2. Update the Packages and Download Hadoop

Before we install Hadoop, we need to prepare a basic working environment.

Type the following commands to install the relevant software and packages.

sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update && sudo apt-get install -y build-essential python oracle-java8-set-default

If you face any prompts, just type yes or OK and accept them.
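
Once the installation finishes, you can confirm that Java is available (a quick check of my own, not a step from the original post):

java -version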

Now, let’s download Hadoop and set it up.

wget http://apache.claz.org/hadoop/common/hadoop-3.1.1/hadoop-3.1.1.tar.gz
tar -xzvf hadoop-3.1.1.tar.gz
sudo mv hadoop-3.1.1 /usr/local/hadoop
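
Depending on how the data directories are configured in Step 3, the ubuntu user may also need write access to the Hadoop directory. The original post does not mention this, but if you hit permission errors later, taking ownership is a common fix (an assumption on my part):

sudo chown -R ubuntu:ubuntu /usr/local/hadoop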

Then, we have to set the PATH for Hadoop and Java.

sudo vi /etc/environment

Replace the contents with the following PATH and JAVA_HOME.

PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/usr/local/hadoop/bin:/usr/local/hadoop/sbin"
JAVA_HOME="/usr/lib/jvm/java-8-oracle/jre"

Then, apply the new PATH using the following command.

source /etc/environment

Okay, now the basic installation is done. You can test a basic MapReduce job using the following commands.

hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.1.jar wordcount /usr/local/hadoop/LICENSE.txt ~/output
cat ~/output/part-r-*
Great!

! If you see an error saying that JAVA_HOME is not set, just export JAVA_HOME again:

export JAVA_HOME=/usr/lib/jvm/java-8-oracle/jre
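
The export above only lasts for the current shell. If the error keeps coming back (for example after reconnecting), a common approach, not covered in the original post, is to set JAVA_HOME in Hadoop's own environment file as well:

echo 'export JAVA_HOME=/usr/lib/jvm/java-8-oracle/jre' | sudo tee -a /usr/local/hadoop/etc/hadoop/hadoop-env.sh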

Step 3. Set up the Hadoop Environment

Based on my experience, the most important thing in this step is to double- and triple-check your configuration. Edit the following files:

sudo vi /usr/local/hadoop/etc/hadoop/hdfs-site.xml
sudo vi /usr/local/hadoop/etc/hadoop/core-site.xml
sudo vi /usr/local/hadoop/etc/hadoop/yarn-site.xml
sudo vi /usr/local/hadoop/etc/hadoop/mapred-site.xml
sudo vi /usr/local/hadoop/etc/hadoop/workers
sudo vi /usr/local/hadoop/etc/hadoop/masters
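
The original post shows the contents of each file in screenshots. As a reference, here is a minimal sketch of what they typically contain for this layout: the hostnames master, slave1, and slave2 come from Step 9, the replication factor of 2 is what Step 10 relies on, and the data directory paths match the ones mentioned in the comments below but are otherwise my assumption. Your exact values may differ.

hdfs-site.xml:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/usr/local/hadoop/data/nameNode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/usr/local/hadoop/data/dataNode</value>
  </property>
</configuration>

core-site.xml:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000</value>
  </property>
</configuration>

yarn-site.xml:

<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>master</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>

mapred-site.xml:

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

workers (one slave hostname per line):

slave1
slave2

masters:

master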

Step 5. Set up SSH

I think this is the most confusing part for people who are not familiar with Linux or system administration.

The purpose of this step is to give the master node SSH access to the slave nodes. (It doesn't need to work the other way around.)

First, generate a key pair using the following command. (Just press Enter when the instance asks you to type something.)

ssh-keygen -t rsa

And append it to ‘authorized_keys’.

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

If you can SSH into the instance itself without typing a password using the following command, the SSH setup is done.

ssh localhost

Step 6. Create AMI

From this point, we will copy the instance.

If you right-click the instance, you will find the Create Image option.
Name the image 'hadoop-node'.
You can find the new image in the AMI section.
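
(If you prefer the CLI, the same can be done with something like the command below; the instance ID is a placeholder.)

aws ec2 create-image --instance-id i-xxxxxxxx --name hadoop-node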

Step 7. Launch Two more instances

Now, with the image that we made, we can launch two identical instances.

  • Select t2.medium
  • Set the number of instances to 2
  • Select the same security group that you created before

Once the image is ready, launch the new instances with those settings.
Okay, two more instances are launching. I named the three instances master, slave1, and slave2, respectively.

Step 8. Update the security group

It is really important to use security groups properly: open only the ports you need and limit who can access them.

If you just open all the ports to the world, you may get DoS attacks from random people. (It really happened to me, and your AWS bill can easily go over $100.)

Go to Security Groups and choose the group that we made for the Hadoop cluster.
Here is the important part: we open all the ports by choosing 'All traffic', but the source is restricted to the nodes that use this same security group.
You just need to search for 'hadoop' in the source field, and it will auto-complete the full group name.

Step 9. Make an SSH connection

Okay, when I was following other tutorials, I used to skip this part, and I always regretted it. Please use hostnames instead of IP addresses.

Go back to the first instance.

sudo vi /etc/hosts

This is really important! You should use the private IP of each instance.
Name the nodes as in lines 2 to 4 of the example below.
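
For example, it could look like this (the private IPs are placeholders; use your own):

127.0.0.1   localhost
172.31.0.10 master
172.31.0.11 slave1
172.31.0.12 slave2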

Then you should be able to connect to each instance with the following commands without typing a password.

ssh slave1
ssh slave2

To leave a slave and go back to the master, just type 'exit'.

Lastly, update the hosts file on each slave as well with the following commands.

cat /etc/hosts | ssh slave1 "sudo sh -c 'cat >/etc/hosts'"
cat /etc/hosts | ssh slave2 "sudo sh -c 'cat >/etc/hosts'"

Step 10. Test Hadoop Cluster

First, you need to initialize HDFS.

hdfs namenode -format
If the format completes without errors, it is working fine.

And start the HDFS-related processes.

start-dfs.sh
If you type jps on the master, you should see NameNode and SecondaryNameNode.
If you SSH into one of the slaves and type jps, you will see DataNode instead.

Make a directory and check that it was created.

hadoop fs -mkdir /test
hadoop fs -ls /
If the new directory shows up in the listing, it is working.

To see the datanodes, you can type the command below.

hdfs dfsadmin -report
The two datanodes hold exactly the same amount of data (24 KB) since our replication factor is 2.

Now, let's start YARN with the following command.

start-yarn.sh
From the master node, you will now also see a ResourceManager process; from the slave nodes, you will see DataNode and NodeManager.

You should see two slave nodes if you type the following command.

yarn node -list

Okay, let's test whether YARN is working properly.

First, upload a file to HDFS with the following command.

hadoop fs -put /usr/local/hadoop/LICENSE.txt /test/

And run the wordcount example with the command below.

yarn jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.1.jar wordcount hdfs:///test/LICENSE.txt /test/output

If you read the output,

hadoop fs -text /test/output/*

you should see the word counts.

Step 11 (Optional). Test mrjob

https://github.com/Yelp/mrjob
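
The original post only links to the project, so here is a minimal sketch of how you might try it on this cluster. The install commands, the script name, and the mapper/reducer below are my own additions (the standard word-count pattern), not something taken from the post.

sudo apt-get install -y python-pip
pip install --user mrjob

wordcount_mrjob.py:

from mrjob.job import MRJob

class MRWordCount(MRJob):

    def mapper(self, _, line):
        # Emit (word, 1) for every word in the input line.
        for word in line.split():
            yield word.lower(), 1

    def reducer(self, word, counts):
        # Sum all the 1s emitted for each word.
        yield word, sum(counts)

if __name__ == '__main__':
    MRWordCount.run()

Then run it with the Hadoop runner against the file we uploaded in Step 10:

python wordcount_mrjob.py -r hadoop hdfs:///test/LICENSE.txt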

Don't forget to stop/terminate the instances when you are done if you don't want to pay unnecessarily.

Thank you!



41 thoughts on "Hadoop 101: Multi-node installation using AWS EC2"

  1. I was able to set up my cluster and run the wordcount example on it the first time. I followed your blog entirely. Then I stopped my EC2 instances and restarted them the next day. Then, I again started the hdfs and yarn … but now I see that only the NameNode and SecondaryNameNode start. The DataNodes do not start. What could be the reason? The private IPs of the EC2 instances have not changed after stopping / restarting. I reset the passwordless-SSH once again to check, but it did not help. Does something come to your mind? Thank you.

    1. Hi, I haven't really checked whether it keeps working after stopping and restarting the instances.
      However, I am guessing you may have stopped the instances without stopping the HDFS and YARN processes first, so there is some unsynchronized(?) data left over in the data nodes.

      I recommend deleting the data in the data nodes with the following commands.
      Stop the processes first.
      (from master node)
      stop-dfs.sh
      stop-yarn.sh

      And remove the data directory.
      (from all nodes)
      rm -rf /usr/local/hadoop/data/dataNode

      Try to start from STEP 10 again.

      If it doesn’t work please let me know, I will test from my side.

    1. Normally the Hadoop scripts find it automatically, since Hadoop is already executable after step 2.
      Unless you put the rest of the Hadoop files in an unusual location, it will be fine.

      If you want to specify HADOOP_HOME:

      You can add it temporarily using this command.

      export HADOOP_HOME=/usr/local/hadoop

      or you can add it permanently by appending the command above to the bottom of "~/.bashrc",

      or you can add the path to /etc/environment like JAVA_HOME.
      (You shouldn't include the export keyword in that case.)

      Again, the path may differ depending on where you put the Hadoop files.

  2. I got an error when I ran "hdfs namenode -format". The error message is shown below:

    /usr/local/hadoop/bin/hdfs: line 319: /usr/lib/jvm/java-8-oracle/jre/bin/java: No such file or directory

    How do I fix this issue? Thank you.

    1. Hi!
      Did you install Java using the following commands?

      sudo add-apt-repository ppa:webupd8team/java
      sudo apt-get update && sudo apt-get install -y build-essential python oracle-java8-set-default

      If you did, what do you see when you type

      java -version

  3. When I tried to execute start-yarn.sh from the master, the console displayed this error:
    start-yarn.sh
    Starting resourcemanager
    ERROR: Cannot set priority of resourcemanager process 15218
    Starting nodemanagers
    slave1: ERROR: Cannot set priority of nodemanager process 18726
    slave2: ERROR: Cannot set priority of nodemanager process 8501
    I didn't find a solution in any of the forums that I visited.

  4. Hello, I am a student studying Hadoop YARN to try a project with it~
    This post has been a great reference. One thing I am curious about: is the Hadoop cluster formed just by entering the IPs into /etc/hosts in step 9?

    1. Hello! Step 9 is only where master, slave1, and slave2 are mapped (or aliased) to their respective IP addresses; the actual cluster configuration happens in step 3. The master, slave1, and slave2 used there ultimately stand for IP addresses, but since typing the raw addresses everywhere is inefficient, the configuration is usually written with hostnames like this.

      1. Ah, so the cluster is put together by what you enter in workers / masters in step 3, and step 9 is where you record the actual IPs behind those workers and masters?

        I am teaching myself for a project, and since there is so little information about Hadoop I have been struggling; this post is like a ray of light. Thank you!

  5. Great post. One question about the last part: after starting YARN and running wordcount, is the processing actually distributed across the connected nodes?
