In this post, we will build a multi-node Hadoop cluster using three EC2 instances (one master and two slaves).
(I will assume that you know how to use AWS. If you don’t, please check this link)
To run MapReduce tasks properly, you need enough memory. Therefore, we will use t2.medium instances.
(If you are a student and need some free credit, check this link.)
- AWS EC2 t2.medium×3 (1 for a name node, 2 for data nodes)
- Ubuntu 18.04
- Hadoop 3.1.1
Here is the plan.
First, we will do all the setup on one instance. Then we will make an image of that instance and launch two more instances from that image.
!!!Don’t forget to stop/terminate the instances after your job if you don’t want to pay unnecessarily.
Step 1. Launch EC2 Instance.
Remember that we will set up as much of the Hadoop environment as possible on this instance.
Step 2. Update the Packages and Download Hadoop
Before we start installing Hadoop, we need to prepare a basic working environment.
Type the following commands to install the relevant software and packages.
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update && sudo apt-get install -y build-essential python oracle-java8-set-default
If any prompts appear, read them and accept them (type yes or select OK).
Now, let’s download Hadoop and set it up.
wget http://apache.claz.org/hadoop/common/hadoop-3.1.1/hadoop-3.1.1.tar.gz
tar -xzvf hadoop-3.1.1.tar.gz
sudo mv hadoop-3.1.1 /usr/local/hadoop
Then, we have to set the path for Hadoop and Java
sudo vi /etc/environment
Replace the content with the following PATH.
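A minimal sketch of what the file might contain, assuming Hadoop lives in /usr/local/hadoop (as set up above) and that the Oracle Java 8 package installed itself under /usr/lib/jvm/java-8-oracle. The Java path and the base PATH entries are assumptions, so check them against your own instance. The sketch writes to a scratch file rather than touching /etc/environment directly.

```shell
# Stage the new contents in a scratch file so they can be reviewed first.
env_file="$(mktemp)"

# /etc/environment takes plain KEY="value" lines; variables are not expanded,
# so the full PATH has to be spelled out. The Java path is an assumption.
cat > "$env_file" <<'EOF'
PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/usr/local/hadoop/bin:/usr/local/hadoop/sbin"
JAVA_HOME="/usr/lib/jvm/java-8-oracle"
EOF

cat "$env_file"
# After reviewing: sudo cp "$env_file" /etc/environment
```

Adding both the bin and sbin directories means the hadoop and hdfs commands as well as the cluster start scripts can be run without typing full paths.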
Then, apply the new PATH to the current shell using the following command.
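Because /etc/environment holds plain KEY="value" lines, one way to apply it without logging out is to source it in the current shell. This is a sketch of that approach, not necessarily the only one; logging out and back in should have the same effect.

```shell
# Read the KEY="value" lines into the current shell session.
if [ -r /etc/environment ]; then
  . /etc/environment
fi

# Confirm that the Hadoop directories now appear in PATH.
echo "$PATH"
```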
Okay, the basic installation is done. You can test it with a simple MapReduce job using the following commands.
hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.1.jar wordcount /usr/local/hadoop/LICENSE.txt ~/output
cat ~/output/part-r-*
If you see an error saying that JAVA_HOME is not set, just export JAVA_HOME again.
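A minimal sketch of that export, assuming Java was installed under /usr/lib/jvm/java-8-oracle by the Oracle Java 8 package. That path is an assumption; if yours differs, list /usr/lib/jvm to find the right directory.

```shell
# Assumed install location of Oracle Java 8; adjust if yours differs.
export JAVA_HOME=/usr/lib/jvm/java-8-oracle

# The variable must be exported so that child processes can see it.
echo "JAVA_HOME is now $JAVA_HOME"
```

The export matters because the hadoop launcher runs as a child process and only sees variables that the shell has exported.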
Step 3. Set up Hadoop Environment
In my experience, the most important thing in this step is to double-check and triple-check everything you type.
sudo vi /usr/local/hadoop/etc/hadoop/hdfs-site.xml
sudo vi /usr/local/hadoop/etc/hadoop/core-site.xml
sudo vi /usr/local/hadoop/etc/hadoop/yarn-site.xml
sudo vi /usr/local/hadoop/etc/hadoop/mapred-site.xml
sudo vi /usr/local/hadoop/etc/hadoop/workers
sudo vi /usr/local/hadoop/etc/hadoop/masters
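As a rough guide to what these six files might contain, here is a minimal sketch. The property names are standard Hadoop settings, but every value is an assumption: the hostname master, port 9000, and a replication factor of 2 (one copy per data node) are illustrative choices, while slave1 and slave2 match the hostnames used later in this post. The files are staged in a scratch directory so nothing under /usr/local/hadoop is overwritten by accident.

```shell
# Stage the files in a scratch directory; copy them into place once they look right.
conf_dir="$(mktemp -d)"

# core-site.xml: where clients find the name node (hostname and port are assumptions).
cat > "$conf_dir/core-site.xml" <<'EOF'
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000</value>
  </property>
</configuration>
EOF

# hdfs-site.xml: keep two copies of each block, one per data node.
cat > "$conf_dir/hdfs-site.xml" <<'EOF'
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>
EOF

# yarn-site.xml: resource manager host and the shuffle service MapReduce needs.
cat > "$conf_dir/yarn-site.xml" <<'EOF'
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>master</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
EOF

# mapred-site.xml: run MapReduce on YARN and tell tasks where Hadoop is installed.
cat > "$conf_dir/mapred-site.xml" <<'EOF'
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>yarn.app.mapreduce.am.env</name>
    <value>HADOOP_MAPRED_HOME=/usr/local/hadoop</value>
  </property>
  <property>
    <name>mapreduce.map.env</name>
    <value>HADOOP_MAPRED_HOME=/usr/local/hadoop</value>
  </property>
  <property>
    <name>mapreduce.reduce.env</name>
    <value>HADOOP_MAPRED_HOME=/usr/local/hadoop</value>
  </property>
</configuration>
EOF

# workers and masters: one hostname per line.
printf 'slave1\nslave2\n' > "$conf_dir/workers"
printf 'master\n' > "$conf_dir/masters"

ls "$conf_dir"
# After reviewing: sudo cp "$conf_dir"/* /usr/local/hadoop/etc/hadoop/
```

Using hostnames rather than IP addresses here means the files stay valid when the instances are restarted and receive new addresses, as long as /etc/hosts is kept up to date.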
Step 5. Set SSH
I think this is the most confusing part for people who are not familiar with Linux or system administration.
The purpose of this step is to give the master node SSH access to the slave nodes. (It does not need to work the other way.)
First, generate a key using the following command. (Just press Enter whenever the instance asks you to type something.)
ssh-keygen -t rsa
And append it to ‘authorized_keys’.
cat >> ~/.ssh/authorized_keys < ~/.ssh/id_rsa.pub
If the following command lets you log in to the instance itself without typing a password, the SSH setup is finished.
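One way to run that check without getting stuck at a prompt is sketched below. It assumes OpenSSH 7.6 or later (needed for the accept-new option) and only reports whether key-based login to localhost works. A plain ssh localhost does the same thing interactively.

```shell
# BatchMode makes ssh fail instead of falling back to a password prompt.
# accept-new stores the host key on first contact instead of asking about it.
if ssh -o BatchMode=yes -o StrictHostKeyChecking=accept-new -o ConnectTimeout=5 localhost true; then
  ssh_status="key login works"
else
  ssh_status="key login failed"
fi
echo "$ssh_status"
```

If the check fails, make sure ~/.ssh/authorized_keys contains the public key that was just generated.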
Step 6. Create AMI
From this part, we will copy the instance.
Step 7. Launch Two more instances
Now, with the image that we made, we can launch instances that are exactly the same.
- Select t2.medium
- Set 2 instances
- Select the same security group
Step 8. Update the security group
It is really important to configure the security group properly: open only the ports the cluster needs and limit who can access them.
If you just open all the ports to the world, you will get DoS attacks from random people. (It really happened to me, and your AWS bill can easily go over $100.)
Step 9. Make an SSH connection
Okay, when I was following other tutorials, I used to skip this part, and I always regretted it. Please use hostnames instead of IP addresses.
Back to the first instance again.
sudo vi /etc/hosts
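The entries to add might look like the sketch below. The three IP addresses are placeholders, not real values; replace them with the private IPs shown for your instances in the EC2 console. The hostname master is an assumption, while slave1 and slave2 are the names used in the commands that follow.

```shell
# Stage the new entries in a scratch file so they can be reviewed first.
hosts_file="$(mktemp)"

# Placeholder private IPs; replace with the real ones from the EC2 console.
cat > "$hosts_file" <<'EOF'
172.31.10.11 master
172.31.10.12 slave1
172.31.10.13 slave2
EOF

cat "$hosts_file"
# After reviewing, append the entries: cat "$hosts_file" | sudo tee -a /etc/hosts
```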
Then you should be able to connect to each instance with the following commands, without typing a password.
ssh slave1
ssh slave2
To escape from the slave and go back to master, just type ‘exit’.
Lastly, copy the hosts file to each slave with the following commands.
cat /etc/hosts | ssh slave1 "sudo sh -c 'cat >/etc/hosts'"
cat /etc/hosts | ssh slave2 "sudo sh -c 'cat >/etc/hosts'"
Step 10. Test Hadoop Cluster
First, you need to initialize HDFS.
hdfs namenode -format
And start the HDFS related processes.
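Hadoop ships a start-dfs.sh script in its sbin directory for this. The sketch below assumes /usr/local/hadoop/sbin is on PATH and that jps from the JDK is available. The guard is only there so the snippet reports a clear message when the script cannot be found.

```shell
# start-dfs.sh lives in /usr/local/hadoop/sbin, assumed to be on PATH.
if command -v start-dfs.sh >/dev/null 2>&1; then
  start-dfs.sh   # starts the NameNode here and a DataNode on each host in the workers file
  jps            # lists the Java daemons now running on this node
  dfs_started=yes
else
  echo "start-dfs.sh not found; is /usr/local/hadoop/sbin on PATH?"
  dfs_started=no
fi
```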
Make a directory and check if it is made.
hadoop fs -mkdir /test
hadoop fs -ls /
To see the data nodes, run the command below.
hdfs dfsadmin -report
Now, let’s start YARN with the following command.
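The matching script for YARN is start-yarn.sh, also found in Hadoop's sbin directory. As before, this is a sketch that assumes /usr/local/hadoop/sbin is on PATH.

```shell
# start-yarn.sh starts the ResourceManager on this node and a NodeManager on each worker.
if command -v start-yarn.sh >/dev/null 2>&1; then
  start-yarn.sh
  yarn_started=yes
else
  echo "start-yarn.sh not found; is /usr/local/hadoop/sbin on PATH?"
  yarn_started=no
fi
```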
You should see the two slave nodes if you type the following command.
yarn node -list
Okay, let’s test if our yarn is working properly.
First, upload the file to HDFS with the following command.
hadoop fs -put /usr/local/hadoop/LICENSE.txt /test/
And run the wordcount example with the command below.
yarn jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.1.jar wordcount hdfs:///test/LICENSE.txt /test/output
Finally, read the output files to see the word counts.
hadoop fs -text /test/output/*
Step 11 (Optional). Test MRJob
Don’t forget to stop/terminate the instances after your job if you don’t want to pay unnecessarily.