In this post, we will build a Hadoop cluster using three EC2 instances (one master, two slaves).
(I will assume that you know how to use AWS. If you don’t know please check this link)
To run MapReduce tasks, you need enough memory. Therefore, we will use t2.medium instances.
(If you are students and need some free credit, check this link.)
- AWS EC2 t2.medium×3 (1 for a name node, 2 for data nodes)
- Ubuntu 18.04
- Hadoop 3.1.1
The plan is to do all the settings on one instance first, then make an image of that instance, and then launch two more instances from that image.
!!!Don’t forget to stop/terminate the instances after your job if you don’t want to pay unnecessarily.
Step 1. Launch EC2 Instance.
Remember that we will set the environment as much as possible on this instance first.
Step 2. Update the Packages and Download Hadoop
Before we get started installing Hadoop, we need to prepare a basic working environment.
Type the following commands to install the relevant software and packages.
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update && sudo apt-get install -y build-essential python oracle-java8-set-default
If you see any prompts, just accept them (type yes or OK).
Now, let’s download Hadoop and set it up.
wget http://apache.claz.org/hadoop/common/hadoop-3.1.1/hadoop-3.1.1.tar.gz
tar -xzvf hadoop-3.1.1.tar.gz
sudo mv hadoop-3.1.1 /usr/local/hadoop
Then, we have to set the paths for Hadoop and Java.
sudo vi /etc/environment
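The original post shows the file contents as an image, so here is a rough sketch: assuming Hadoop was moved to /usr/local/hadoop and Oracle Java 8 was installed from the PPA above, /etc/environment might look like this (double-check your actual Java path with `readlink -f $(which java)`):

```
PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/hadoop/bin:/usr/local/hadoop/sbin"
JAVA_HOME="/usr/lib/jvm/java-8-oracle"
```

Log out and back in so the new PATH takes effect.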
Okay, now the basic installation is done. You can test a basic MapReduce program with the following commands.
/usr/local/hadoop/bin/hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.1.jar wordcount /usr/local/hadoop/LICENSE.txt ~/output
cat ~/output/part-r-*
Step 3. Set up the Hadoop Environment
Based on my experience, the most important thing in this step is simply double- and triple-checking every file.
sudo vi /usr/local/hadoop/etc/hadoop/hdfs-site.xml
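The post shows these files as images, so here is a sketch of what typically goes in hdfs-site.xml. The directory paths and the replication factor of 2 are assumptions for this two-datanode setup, not values from the original post:

```xml
<configuration>
  <!-- one copy on each of the two data nodes -->
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <!-- example local directories for namenode/datanode storage -->
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///usr/local/hadoop/data/nameNode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///usr/local/hadoop/data/dataNode</value>
  </property>
</configuration>
```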
sudo vi /usr/local/hadoop/etc/hadoop/core-site.xml
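core-site.xml mainly needs the default filesystem URI. As a sketch, the hostname `master` assumes the /etc/hosts entries we will add in Step 9:

```xml
<configuration>
  <!-- HDFS namenode address; "master" must resolve via /etc/hosts -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000</value>
  </property>
</configuration>
```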
sudo vi /usr/local/hadoop/etc/hadoop/yarn-site.xml
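A minimal yarn-site.xml sketch, again assuming the master's hostname is `master`:

```xml
<configuration>
  <!-- where the ResourceManager runs -->
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>master</value>
  </property>
  <!-- required for the MapReduce shuffle phase -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
```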
sudo vi /usr/local/hadoop/etc/hadoop/mapred-site.xml
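For mapred-site.xml, the essential setting is to run MapReduce on YARN. In Hadoop 3.x the jobs also need HADOOP_MAPRED_HOME, otherwise the example jar tends to fail with a class-not-found error; a sketch assuming our install path:

```xml
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>yarn.app.mapreduce.am.env</name>
    <value>HADOOP_MAPRED_HOME=/usr/local/hadoop</value>
  </property>
  <property>
    <name>mapreduce.map.env</name>
    <value>HADOOP_MAPRED_HOME=/usr/local/hadoop</value>
  </property>
  <property>
    <name>mapreduce.reduce.env</name>
    <value>HADOOP_MAPRED_HOME=/usr/local/hadoop</value>
  </property>
</configuration>
```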
sudo vi /usr/local/hadoop/etc/hadoop/workers
sudo vi /usr/local/hadoop/etc/hadoop/masters
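The workers file lists the hostnames of the data nodes, one per line, and the masters file lists the name node (these hostnames again assume the /etc/hosts entries from Step 9). workers:

```
slave1
slave2
```

masters:

```
master
```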
Step 5. Set SSH
I think this is the most confusing part for people who are not familiar with Linux or system administration.
The purpose of this step is to give the master node SSH access to the slave nodes. It doesn't need to work the other way around.
Firstly, generate the key with the following command. (Just press Enter at each prompt.)
ssh-keygen -t rsa
And append it to ‘authorized_keys’.
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
If you can SSH into the instance itself without typing a password, using the command below, you are done with the SSH setting.
ssh localhost
Step 6. Create AMI
From this step, we will copy the instance.
Step 7. Launch Two more instances
Now, with the image we made, we can launch instances that are exactly the same.
- Select t2.medium
- Set the number of instances to 2
- Select the same security group
Step 8. Update the security group
It is really important to use security groups properly: open only the ports you need and limit access.
If you just open all the ports to the world, you will be attacked by random people on the internet. (It really happened to me, and your AWS bill can easily exceed $100.)
Step 9. Make a ssh connection
Okay, when I was following other tutorials, I used to skip this part, and I always regretted it. Please use hostnames instead of IP addresses.
sudo vi /etc/hosts
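Add one line per node, mapping each instance's private IP to a hostname. The IPs below are placeholders; use your instances' private IPs from the EC2 console:

```
172.31.0.10 master
172.31.0.11 slave1
172.31.0.12 slave2
```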
Then, you should be able to connect to each instance without typing a password, using the following commands.
ssh slave1
ssh slave2
To leave the slave and go back to the master, just type 'exit'.
Lastly, update each slave instance's hosts file as well with the following commands.
cat /etc/hosts | ssh slave1 "sudo sh -c 'cat >/etc/hosts'"
cat /etc/hosts | ssh slave2 "sudo sh -c 'cat >/etc/hosts'"
Step 10. Test Hadoop Cluster
First, you need to initialize HDFS.
hdfs namenode -format
And start the HDFS daemons with the start-dfs.sh script that ships with Hadoop.
/usr/local/hadoop/sbin/start-dfs.sh
Make a directory and check that it was created.
hadoop fs -mkdir /test
hadoop fs -ls /
To see the data nodes, you can use the command below.
hdfs dfsadmin -report
Now, let's start YARN with the start-yarn.sh script.
/usr/local/hadoop/sbin/start-yarn.sh
You should see two slave nodes if you type the following command.
yarn node -list
Okay, let's test whether YARN is working properly.
First, upload a file to HDFS with the following command.
hadoop fs -put /usr/local/hadoop/LICENSE.txt /test/
And run the wordcount example with the command below.
yarn jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.1.jar wordcount hdfs:///test/LICENSE.txt /test/output
You can read the result with the command below.
hadoop fs -text /test/output/*
Step 11(Optional). Test MRjob
If you are planning to work with MRjob, you can try the following commands.
git clone https://github.com/Yelp/mrjob.git
cd mrjob/
sudo python setup.py install
python mrjob/examples/mr_word_freq_count.py README.rst -r hadoop > counts
Don't forget to stop/terminate the instances after your job if you don't want to pay unnecessarily.