AWS EMR Tutorial – Part 1

AWS EMR Tutorial – Part 1

Hello!

We have set up the Hadoop environment from the previous post.

And YES! It IS a hassle unless you need your own tuned version of the environment.

Therefore, I’ll introduce a more convenient way to use Hadoop environment from this post.

We’ll test MRjob or PySpark using AWS EMR.

In part 1 we’ll launch the EMR and use it very naively (static instances and using HDFS).

From part 2 we’ll use EMR more correctly (?) (using AWS CLI and S3)

Again, if you haven’t get the student credit, check this link.



STEP 1. Launch EMR Cluster

Find the service
Click create cluster
Try to choose the default values that your console gives you.
If you already have a key pair then let’s just use that.
Let’s wait a bit!
If you see the ‘Waiting’ sign, then now it’s running.

STEP 2. ACCESSING MASTER NODE

In my opinion, EMR is not encouraging to access master node using SSH.

However, we’ll try that anyway in this Part 1 post. Then we’ll practice the recommended or proper way of using EMR.

Before we get started, we need to understand the concept of EMR.

This is not a new service. Technically, we’re just deploying a fully configured Hadoop cluster using EC2 instances. Therefore, if you go to your EC2 console after you create the cluster, you will see the three instances that they’re running.

Three instances are running

Now, it’s just like the way that we use AWS EC2 instances.

To access the master node, we need to check if the security group has opened the 22 port.

Click the security group in the red box.

Choose the master related security group (you can recognize it from the group name).
And add 22 port additionally like this picture.

Okay, now let’s try to connect. You can find the auto-generated ssh command from the console like the picture below.

If you click the redbox you will find the command.
Copy and Paste, but mind the path of the key file.
Now we’re in the master node.
You can check the status of HDFS.
You can check the status of Yarn.
You can check the contents in HDFS

Yes, just like a regular Hadoop cluster.

However, there is something annoying.

You can’t use git from this node.

It’s not installed.

You can’t install either since ‘hadoop’ is not a root account.

But, don’t worry.

Let’s login again with different account

use ‘ec2-user’ instead of ‘hadoop’
Then install git with the following command.

After you finish the installation, log out again, and log in with the account name “hadoop”.

STEP 3. TESTING SAMPLE CODE

Let’s test the word frequent example again, with couple gigabytes of file.

git clone https://github.com/thejungwon/AWS_EMR
cd AWS_EMR

curl -L -o actors.list https://www.dropbox.com/s/vofyl0uryectfyt/actors.list?dl=1
hadoop fs -mkdir /test
hadoop fs -put actors.list /test/actors.list

virtualenv -p python3 venv
. venv/bin/activate
pip install mrjob
cp .mrjob.conf ~/.mrjob.conf
python mr_word_freq_count.py -r hdfs:///test/actors.list --output-dir hdfs:///test/output

*It’s important to copy and paste the ‘.mrjob.conf file. Otherwise, you’ll experience an error.

I’m assuming that you’re familiar with basic mrjob task.

If you complete the all the commands above you will just see normal map-reduce output log.

Result logs
You can also monitor in the AWS console.

STEP 4. (OPTIONAL) CONFIGURING PARAMETERS

This is the number of map and reduce tasks information without additional configuration.

Usually, the performance of MRjob can be improved with finding optimal prameter for several configuration.

You need some insight into the system to find the right parameter, but we can still try with some arbitrary changes.

1. Number of mappers and reducers

With passing number of reducer or mapper to MRjob as shown in below command, you can change that setting.

python mr_word_freq_count.py -r hadoop --jobconf mapreduce.job.reduces=15 hdfs:///test/actors.list --output-dir hdfs:///test/output01
python mr_word_freq_count.py -r hadoop --jobconf mapreduce.job.maps=30 hdfs:///test/actors.list --output-dir hdfs:///test/output02
python mr_word_freq_count.py -r hadoop --jobconf mapreduce.job.maps=30 --jobconf mapreduce.job.reduces=15 hdfs:///test/actors.list --output-dir hdfs:///test/output03

As usual in computer science, a high number of mappers and reducers doesn’t always guarantee better performance.

2. Number of replicas

When you check the file in HDFS, the number in the redbox means the number of replicas.

You can change the number of replicas of a particular file using the following command.

hadoop fs –setrep –w 2 /test/actors.list
Then try to run the example code if there are some changes.

But you may think, “What’s the relation between # of replica and performance?”. I will leave this question for you.

3. Number of worker nodes

In fact, this is the most effective way. The number of additional workers will guarantee better performance as long as the file is big enough.

I recommend you try the first and second configuration again after you increase the number of worker nodes.

You can resize the number of worker nodes by clicking that redbox.
Let’s go for four.
It takes some time.

It’s important to distribute the data blocks to all the nodes. Otherwise, only some nodes will work, and you won’t experience any performance improvement.

hdfs balancer
I have run this command several times so that the log might be a bit different.

Then all workers will participate!

This is just my own way to see if all nodes are working together.

STEP 5. (OPTIONAL) THINGS TO THINK ABOUT

How can we decide the number of mappers (map tasks) or reducers (reduce tasks)? Based on the number of worker nodes? Based on the number of processors?

What’s the ideal ratio between the number of mappers and reducers? Does it even exist?

If we have 10 workers, is it always better to set 10 replicas?

If we have a total of 10 cores or 10 workers with single cores, is there any benefit to having more than ten reducers?


Finishing

No stop button

Okay, we have tested a very naive way of using EMR.

Don’t forget to stop the EMR cluster!

Wait a minute, but there is no stop button (only terminate).

If we terminate the cluster, then all the data will vanish.

That’s why we need to use S3 as our persistent storage.

We’ll find out from the next post how we can use S3.

See you later!

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

%d bloggers like this: