We set up the Hadoop environment by hand in the previous post.
And YES! It IS a hassle unless you need your own tuned version of the environment.
Therefore, starting with this post, I’ll introduce a more convenient way to use a Hadoop environment.
We’ll test MRjob or PySpark using AWS EMR.
In Part 1, we’ll launch the EMR cluster and use it very naively (static instances and HDFS).
From Part 2 on, we’ll use EMR more properly (using the AWS CLI and S3).
Again, if you haven’t gotten the student credit yet, check this link.
STEP 1. LAUNCHING EMR CLUSTER
STEP 2. ACCESSING MASTER NODE
In my opinion, EMR does not encourage accessing the master node over SSH.
However, we’ll try it anyway in this Part 1 post. Then, later on, we’ll practice the recommended or proper way of using EMR.
Before we get started, we need to understand the concept of EMR.
This is not a new service. Technically, we’re just deploying a fully configured Hadoop cluster on EC2 instances. Therefore, if you go to your EC2 console after you create the cluster, you will see the three instances running.
Now, it’s just like the way that we use AWS EC2 instances.
To access the master node, we need to check that the security group has port 22 open.
Okay, now let’s try to connect. You can find the auto-generated SSH command in the console, as shown in the picture below.
Yes, just like a regular Hadoop cluster.
However, there is something annoying.
You can’t use git from this node.
You can’t install it yourself either, since ‘hadoop’ is not a root account.
But, don’t worry.
Let’s log in again with a different account.
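On the stock EMR AMI there is usually a sudo-capable `ec2-user` account; the account name, key file, and hostname below are assumptions, so substitute your own. A rough sketch of the detour:

```shell
# Log in with the sudo-capable account (assumed: ec2-user).
ssh -i my-key.pem ec2-user@ec2-xx-xx-xx-xx.compute-1.amazonaws.com

# Install git (EMR AMIs are Amazon Linux, so yum is assumed).
sudo yum install -y git

# Log out again; the hands-on work below is done as 'hadoop'.
exit
```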
After you finish the installation, log out again, and log in with the account name “hadoop”.
STEP 3. TESTING SAMPLE CODE
Let’s test the word-frequency example again, this time with a file of a couple of gigabytes.
git clone https://github.com/thejungwon/AWS_EMR
cd AWS_EMR
curl -L -o actors.list https://www.dropbox.com/s/vofyl0uryectfyt/actors.list?dl=1
hadoop fs -mkdir /test
hadoop fs -put actors.list /test/actors.list
virtualenv -p python3 venv
. venv/bin/activate
pip install mrjob
cp .mrjob.conf ~/.mrjob.conf
python mr_word_freq_count.py -r hadoop hdfs:///test/actors.list --output-dir hdfs:///test/output
*It’s important to copy the ‘.mrjob.conf’ file to your home directory. Otherwise, you’ll run into an error.
I’m assuming that you’re familiar with basic MRjob tasks.
If you complete all the commands above, you will see the normal map-reduce output log.
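In case you haven’t looked inside `mr_word_freq_count.py`, its logic is essentially the classic word count. Here is a plain-Python sketch of the same mapper/reducer idea (no mrjob or Hadoop needed; it only illustrates what the job computes, not how Hadoop runs it):

```python
import re
from collections import defaultdict

WORD_RE = re.compile(r"[\w']+")

def mapper(line):
    # Map phase: emit (word, 1) for every word in the line, lowercased.
    for word in WORD_RE.findall(line):
        yield word.lower(), 1

def reducer(word, counts):
    # Reduce phase: sum all the 1s emitted for this word.
    return word, sum(counts)

def word_freq(lines):
    # "Shuffle" phase: group emitted values by key, then reduce each group.
    groups = defaultdict(list)
    for line in lines:
        for word, one in mapper(line):
            groups[word].append(one)
    return dict(reducer(w, c) for w, c in groups.items())

print(word_freq(["to be or not to be"]))  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

On the cluster, the grouping is done by Hadoop’s shuffle across nodes instead of a dictionary in one process; that’s the whole trick.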
STEP 4. (OPTIONAL) CONFIGURING PARAMETERS
Usually, the performance of an MRjob job can be improved by finding the optimal values for several configuration parameters.
You need some insight into the system to find the right parameter, but we can still try with some arbitrary changes.
1. Number of mappers and reducers
You can change these settings by passing the number of mappers or reducers to MRjob, as shown in the commands below.
python mr_word_freq_count.py -r hadoop --jobconf mapreduce.job.reduces=15 hdfs:///test/actors.list --output-dir hdfs:///test/output01
python mr_word_freq_count.py -r hadoop --jobconf mapreduce.job.maps=30 hdfs:///test/actors.list --output-dir hdfs:///test/output02
python mr_word_freq_count.py -r hadoop --jobconf mapreduce.job.maps=30 --jobconf mapreduce.job.reduces=15 hdfs:///test/actors.list --output-dir hdfs:///test/output03
As usual in computer science, a high number of mappers and reducers doesn’t always guarantee better performance.
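One reason for that: Hadoop treats `mapreduce.job.maps` only as a hint, and the actual number of map tasks is driven by the number of input splits, which by default follows the HDFS block size (typically 128 MB). A quick back-of-the-envelope sketch (the helper name and sizes are mine, for illustration):

```python
import math

def default_map_tasks(file_size_mb, block_size_mb=128):
    # By default, one input split (and thus one map task) per HDFS block.
    return math.ceil(file_size_mb / block_size_mb)

# A 2 GB input file -> 16 map tasks with the default block size.
print(default_map_tasks(2048))  # 16
```

So asking for 30 mappers on a file that only has 16 blocks won’t necessarily give you 30 useful tasks.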
2. Number of replicas
You can change the number of replicas of a particular file using the following command.
hadoop fs -setrep -w 2 /test/actors.list
But you may wonder, “What’s the relation between the number of replicas and performance?” I’ll leave this question for you.
3. Number of worker nodes
In fact, this is the most effective approach. Adding worker nodes will keep improving performance as long as the file is big enough.
I recommend you try the first and second configuration again after you increase the number of worker nodes.
It’s important to distribute the data blocks to all the nodes. Otherwise, only some nodes will work, and you won’t experience any performance improvement.
Then all workers will participate!
STEP 5. (OPTIONAL) THINGS TO THINK ABOUT
How can we decide the number of mappers (map tasks) or reducers (reduce tasks)? Based on the number of worker nodes? Based on the number of processors?
What’s the ideal ratio between the number of mappers and reducers? Does it even exist?
If we have 10 workers, is it always better to set 10 replicas?
If we have a total of 10 cores or 10 workers with single cores, is there any benefit to having more than ten reducers?
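One way to reason about that last question: reduce tasks run in “waves” over the available reduce slots, so tasks beyond the slot count must wait for a later wave. A small sketch, assuming 10 slots (the function and numbers are mine, for illustration):

```python
import math

def reduce_waves(num_reducers, num_slots):
    # Reducers beyond the slot count wait for a second (or third...) wave.
    return math.ceil(num_reducers / num_slots)

slots = 10
for reducers in (10, 11, 15, 20):
    print(reducers, "reducers ->", reduce_waves(reducers, slots), "wave(s)")
```

With 10 slots, 11 reducers already force a second, mostly idle wave, which hints at why slightly exceeding the slot count often hurts rather than helps.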
Okay, we have tested a very naive way of using EMR.
Don’t forget to stop the EMR cluster!
Wait a minute, but there is no stop button (only terminate).
If we terminate the cluster, then all the data will vanish.
That’s why we need to use S3 as our persistent storage.
In the next post, we’ll find out how we can use S3.
See you later!