Debugging Hadoop 2.x on Amazon EMR

Hadoop 2.x replaces the JobTracker web UI of Hadoop 1.x with a more detailed ResourceManager interface. I had previously browsed the simpler JobTracker UI using lynx on the master node, so finding things in the new interface took a bit of experimentation.

Proxy Settings

Open a SOCKS proxy connection to the master node using the AWS CLI:

aws emr socks --cluster-id j-1234567890ABC --key-pair-file ssh_key.pem
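
Before touching the browser settings, it can help to confirm that the tunnel is actually up. The following is a minimal sketch, assuming the proxy listens on local port 8157 (the port used in the FoxyProxy configuration in this post); the function name `proxy_is_listening` is made up for illustration.

```python
import socket

def proxy_is_listening(host="localhost", port=8157, timeout=2.0):
    """Return True if something accepts TCP connections on host:port.

    Assumption: the SOCKS tunnel listens on localhost:8157. This only
    checks that the port is open, not that it speaks SOCKS.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False
```

Calling `proxy_is_listening()` while the `aws emr socks` command is running should return True; if it returns False, the tunnel is probably not open yet or is bound to a different port.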

Then configure your browser to use the proxy for connections to the EMR nodes. I use the following FoxyProxy configuration template, based on the example in the EMR docs, for accessing the Hadoop web interfaces. The first three URL patterns come from EMR's suggested proxy settings and cover basic browsing of pages hosted on the EMR nodes. Internal links frequently end up pointing at an ec2.internal domain, so I've added a fourth rule for *ec2.internal:*.

<?xml version="1.0" encoding="UTF-8"?>
<foxyproxy>
    <proxies>
        <proxy name="emr-socks-proxy" id="2322596116" notes="" fromSubscription="false" enabled="true" mode="manual" selectedTabIndex="2" lastresort="false" animatedIcons="true" includeInCycle="true" color="#0055E5" proxyDNS="true" noInternalIPs="false" autoconfMode="pac" clearCacheBeforeUse="false" disableCache="false" clearCookiesBeforeUse="false" rejectCookies="false">
            <matches>
                <match enabled="true" name="*ec2*.amazonaws.com*" pattern="*ec2*.amazonaws.com*" isRegEx="false" isBlackList="false" isMultiLine="false" caseSensitive="false" fromSubscription="false" />
                <match enabled="true" name="*ec2*.compute*" pattern="*ec2*.compute*" isRegEx="false" isBlackList="false" isMultiLine="false" caseSensitive="false" fromSubscription="false" />
                <match enabled="true" name="10.*" pattern="http://10.*" isRegEx="false" isBlackList="false" isMultiLine="false" caseSensitive="false" fromSubscription="false" />
                <match enabled="true" name="ec2.internal" pattern="*ec2.internal:*" isRegEx="false" isBlackList="false" isMultiLine="false" caseSensitive="false" fromSubscription="false" />
            </matches>
            <manualconf host="localhost" port="8157" socksversion="5" isSocks="true" username="" password="" domain="" />
        </proxy>
    </proxies>
</foxyproxy>
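
To see which URLs these four patterns would send through the proxy, here is a rough sketch that approximates FoxyProxy's wildcard matching with Python's `fnmatch`. This is an approximation rather than FoxyProxy's own matcher, and the hostnames in the usage note are hypothetical.

```python
from fnmatch import fnmatchcase

# Wildcard patterns copied from the FoxyProxy configuration above.
PATTERNS = [
    "*ec2*.amazonaws.com*",
    "*ec2*.compute*",
    "http://10.*",
    "*ec2.internal:*",
]

def uses_proxy(url, patterns=PATTERNS):
    """Return True if any pattern matches the URL.

    Assumption: lowercasing both sides stands in for the
    caseSensitive="false" setting in the configuration.
    """
    lowered = url.lower()
    return any(fnmatchcase(lowered, p.lower()) for p in patterns)
```

With these rules, `uses_proxy("http://ip-10-0-0-1.ec2.internal:9026/cluster")` is True, while the same hypothetical host without an explicit port does not match, because the added pattern requires a colon after `ec2.internal`.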

Useful Hadoop Interface Pages

ResourceManager

ec2-*.amazonaws.com:9026

Port 9026 on the master's public DNS serves the Hadoop ResourceManager, which is a good entry point to the details of running and past MapReduce jobs.

MR Jobs

To get more information about a specific job, select its application from the ResourceManager's list. The logs generated by that MapReduce job are available from this page.

Clicking the link in the "Tracker URL" field (labeled "History" for completed jobs and "Application Master" for running jobs) leads to pages with more detail.

Click on a job in the Application Master to get stats on the pending, running, and completed mappers and reducers. There are also failure listings, from which you can reach the error messages and logs of the failed mappers or reducers.
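
The same application list is also exposed as JSON by the YARN ResourceManager REST API under `/ws/v1/cluster/apps`. The sketch below assumes that on EMR this API is served on the same port 9026 as the web UI, which I have not verified here; the sample response and the helper names are invented for illustration, though the field names follow the YARN documentation.

```python
import json

def apps_url(master_dns, port=9026):
    """Build the ResourceManager REST URL that lists applications.

    Assumption: the REST API shares port 9026 with the web UI on EMR.
    """
    return f"http://{master_dns}:{port}/ws/v1/cluster/apps"

def summarize_apps(payload):
    """Return (id, name, state, trackingUrl) tuples from a cluster/apps response."""
    # The "apps" value can be null when no applications are listed.
    apps = (payload.get("apps") or {}).get("app", [])
    return [
        (a.get("id"), a.get("name"), a.get("state"), a.get("trackingUrl"))
        for a in apps
    ]

# Hypothetical sample response, trimmed to the fields used above.
SAMPLE_RESPONSE = json.loads("""
{"apps": {"app": [
  {"id": "application_1_0001", "name": "wordcount",
   "state": "FINISHED", "trackingUrl": "http://example/history"}
]}}
""")
```

Fetching `apps_url(...)` through the SOCKS proxy and passing the decoded JSON to `summarize_apps` gives a quick text listing of the jobs and their tracker URLs, without clicking through the UI.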

Hive Logs

ec2-*.amazonaws.com:9026/logs/

From the ResourceManager, select "Local Logs" from the "Tools" menu to get a directory listing of logs from the master node. Logs from high-level jobs enqueued through the EMR API, such as Hive scripts, can be found here under /steps/ followed by their step ID.

HDFS Utilization

ec2-*.amazonaws.com:9101

The NameNode web interface offers useful stats on disk utilization. Many of the same metrics are also logged to CloudWatch, but the cluster itself offers extra detail while it is still running.
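
To save retyping, the pages described above can be collected in a small helper. This is only a sketch built from the ports and paths listed in this post; the function names and the `/logs/steps/<step-id>/` layout are my reading of the directory listing, so treat them as assumptions.

```python
def emr_pages(master_dns):
    """Return the Hadoop web interface URLs for a master node's public DNS."""
    base = f"http://{master_dns}"
    return {
        "resource_manager": f"{base}:9026",   # ResourceManager UI
        "local_logs": f"{base}:9026/logs/",   # master node log listing
        "namenode": f"{base}:9101",           # HDFS utilization
    }

def step_log_url(master_dns, step_id):
    """Build the log directory URL for one EMR step.

    Assumption: step logs sit under /logs/steps/<step-id>/.
    """
    return f"http://{master_dns}:9026/logs/steps/{step_id}/"
```

Pass in the master's public DNS name from the EMR console and open the returned URLs in the proxied browser.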
