
Tuesday, May 07, 2013

Hadoop Beginner's Guide

Hadoop Beginner's Guide by Garry Turkington
ISBN: 1849517304

Hadoop Beginner's Guide is, as the title suggests, a new introductory book on the Hadoop ecosystem.  It shows how to get up and running with the core components of Hadoop (MapReduce and HDFS), covers higher-level tools like Hive and integration tools like Sqoop and Flume, and provides some good starting information on operational issues with Hadoop.  This is not an exhaustive reference like Hadoop: The Definitive Guide, and for a beginner, that's probably a good thing.  (In my day, we only had The Definitive Guide, and we liked it!)

Most of the topics are covered in a "dive right in" format.  After a brief introduction to the topic, the author provides a list of commands or a block of code and invites you to run it.  This is followed by a "What just happened?" section that explains the details of the operation or code.  Personally, I don't care for that format too much, because the explanation is sometimes separated from the code by several pages, which was a real hassle when reading this as a PDF.  But maybe that's just me.

As I mentioned, the book includes a couple of chapters on operations, which I found to be a nice addition to a beginner's book.  Some of these operational details were explained by hands-on experiments like shutting down processes or nodes, in which case "What just happened?" is more like "What just broke?"  The operational scenarios are by no means exhaustive (that's what you learn from production), but they provide the reader with some "real life" experience gained in a low-risk environment.  And, they introduce a powerful method to learn more operational details: set up an experiment and find out what happens.  Learning to learn is the most valuable thing you can gain from any book, class, or seminar.

Another nice feature of this book that I haven't seen in others is that the author includes examples on Amazon EC2 and Elastic MapReduce (EMR).  There are examples of both MapReduce and Hive jobs on EMR.  He doesn't repeat every example in both "raw" Hadoop and EMR, because once you know the basics of EMR, the same principles apply to both.

I do have some complaints about the book, but many of them are nit-picks or matters of personal style.  That said, the biggest thing this book would benefit from is some very detailed "technical editing."  By that I mean there are technical details that got corrupted during the book production process.  For example, the hadoop command is often rendered as Hadoop in examples.  There are plenty of similar formatting and typographic errors.  Of course, an experienced Hadoop user wouldn't be tripped up by these, but this is a "beginner's guide," and such details can cause tremendous pain and suffering for newbies.

To wrap things up, Hadoop Beginner's Guide is a pretty good introduction to the Hadoop ecosystem.  I'd recommend it to anyone just starting out with Hadoop before moving on to something more reference-oriented like The Definitive Guide.

enjoy,
Charles.




FTC disclaimer: I received a free review copy of this book from DZone.  The links to Amazon above contain my Amazon Associates tag.

Friday, November 05, 2010

A Fix for "Exceeded MAX_FAILED_UNIQUE_FETCHES" in Hadoop

In a project I'm currently working on, we're moving a bunch of our back-end processing to Hadoop.  We started a two-node cluster: one master, one slave.  That seemed to work fine.  Then, we went to four nodes, and about the same time I was testing out a new Hadoop job.  The (single) reducer was hanging with this somewhat cryptic message:
Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out

I went out to the slave node and looked through the job logs, and I could see that it was timing out trying to transfer data from one of the other slave nodes - an upstream mapper node.  Upon closer scrutiny of the log file, I realized that Hadoop was trying to transfer from the other slave's public IP address, which is behind a firewall that blocks public access.

Key take-away number one: when you're just starting out with Hadoop, if you only have one slave, you've only demonstrated one real communication path: master-to-slave.  Your cluster isn't doing any slave-to-slave transfers because everything is on the one slave.  Also, our initial job had no reducer, so it ran fine on the new four-node cluster because it still involved only master-slave communication.

For some reason, the mapper slave was advertising the location of its map output data via its public IP address.  My first attempt at fixing this problem involved the dfs.datanode.dns.interface configuration parameter (and its mapred equivalent).  This tells Hadoop that when a process (mapred or dfs) needs to figure out its host name, it should use the IP address associated with the given interface.  (You could even have dfs and mapred using separate interfaces for additional throughput.)
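For illustration, a minimal sketch of that setting in hdfs-site.xml looks something like the following (eth1 is just a placeholder for whichever interface carries your private cluster traffic; the mapred daemons have an analogous dns.interface setting):

    <!-- hdfs-site.xml: a sketch, not our actual config.  "eth1" is a
         placeholder for the interface carrying private cluster traffic. -->
    <property>
      <name>dfs.datanode.dns.interface</name>
      <value>eth1</value>
    </property>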

This approach failed for me because I had one interface with two addresses, not two interfaces.  I dug through the Hadoop DNS code (org.apache.hadoop.net.DNS - God, I love open source: you can just look for yourself) and saw that if there is only one interface, the code loops through its IP addresses, performs a reverse DNS lookup on each, and takes the first successful result.  I was fortunate in that the private IP address was coming up first in that enumeration of the IPs on the interface, but it still wasn't working.  I talked to our system admin/configuration guru, and it turns out that our hosting provider doesn't provide reverse DNS for those private IP addresses.  We could have set up our own DNS server just for these reverse lookups, but there was a brute-force option available to us.
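To see that "first reverse lookup that succeeds wins" behavior for yourself, here is a small standalone sketch of my own (not the actual org.apache.hadoop.net.DNS code): it walks the addresses bound to one interface, tries a reverse lookup on each, and prints the first one that resolves - handy for checking what a lookup-based approach would pick on a given slave:

    import java.net.InetAddress;
    import java.net.NetworkInterface;
    import java.util.Collections;

    // Sketch only: loop over an interface's addresses and take the first
    // one whose reverse DNS lookup succeeds.  The interface name is an
    // argument (defaults to "eth0", a placeholder).
    public class ReverseLookupSketch {
        public static void main(String[] args) throws Exception {
            String ifaceName = (args.length > 0) ? args[0] : "eth0";
            NetworkInterface iface = NetworkInterface.getByName(ifaceName);
            if (iface == null) {
                System.err.println("no such interface: " + ifaceName);
                return;
            }
            for (InetAddress addr : Collections.list(iface.getInetAddresses())) {
                // getCanonicalHostName() falls back to the literal IP string
                // when the reverse lookup fails, so "success" means it
                // returned something other than the address itself.
                String name = addr.getCanonicalHostName();
                if (!name.equals(addr.getHostAddress())) {
                    System.out.println(addr.getHostAddress() + " -> " + name);
                    return;
                }
            }
            System.out.println("no address on " + ifaceName + " reverse-resolves");
        }
    }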

You can bypass all of Hadoop's efforts to automatically figure out the slave's host name by specifying the slave.host.name parameter in the configuration files.  If that is set, Hadoop will just take your word for it and use the name you provide.  Now, in theory, this might be onerous - it means you have a different configuration file per slave.  However, our cluster is configured and maintained via Puppet, so our puppet master just tweaked his Puppet script, and we never looked back.
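For illustration only, the per-slave entry is a one-liner along these lines (the host name shown is made up; in our case Puppet templated the real value for each node):

    <!-- Per-slave site configuration: pin the host name this node
         advertises.  The value below is a made-up example. -->
    <property>
      <name>slave.host.name</name>
      <value>slave03.cluster.internal</value>
    </property>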

Take-away number two: Exceeded MAX_FAILED_UNIQUE_FETCHES could mean a simple connectivity problem. I'm sure there are other possible causes, but an inability to connect between slaves is comparatively simple to troubleshoot.

enjoy,
Charles.