Friday, November 05, 2010

A Fix for "Exceeded MAX_FAILED_UNIQUE_FETCHES" in Hadoop

In a project I'm currently working on, we're moving a bunch of our back-end processing to Hadoop.  We started a two-node cluster: one master, one slave.  That seemed to work fine.  Then, we went to four nodes, and about the same time I was testing out a new Hadoop job.  The (single) reducer was hanging with this somewhat cryptic message:
Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out

I went out to the slave node and looked through the job logs, and I could see that it was timing out trying to transfer data from one of the other slave nodes - an upstream mapper node. Upon closer scrutiny of the the log file I realized that Hadoop was trying to transfer from the other slave's public IP address, which is behind a firewall that blocks public access.

Key take-away number one: when you're just starting out with Hadoop, if you only have one slave, you've only demonstrated one real communication path: master-to-slave. Your cluster isn't doing any slave-to-slave transfers because everything was on the one slave. Also, our initial job had no reducer, so it ran fine on the new, 4-node cluster because it was still only master-slave communication.

For some reason, the mapper slave was advertising the location of the map output data via its public IP address. My first attempt at fixing this problem involved the dfs.datanode.dns.interface configuration parameter (and it's mapred equivalent). This tells Hadoop that when a process (mapred or dfs) wants to figure out it's host name, use the IP address associated with the given interface. (You could even have dfs and mapred using separate interfaces for additional throughput.)

This failed for me because I had one interface with two addresses, not two interfaces. I dug through the Hadoop DNS code (org.apache.hadoop.net.DNS - God, I love open-source: you can just look for yourself) and saw that if there is one interface, the code loops through the IP addresses and performs reverse DNS lookups and takes the first successful result. I was fortunate in that the private IP address was coming up first in that enumeration of the IPs on the interface, but it still wasn't working. I talked to our system admin/configuration guru. It turns out that our hosting provider doesn't provide reverse DNS for those private IP addresses. We could have set up our own DNS server for just these reverse lookups, but there was a brute-force option available to us.

You can bypass all of Hadoop's efforts to automatically figure out the slave's host name by specifying the slave.host.name parameter in the configuration files. If that is set, Hadoop will just take your word for it and use the name you provide. Now, in theory, this might be onerous - it means you have a different configuration file per-slave. However, our cluster is configured and maintained via Puppet. So, our puppet master just tweaked his Puppet script, and we never looked back.

Take-away number two: Exceeded MAX_FAILED_UNIQUE_FETCHES could mean a simple connectivity problem. I'm sure there are other possible causes, but an inability to connect between slaves is comparatively simple to troubleshoot.

enjoy,
Charles.

Tuesday, April 06, 2010

Django vs. Grails

When I came up with the Five Technologies in Five Weeks project, I hadn't intended to compare any of the technologies to each other directly.  My original ordering for the technologies didn't have any head-to-head match-ups, but I just finished my Grails project, which was "right after" (modulo interruptions) my Django/GAE project.  And, while I was working on Grails, I kept comparing it to Django, and I felt compelled to write up my observations.

"All things being equal, which they never are" (Manager Tools), I'd choose Django over Grails for a new, green-field project.  However, given the constraint to run in a Java environment, I'd gladly choose Grails over the other Java/J2EE frameworks I'm familiar with.  (Django on Jython would be a contender as would Wicket, but I don't have hands-on experience with either.)  This is not a slam on Grails.

Here are some of the reasons I prefer Django over Grails:
  • Development Speed (not runtime performance): running (a trivial number of) unit tests for Grails on my machine took about 15 seconds (with Folding@Home in the background).  Django unit tests typically only take me a couple of seconds (for a trivial number of tests).  Integration tests in Grails were 30-40 seconds; Django might be 5 seconds.  Although Grails will reload the running webapp when you make a change, sometimes things are really buggered, which requires a full restart.  That seemed to happen more often in Grails than Django, and restarts take ~20 seconds vs. 2 seconds.
    Compared to the 5-7 minutes I've seen WebSphere to restart an application, Grails is blazingly fast.  But the 15-30 seconds that many operations take in Grails is long enough for the mind to wander, which is not good.
    Update: a commenter pointed out the (under-documented) interactive mode of Grails.  That speeds the edit-test-edit loop considerably.
  • Voodoo: both Grails and Django do some voodoo behind the scenes to reduce the amount of chimp programming (e.g., non-DRY boilerplate) the programmer has to do.  This voodoo involves topics of the high priests like meta object programming that mere mortals typically don't have to worry about.  Maybe it's just that I don't have as much "time behind the wheel" with Grails, but I feel like its voodoo leaks out a bit and is too voodoo-rific.  A Grails programmer has to be aware of which methods are dynamically added to a class in order to separate unit tests (no voodoo) from integration tests (full voodoo). And, although I like the power of dynamic finder methods like User.findByName(), it kinda bugs me that the code for that doesn't exist somewhere that I can see (you know, on punch cards!)  You don't see the dynamic methods on a Django class, but there are a lot fewer of them, so it seems less voodoo-rific.  Again, maybe more time behind the wheel of Grails would make me feel more comfortable.
    (As an aside, when I taught Python and Django in Spring 2009, the students didn't even notice the voodoo of fields on models in Django until I pointed it out to them.  Then, I exploded their heads with MOP.)
  • View Technology: I have a long-standing personal preference for templates instead of ASP/JSP-like mark-up languages, and I'd lump Grails GSP pages in the latter category.  Back in the day, Jason Hunter wrote an essay called The Problems with JSP that really stuck with me.  GSPs are much, much better than the Model 1 JSP pages that Hunter talks about, but they still feel similar enough to make me think about using something like Velocity with Grails.  Django kicks templates up a notch by having inheritance with templates, which I really love. (Even if Django didn't invent template inheritance.)
  • Dynamic Language Issues: This is a real unfair comparison because I've been using Python for 15 years, but I felt that errors in Grails/Groovy were more cryptic and hard to find compared to Python.  If I typo the named parameter to a method in Python, it lets me know immediately.  More than once in Grails I misspelled a named parameter to a constructor, which failed silently and then lead to a validation failure later when I went to save the object.  Some of that is from lack of experience with Groovy/Grails - i.e., learning what error messages really mean.  But still, Python seems to fail in a more helpful way.  (I'd say Python fails gracefully, but if anything it's the opposite: very loud - "you dumb-ass, there is no parameter called recipent" when I misspelled recipient.)
I don't mean for this to be a hit-piece against Grails; I really like Grails.  I look forward to using it some more.  It's just that, for what it's worth, I like Django more.

5Tech - Week 2: Grails

For the second project and technology in five weeks I chose Grails.  I was very curious to try Grails on on a "real"/non-tutorial project to confirm its usability and productivity.  On my last contract gig, I was in a non-development role (configuration management) on a project that was using JSF 1.1/Spring/Hibernate and WebSphere, and their development looked really painful by my standards.  For example, they'd spend 5-7 minutes to redeploy the app just to inspect some HTML change in the JSF page.  I was looking to Grails to provide a much more productive environment.

The project involved creating a system to process events and route them to users.  It was inspired by a code base that I worked on for a previous client.  (This was done with their permission.)  They have a large code base that includes "application functionality" like this event routing, as well as a lot of "technical functionality" that they rolled themselves years ago before modern tools like Spring and Hibernate were created and became mainstream.  In a way, this was a small prototype to research the feasibility of reimplementing the application functionality on a more modern platform - Grails.

Analysis
All in all the project went very smoothly. Because Grails is comparatively mature, there exists a fair amount of documentation, including numerous books.  I leaned heavily on Grails in Action by Smith and Ledbrook.  I was able to follow their examples and adapt them to my application easily.  There just weren't any serious gaps or surprises.

Using Grails to create domain classes (database entities) was a breeze.  It's so nice to just declare the fields and their constraints and have Grails "do the right thing."  You don't have to bother with annotations, let alone XML configuration files.  Creating controllers to process web requests is trivial, and being able to scaffold them to get the basic CRUD functionality in place immediately was very conducive to high productivity.

I also liked the idea that services are a first-class concept in Grails (along with controllers and domain classes), which makes it very easy to program with them.   Grails wires them into the classes that use the services via Spring, but again, you don't have to monkey with Spring's applicationContext.xml file.  Finally, adding the REST interface for incoming events was almost trivial, especially since I already had the service in place.

Grails (like Rails) treats testing as a core concept, not something you wave your hands at after the fact.  It has the concept of unit and integration tests, and there are a number of functional test plug-ins, too.  I did a fair amount of unit and integration testing, which was a real live-saver.  Due to the very dynamic nature of Groovy, many of my typos were not caught in the compile phase, but exercising the code in tests did catch those.  I had very few issues when I ran the actual web application.

A lot of developers are unclear on the difference between unit testing and integration testing.  For better or worse, Gails makes some clear distinctions.  A lot of the Grails voodoo (e.g., dependency injection and adding dynamic methods to classes) is not available in unit tests, or you have to add it yourself via mocking or manual injection.  Thus, there is a simple distinction: if it runs under "test-app -unit", it's a unit test, otherwise it's an integration test.  This bit me once early on when I wrote a test that needed that higher-order functionality, but I put it in a unit test.  First I got a null pointer exception because the service hadn't been injected, then I got a missing method exception because the save method on the domain object hadn't been added dynamically. However, the fix was simple enough - move the test to the integration folder, and it ran fine with -integration.  Then, I copied it back, added a few mocks (very easy in Grails), and I had a unit test, too.

In terms of "project management," this step of the Five Technologies suffered from some "life happens" distractions.  Rather than running Monday through Friday, it ended up being Tuesday through Monday, and some of those days were a bit short-changed.  When I was on task, the Pomodoro Technique continued to be effective, and I replaced my dead watch/timer with a dedicated Android application called Pomodoro Tasks - it's even open-source.

In conclusion, Grails was another success for the Five Technologies in Five Weeks project.  I got the application functionality done that I expected.  Grails was very usable and productive - no nasty surprises.   What's next?  Probably, NetBeans Platform because I have documentation for it already whereas I'm waiting on some Android books.

Wednesday, March 24, 2010

5Tech - Week 1: Google App Engine

For the first of the five technologies in five weeks, I picked something easy - Google App Engine using Python and Django. As someone who's been using Python for 15 years, there was no language learning curve. And using the Django helper for app engine package allowed me to leverage my Django experience. So with a minimal learning curve, the results were basically all good.

The project involved creating a simple application to monitor web sites to check if they are up or down and notify the user about status changes. As such, the core of the application isn't even a web app. In fact, I've implemented the same thing a couple of different times as a standalone program in Python or Java. However, to run a standalone application like that requires a server where it can run, and that's not something I've always had access to. GAE's cron service provides the ability to run checks periodically - just like the main loop in my standalone applications, and GAE provides a large number of notification options, although I only used email initially. The application does have a web interface for configuring checks and viewing the status

Project Analysis

My biggest problem developing the application was the fact that I was working with three separate but overlapping tools/frameworks: GAE, Django, and the Django helper. These all worked fine, but when I wanted to do some task, I didn't know where to look for the "right" way to do it. After wasting a bunch of time searching for things only to find that the Django way was the right way, I just quit asking and adopted the "try Django first" strategy. And then, the first time I went to apply that I got burned - there are some slight differences in how the filter method is handled in Django helper versus "native" Django. But for the most part, Django first is the way to go. In addition to the online documentation for these three tools/frameworks , I used Programming Google App Engine by Sanderson, which covers all of them - I highly recommend it.

Although my plan with Five Technologies is to use best practices, I am embarrassed to admit that I didn't practice much test driven development on this project despite having "home field advantage" with the technology - Django. Some of that is due to the exploratory nature of the programming - just figuring out which end is up, and some of it was the confusion over how to do tests - as noted above, the answer is "the Django way". Also, I found a bug in the app engine helper when loading a fixture, which hung me up - I'll be submitting a patch shortly.

One (non-technical) practice that I adopted was the Pomodoro Technique, and that worked pretty well. The 25 minute mini-sprints were really nice to contain technology-induced ADD. However, the watch I was using as my timer died during a pomodoro which lead to a very long and productive pomodoro.

In terms of the bigger goals of Five Technologies, I have another confession: I did not restrict myself to one week. I created version 0.1 and deployed it to the cloud within one week, but the week following this project I went to the Java Posse Roundup, and I kept working on the Django app, even though I should have been focusing on Java. It's just that after a week of working on GAE, I had built up some good momentum and didn't want to quit. I fear that this will be a real problem for the next projects because I won't have the luxury of continuing to work on whatever technology when I begin the next project.

In conclusion, the first technology experiment was a success, even if I wasn't dogmatic. What's next? Most likely Grails. Stay tuned.

Update:  I forgot to mention something cool I learned about - schema migration, or the lack thereof.  Before I began the project I was fretting about schema migration on GAE because I've been too lazy to learn something like South, and therefore I do schema migrations at a SQL command prompt.  Obviously, there is no SQL problem for a NoSQL database like Google's BigTable.  Then I forgot about the issue, but half way through I looked up and said, "hey, I haven't been doing any migrations, but this all works."  Duh! -  like many NoSQL databases, BigTable is schema-less, so there is no schema to migrate.  Problem solved.  OK, the application has to be prepared to do with an attribute on a record/row that isn't present, but that code is basically the same as dealing with a NULL value that you might specify as the default value when you issue an ALTER TABLE to add a column.  Also, you can still imagine scenarios where you might still have to do some sort of schema/data conversion, but without even consciously thinking about it, I managed to avoid those.  That was cool.

Monday, March 08, 2010

Five Technologies in Five Weeks

I am currently between consulting jobs, and during the down time, I have embarked on a project to learn five new (to me) technologies in five weeks. The reasons for doing this include:
  • Learn new things - this project is a variant on the "learn one new language a year" meme that's been going around. I'm just taking on five things (not necessarily languages) in much less than a year.
  • Bust some code - it's been a while since I've been able to do any hard-core coding. This will be a sprint which can blow out some cobwebs.
  • Improve my development practices by trying some new techniques and focusing on refining existing ones.
The technologies I plan on tackling are (subject to change):
I chose these technologies because I know of them but haven't actually built anything with them. Also, they are technologies that I'm interested in testing out to see how usable/effective they are and if I should pursue them further.

In a way, each week's technology is a bit like a "spike" in an agile project, only I'm not looking to evaluate/sketch out solutions to application problems but rather evaluate technologies in a more abstract sense. Although, in some cases (e.g., NetBeans Platform and Griffon) I have some application ideas I'd like to implement, and I really am spiking possible solutions.

For each technology, I've got a modest application in mind. I have (or will have) a series of story cards describing various aspects of the application I'd like to create. And then, I will sprint for a week to implement as many of the stories as possible. I'll also post at least one blog entry as a retrospective for each sprint.

I've been planning this for some time - ever since the end was in sight on my last contract. The most significant threat to this undertaking is not failing at one or more technologies, but rather if I find another contract before I've completed the technologies. (There are worse things than finding a paying job when you're currently between jobs.) Another known disruption to the "five weeks" is that I'm taking one week off to go to the Java Posse Roundup, which will technically make this five technologies in six weeks, but that isn't as snappy as five in five.

Thursday, March 04, 2010

Hiring is Only for Managers?

A friend of mine told me that he just met the new guy on their team. I thought it was odd that he was just meeting a new team member, so I asked if he wasn't around during the interviews, and he told me that the developers on the team never interview candidates - only managers do that. As near as I can tell, they do this to minimize the time required during interviewing. This is just wrong.

In the words of Manager Tools, "hiring is the most important job that a manager does" because, in part, failures pull down the whole team for months or years. And, team-fit is a crucial part of that interviewing process. All things being equal (which they never are), it's better to hire a technical 8 who has a 10 personality than it is to hire a technical 10 who was only an 8 (or less) personality. If nothing else, you can teach technical stuff, but you can't teach personality.

Dave Ramsey describes his lengthy, multi-round interview process that even includes dinner with spouses to ensure team (in the largest sense) fit. As lengthy as the process is, he points out that fixing a hiring mistake costs much more than the added time of proper interviews.

I'd even argue that this manager-only interviewing process produces shortcoming in the technical area, too, because the manager works off of a superficial checklist that s/he has to get through quickly in the interview. Thus, if a candidate is asked "do you know web framework X," and the answer is "no, but I know frameworks A through F, and I wrote a framework I call G," that candidate is treated the same as someone straight out of school who doesn't know any frameworks. This narrow-mindedness leads to hires that know framework X, but they store passwords as plain-text because no one told them not to, and none of them knows how to use MD5 to store a password (another story from my friend). These are what Erik Sink calls programmers not developers - you want (well-rounded) developers.

So, with apologies to Georges Clemenceau, interviewing is too important to be left exclusively to the managers.

Thursday, February 11, 2010

JIRA Workflow Transition Displaying Wrong Text

I've been doing some JIRA administration and customization for a client, and I ran into a problem that was driving me crazy. I created a new workflow to deal with an Issue Type called Change Request by copying an exsiting workflow. I renamed one of the transitions from "Close Issue" to "Decline CR," and I associated a screen with the transition (the reason for creating a new workflow).

The name of the transition ("Decline CR") displayed correctly in the administration page, but in the regular display page of a Change Request issue, the name of the workflow action was still the old value ("Close Issue") from the original workflow. I tried a few things to fix it, but nothing seemed to help. I even tried changing the workflow scheme and changing it back. Still nothing.

This morning while fishing around in the administration pages, I found the problem. The transition included an internationalization property, jira.i18n.title, that specifies what appears to be the title of the workflow action, closeissue.title. I picked up that property when I cloned the workflow - the ultimate origin of the workflow was the JIRA's default workflow, which was internationalized. Removing the property got the workflow action name ("Decline CR") to display correctly. (I'm not working in a multi-language environment, so I don't need any localization.)