Tuesday, May 07, 2013

Hadoop Beginner's Guide

Hadoop Beginner's Guide by Garry Turkington
ISBN: 1849517304

Hadoop Beginner's Guide is, as the title suggests, a new introduction to the Hadoop ecosystem.  It covers getting up and running with the core components of Hadoop (MapReduce and HDFS), some higher-level tools like Hive, and integration tools like Sqoop and Flume, and it also provides some good starting information on operational issues with Hadoop. This is not an exhaustive reference like Hadoop: The Definitive Guide, and for a beginner, that's probably a good thing.  (In my day, we only had The Definitive Guide, and we liked it!)

Most of the topics are covered in a "dive right in" format.  After a brief introduction to the topic, the author provides a list of commands or a block of code and invites you to run it.  This is followed by a "What just happened?" section that explains the details of the operation or code.  Personally, I don't care for that approach too much because the explanation is sometimes separated from the code by multiple pages, which was a real hassle reading this as a PDF.  But, maybe that's just me.

As I mentioned, the book includes a couple of chapters on operations, which I found to be a nice addition to a beginner's book.  Some of these operational details were explained by hands-on experiments like shutting down processes or nodes, in which case "What just happened?" is more like "What just broke?"  The operational scenarios are by no means exhaustive (that's what you learn from production), but they provide the reader with some "real life" experience gained in a low-risk environment.  And, they introduce a powerful method to learn more operational details: set up an experiment and find out what happens.  Learning to learn is the most valuable thing you can gain from any book, class, or seminar.

Another nice feature of this book that I haven't seen in others is that the author includes examples using Amazon EC2 and Elastic MapReduce (EMR).  There are examples of both MapReduce and Hive jobs on EMR.  He doesn't repeat every example on both "raw" Hadoop and EMR because, once you know the basics of EMR, the same principles apply to both.

I do have some complaints about the book, but many of them are nit-picking or personal style.  That said, I think the biggest thing this book would benefit from would be some very detailed "technical editing."  By that I mean there are technical details that got corrupted during the book production process.  For example, the hadoop command is often rendered as Hadoop in examples.  There are plenty of similar formatting and typographic errors. Of course, an experienced Hadoop user wouldn't be tripped up by these, but this is a "beginner's guide," and such details can cause tremendous pain and suffering for newbies.

To wrap things up, Hadoop Beginner's Guide is a pretty good introduction to the Hadoop ecosystem.  I'd recommend it to anyone just starting out with Hadoop before moving on to something more reference-oriented like The Definitive Guide.

enjoy,
Charles.




FTC disclaimer: I received a free review copy of this book from DZone.  The links to Amazon above contain my Amazon Associates tag.

Friday, October 14, 2011

Why is my Rails app calling Solr so often?

I work on the back-end of a Rails app that uses Solr via Sunspot. Looking at the solr logs, I could see the same item being added/indexed repeatedly, sometimes right before it was deleted from solr. I didn't write the code, but I was tasked with figuring it out.

Glancing at the main path of the code didn't show anything obvious. I figured the superfluous solr calls were happening via callbacks somewhere in the graph of objects related to my object in solr, but which one(s)?  Again, I didn't write the code, I just had to make it perform.

I hit on the idea of monkey-patching (for good, not evil) the Sunspot module.  Fortunately, most/all of the methods on the Sunspot module just forward the call onto the session object.  So, it's really easy to replace the original call with anything you want and still call the real Sunspot code, if that's what you want to do.

This is so easy to do that I even did it the first time in the rails console.  In that case, I was happy to abort the index operation when it first happened.  So, I whipped this up in a text file and pasted it into the console:

module Sunspot
  class <<self
    def index(*objects)
      raise "not gonna do it!"
    end
  end
end


Then, I invoked the destroy operation that was triggering the solr adds, got the stack trace, and could clearly see which dependent object was causing the index operation.

For another case, I needed to run a complex workflow in a script to trigger the offending solr operations. In that case, I wanted something automatically installed when the script started up, and I wanted something that didn't abort - all I wanted was a stack trace. So, I installed the monkey-patch in config/initializers/sunspot.rb and had a more involved index function:

    def index(*objects)
      puts "Indexing the following objects:"
      objects.each { |o| puts "#{o.class} - #{o.id}" }
      puts "From: =============="
      raise rescue puts $!.backtrace
      puts "==============\n"
      session.index(*objects)
    end


That last line is the body of the real version of the index method - like I said, trivial to re-implement; no alias chaining required.

Maybe there's some cooler way to figure this out, but this worked for me.

enjoy,
Charles.

Thursday, August 18, 2011

Rails/Rspec does not clean up model instances on MySQL

I recently solved a thorn in my side relating to some Rspec tests in our code base when running on my development machine using MySQL.  For some reason, some instances that were created using Factory Girl weren't getting cleaned up, which in turn would cause subsequent test runs to fail because of duplicate data.  So, I'd DELETE everything from the affected tables at the MySQL prompt.  I looked in the test.log file, and I could see the savepoints being issued before the objects were created, but they weren't getting removed at the end of the test.

I didn't have a lot of time to look into it, and I didn't know where to look - Rspec, Factory Girl, Rails?  So, in the short-term, I just added after(:each) blocks to destroy the objects.  And, I moved on.

Then, I was dumping schemas in MySQL using SHOW CREATE TABLE in order to analyze some tables and indexes, and I noticed the storage ENGINE flag on the tables.  I went back and looked at the tables in my test database that were giving me trouble, and, of course(?), they were MyISAM rather than InnoDB.  So, transaction rollback (used to clean up after tests) didn't work.

I changed the storage engine on those tables (ALTER TABLE t1 ENGINE = InnoDB), commented out the manual clean-up code, and voila!  Everything works now.  Pretty obvious in retrospect, but I didn't even know where to start looking in our stack.
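
If you suspect other tables have the same problem, a quick script can find and convert them all.  Here's a rough sketch (not code from our app) that assumes an ActiveRecord connection to the test database, e.g. run via rails runner:

# Find any MyISAM tables in the current database and switch them to InnoDB
# so transactional test cleanup actually rolls back.
tables = ActiveRecord::Base.connection.select_values(
  "SELECT table_name FROM information_schema.tables " +
  "WHERE table_schema = DATABASE() AND engine = 'MyISAM'")
tables.each do |table|
  puts "Converting #{table} to InnoDB"
  ActiveRecord::Base.connection.execute("ALTER TABLE #{table} ENGINE = InnoDB")
end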

I hope this helps some other poor souls, too.


Charles.

Tuesday, April 26, 2011

Freeing up phone space on Android

For the last couple of months my Motorola Droid running Android 2.2.2 has been complaining about being "low on space" for the phone, not the SD card.  I pruned some apps, but that didn't help much. Things really came to a head this morning when my phone was so low on memory that it was no longer downloading email.

I found this article to be quite helpful -
http://www.androidcentral.com/monthly-maintenance-keeping-things-speedy

For me, the two big ones were Messaging and the Browser cache.  I had a couple of threads in Messaging containing a number of pictures.  Once I saved the pictures off to the SD card, I purged the threads, which freed up ~20MB.  Clearing the browser cache freed another ~20MB, but that will probably evaporate again as the browser caches things.

Here's a minor whine about Android:  the SD card and phone storage settings page tells you how big your SD card is and how much space is remaining, but for phone storage it just says how much is left.  Without knowing how much I had to start with, it's hard to know if, say, 20MB is a lot or not.  As near as I can tell, Android seems to complain when the space is less than 25MB.

Update: I ran out of space again, and clearing the browser cache didn't help.  After bumping around some more, first I discovered that in "Manage Applications" the one and only menu option is to sort by size.  Doing that revealed that the new pig was the (post pay-wall version) New York Times application.  It was using over 60MB of data space in the Phone Storage area.  The app doesn't have a "clear cache" function, so I used the "Clear Data" button from within Manage Applications, and I was back in action.

enjoy,
Charles.

Friday, November 05, 2010

A Fix for "Exceeded MAX_FAILED_UNIQUE_FETCHES" in Hadoop

In a project I'm currently working on, we're moving a bunch of our back-end processing to Hadoop.  We started with a two-node cluster: one master, one slave.  That seemed to work fine.  Then, we went to four nodes, and at about the same time I was testing out a new Hadoop job.  The (single) reducer was hanging with this somewhat cryptic message:
Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out

I went out to the slave node and looked through the job logs, and I could see that it was timing out trying to transfer data from one of the other slave nodes - an upstream mapper node. Upon closer scrutiny of the log file, I realized that Hadoop was trying to transfer from the other slave's public IP address, which is behind a firewall that blocks public access.

Key take-away number one: when you're just starting out with Hadoop, if you only have one slave, you've only demonstrated one real communication path: master-to-slave. Your cluster isn't doing any slave-to-slave transfers because everything is on the one slave. Also, our initial job had no reducer, so it ran fine on the new, 4-node cluster because it was still only master-slave communication.

For some reason, the mapper slave was advertising the location of the map output data via its public IP address. My first attempt at fixing this problem involved the dfs.datanode.dns.interface configuration parameter (and its mapred equivalent). This tells Hadoop that when a process (mapred or dfs) wants to figure out its host name, it should use the IP address associated with the given interface. (You could even have dfs and mapred using separate interfaces for additional throughput.)

This failed for me because I had one interface with two addresses, not two interfaces. I dug through the Hadoop DNS code (org.apache.hadoop.net.DNS - God, I love open-source: you can just look for yourself) and saw that if there is one interface, the code loops through the IP addresses and performs reverse DNS lookups and takes the first successful result. I was fortunate in that the private IP address was coming up first in that enumeration of the IPs on the interface, but it still wasn't working. I talked to our system admin/configuration guru. It turns out that our hosting provider doesn't provide reverse DNS for those private IP addresses. We could have set up our own DNS server for just these reverse lookups, but there was a brute-force option available to us.

You can bypass all of Hadoop's efforts to automatically figure out the slave's host name by specifying the slave.host.name parameter in the configuration files. If that is set, Hadoop will just take your word for it and use the name you provide. Now, in theory, this might be onerous - it means you have a different configuration file per-slave. However, our cluster is configured and maintained via Puppet. So, our puppet master just tweaked his Puppet script, and we never looked back.
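
For reference, the override itself is just a property in that slave's Hadoop site configuration (shown in the standard Hadoop XML config format; the host name below is a placeholder for the node's private name, not our real setup):

<property>
  <name>slave.host.name</name>
  <value>slave1.internal.example.com</value>
</property>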

Take-away number two: Exceeded MAX_FAILED_UNIQUE_FETCHES could mean a simple connectivity problem. I'm sure there are other possible causes, but an inability to connect between slaves is comparatively simple to troubleshoot.

enjoy,
Charles.

Tuesday, April 06, 2010

Django vs. Grails

When I came up with the Five Technologies in Five Weeks project, I hadn't intended to compare any of the technologies to each other directly.  My original ordering for the technologies didn't have any head-to-head match-ups, but I just finished my Grails project, which was "right after" (modulo interruptions) my Django/GAE project.  And, while I was working on Grails, I kept comparing it to Django, and I felt compelled to write up my observations.

"All things being equal, which they never are" (Manager Tools), I'd choose Django over Grails for a new, green-field project.  However, given the constraint to run in a Java environment, I'd gladly choose Grails over the other Java/J2EE frameworks I'm familiar with.  (Django on Jython would be a contender as would Wicket, but I don't have hands-on experience with either.)  This is not a slam on Grails.

Here are some of the reasons I prefer Django over Grails:
  • Development Speed (not runtime performance): running (a trivial number of) unit tests for Grails on my machine took about 15 seconds (with Folding@Home in the background).  Django unit tests typically only take me a couple of seconds (for a trivial number of tests).  Integration tests in Grails were 30-40 seconds; Django might be 5 seconds.  Although Grails will reload the running webapp when you make a change, sometimes things are really buggered, which requires a full restart.  That seemed to happen more often in Grails than Django, and restarts take ~20 seconds vs. 2 seconds.
    Compared to the 5-7 minutes I've seen WebSphere take to restart an application, Grails is blazingly fast.  But the 15-30 seconds that many operations take in Grails is long enough for the mind to wander, which is not good.
    Update: a commenter pointed out the (under-documented) interactive mode of Grails.  That speeds the edit-test-edit loop considerably.
  • Voodoo: both Grails and Django do some voodoo behind the scenes to reduce the amount of chimp programming (e.g., non-DRY boilerplate) the programmer has to do.  This voodoo involves topics of the high priests like meta object programming that mere mortals typically don't have to worry about.  Maybe it's just that I don't have as much "time behind the wheel" with Grails, but I feel like its voodoo leaks out a bit and is too voodoo-rific.  A Grails programmer has to be aware of which methods are dynamically added to a class in order to separate unit tests (no voodoo) from integration tests (full voodoo). And, although I like the power of dynamic finder methods like User.findByName(), it kinda bugs me that the code for that doesn't exist somewhere that I can see (you know, on punch cards!)  You don't see the dynamic methods on a Django class, but there are a lot fewer of them, so it seems less voodoo-rific.  Again, maybe more time behind the wheel of Grails would make me feel more comfortable.
    (As an aside, when I taught Python and Django in Spring 2009, the students didn't even notice the voodoo of fields on models in Django until I pointed it out to them.  Then, I exploded their heads with MOP.)
  • View Technology: I have a long-standing personal preference for templates instead of ASP/JSP-like mark-up languages, and I'd lump Grails GSP pages in the latter category.  Back in the day, Jason Hunter wrote an essay called The Problems with JSP that really stuck with me.  GSPs are much, much better than the Model 1 JSP pages that Hunter talks about, but they still feel similar enough to make me think about using something like Velocity with Grails.  Django kicks templates up a notch by having inheritance with templates, which I really love. (Even if Django didn't invent template inheritance.)
  • Dynamic Language Issues: This is a really unfair comparison because I've been using Python for 15 years, but I felt that errors in Grails/Groovy were more cryptic and hard to find compared to Python.  If I typo the named parameter to a method in Python, it lets me know immediately.  More than once in Grails I misspelled a named parameter to a constructor, which failed silently and then led to a validation failure later when I went to save the object.  Some of that is from lack of experience with Groovy/Grails - i.e., learning what error messages really mean.  But still, Python seems to fail in a more helpful way.  (I'd say Python fails gracefully, but if anything it's the opposite: very loud - "you dumb-ass, there is no parameter called recipent" when I misspelled recipient.)
I don't mean for this to be a hit-piece against Grails; I really like Grails.  I look forward to using it some more.  It's just that, for what it's worth, I like Django more.

5Tech - Week 2: Grails

For the second project and technology in five weeks I chose Grails.  I was very curious to try Grails on a "real"/non-tutorial project to confirm its usability and productivity.  On my last contract gig, I was in a non-development role (configuration management) on a project that was using JSF 1.1/Spring/Hibernate and WebSphere, and their development looked really painful by my standards.  For example, they'd spend 5-7 minutes to redeploy the app just to inspect some HTML change in the JSF page.  I was looking to Grails to provide a much more productive environment.

The project involved creating a system to process events and route them to users.  It was inspired by a code base that I worked on for a previous client.  (This was done with their permission.)  They have a large code base that includes "application functionality" like this event routing, as well as a lot of "technical functionality" that they rolled themselves years ago before modern tools like Spring and Hibernate were created and became mainstream.  In a way, this was a small prototype to research the feasibility of reimplementing the application functionality on a more modern platform - Grails.

Analysis
All in all the project went very smoothly. Because Grails is comparatively mature, there exists a fair amount of documentation, including numerous books.  I leaned heavily on Grails in Action by Smith and Ledbrook.  I was able to follow their examples and adapt them to my application easily.  There just weren't any serious gaps or surprises.

Using Grails to create domain classes (database entities) was a breeze.  It's so nice to just declare the fields and their constraints and have Grails "do the right thing."  You don't have to bother with annotations, let alone XML configuration files.  Creating controllers to process web requests is trivial, and being able to scaffold them to get the basic CRUD functionality in place immediately was very conducive to high productivity.

I also liked the idea that services are a first-class concept in Grails (along with controllers and domain classes), which makes it very easy to program with them.   Grails wires them into the classes that use the services via Spring, but again, you don't have to monkey with Spring's applicationContext.xml file.  Finally, adding the REST interface for incoming events was almost trivial, especially since I already had the service in place.

Grails (like Rails) treats testing as a core concept, not something you wave your hands at after the fact.  It has the concept of unit and integration tests, and there are a number of functional test plug-ins, too.  I did a fair amount of unit and integration testing, which was a real life-saver.  Due to the very dynamic nature of Groovy, many of my typos were not caught in the compile phase, but exercising the code in tests did catch those.  I had very few issues when I ran the actual web application.

A lot of developers are unclear on the difference between unit testing and integration testing.  For better or worse, Grails makes some clear distinctions.  A lot of the Grails voodoo (e.g., dependency injection and adding dynamic methods to classes) is not available in unit tests, or you have to add it yourself via mocking or manual injection.  Thus, there is a simple distinction: if it runs under "test-app -unit", it's a unit test; otherwise, it's an integration test.  This bit me once early on when I wrote a test that needed that higher-order functionality, but I put it in a unit test.  First I got a null pointer exception because the service hadn't been injected, then I got a missing method exception because the save method on the domain object hadn't been added dynamically.  However, the fix was simple enough - move the test to the integration folder, and it ran fine with -integration.  Then, I copied it back, added a few mocks (very easy in Grails), and I had a unit test, too.

In terms of "project management," this step of the Five Technologies suffered from some "life happens" distractions.  Rather than running Monday through Friday, it ended up being Tuesday through Monday, and some of those days were a bit short-changed.  When I was on task, the Pomodoro Technique continued to be effective, and I replaced my dead watch/timer with a dedicated Android application called Pomodoro Tasks - it's even open-source.

In conclusion, Grails was another success for the Five Technologies in Five Weeks project.  I got the application functionality done that I expected.  Grails was very usable and productive - no nasty surprises.   What's next?  Probably, NetBeans Platform because I have documentation for it already whereas I'm waiting on some Android books.

Wednesday, March 24, 2010

5Tech - Week 1: Google App Engine

For the first of the five technologies in five weeks, I picked something easy - Google App Engine using Python and Django. As someone who's been using Python for 15 years, I had no language learning curve. And using the Django helper package for App Engine allowed me to leverage my Django experience. So with a minimal learning curve, the results were basically all good.

The project involved creating a simple application to monitor web sites to check if they are up or down and notify the user about status changes. As such, the core of the application isn't even a web app. In fact, I've implemented the same thing a couple of different times as a standalone program in Python or Java. However, running a standalone application like that requires a server where it can run, and that's not something I've always had access to. GAE's cron service provides the ability to run checks periodically - just like the main loop in my standalone applications - and GAE provides a large number of notification options, although I only used email initially. The application does have a web interface for configuring checks and viewing the status.
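
For what it's worth, hooking up the periodic checks is just an entry in the app's cron.yaml.  Here's a sketch with a made-up handler URL and interval rather than my exact configuration:

cron:
- description: check all monitored sites
  url: /tasks/run_checks
  schedule: every 15 minutes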

Project Analysis

My biggest problem developing the application was the fact that I was working with three separate but overlapping tools/frameworks: GAE, Django, and the Django helper. These all worked fine, but when I wanted to do some task, I didn't know where to look for the "right" way to do it. After wasting a bunch of time searching for things only to find that the Django way was the right way, I just quit asking and adopted the "try Django first" strategy. And then, the first time I went to apply that, I got burned - there are some slight differences in how the filter method is handled in the Django helper versus "native" Django. But for the most part, Django first is the way to go. In addition to the online documentation for these three tools/frameworks, I used Programming Google App Engine by Sanderson, which covers all of them - I highly recommend it.

Although my plan with Five Technologies is to use best practices, I am embarrassed to admit that I didn't practice much test driven development on this project despite having "home field advantage" with the technology - Django. Some of that is due to the exploratory nature of the programming - just figuring out which end is up, and some of it was the confusion over how to do tests - as noted above, the answer is "the Django way". Also, I found a bug in the app engine helper when loading a fixture, which hung me up - I'll be submitting a patch shortly.

One (non-technical) practice that I adopted was the Pomodoro Technique, and that worked pretty well. The 25-minute mini-sprints were really nice for containing technology-induced ADD. However, the watch I was using as my timer died during a pomodoro, which led to a very long and productive pomodoro.

In terms of the bigger goals of Five Technologies, I have another confession: I did not restrict myself to one week. I created version 0.1 and deployed it to the cloud within one week, but the week following this project I went to the Java Posse Roundup, and I kept working on the Django app, even though I should have been focusing on Java. It's just that after a week of working on GAE, I had built up some good momentum and didn't want to quit. I fear that this will be a real problem for the later projects because I won't have the luxury of continuing to work on the previous technology once I begin the next one.

In conclusion, the first technology experiment was a success, even if I wasn't dogmatic. What's next? Most likely Grails. Stay tuned.

Update:  I forgot to mention something cool I learned about - schema migration, or the lack thereof.  Before I began the project I was fretting about schema migration on GAE because I've been too lazy to learn something like South, and therefore I do schema migrations at a SQL command prompt.  Obviously, there is no SQL problem for a NoSQL database like Google's BigTable.  Then I forgot about the issue, but halfway through I looked up and said, "hey, I haven't been doing any migrations, but this all works."  Duh! - like many NoSQL databases, BigTable is schema-less, so there is no schema to migrate.  Problem solved.  OK, the application has to be prepared to deal with an attribute on a record/row that isn't present, but that code is basically the same as dealing with a NULL value that you might specify as the default value when you issue an ALTER TABLE to add a column.  Also, you can still imagine scenarios where you might have to do some sort of schema/data conversion, but without even consciously thinking about it, I managed to avoid those.  That was cool.

Monday, March 08, 2010

Five Technologies in Five Weeks

I am currently between consulting jobs, and during the down time, I have embarked on a project to learn five new (to me) technologies in five weeks. The reasons for doing this include:
  • Learn new things - this project is a variant on the "learn one new language a year" meme that's been going around. I'm just taking on five things (not necessarily languages) in much less than a year.
  • Bust some code - it's been a while since I've been able to do any hard-core coding. This will be a sprint which can blow out some cobwebs.
  • Improve my development practices by trying some new techniques and focusing on refining existing ones.
The technologies I plan on tackling are (subject to change):
  • Google App Engine (with Python and Django)
  • Grails
  • NetBeans Platform
  • Android
  • Griffon
I chose these technologies because I know of them but haven't actually built anything with them. Also, they are technologies that I'm interested in testing out to see how usable/effective they are and if I should pursue them further.

In a way, each week's technology is a bit like a "spike" in an agile project, only I'm not looking to evaluate/sketch out solutions to application problems but rather to evaluate technologies in a more abstract sense. Although, in some cases (e.g., NetBeans Platform and Griffon), I have some application ideas I'd like to implement, and I really am spiking possible solutions.

For each technology, I've got a modest application in mind. I have (or will have) a series of story cards describing various aspects of the application I'd like to create. And then, I will sprint for a week to implement as many of the stories as possible. I'll also post at least one blog entry as a retrospective for each sprint.

I've been planning this for some time - ever since the end was in sight on my last contract. The most significant threat to this undertaking is not failing at one or more technologies, but rather finding another contract before I've completed them. (There are worse things than finding a paying job when you're currently between jobs.) Another known disruption to the "five weeks" is that I'm taking one week off to go to the Java Posse Roundup, which will technically make this five technologies in six weeks, but that isn't as snappy as five in five.

Thursday, March 04, 2010

Hiring is Only for Managers?

A friend of mine told me that he just met the new guy on their team. I thought it was odd that he was just meeting a new team member, so I asked if he wasn't around during the interviews, and he told me that the developers on the team never interview candidates - only managers do that. As near as I can tell, they do this to minimize the time required during interviewing. This is just wrong.

In the words of Manager Tools, "hiring is the most important job that a manager does" because, in part, failures pull down the whole team for months or years. And, team-fit is a crucial part of that interviewing process. All things being equal (which they never are), it's better to hire a technical 8 who has a 10 personality than it is to hire a technical 10 who has only an 8 (or lower) personality. If nothing else, you can teach technical stuff, but you can't teach personality.

Dave Ramsey describes his lengthy, multi-round interview process that even includes dinner with spouses to ensure team (in the largest sense) fit. As lengthy as the process is, he points out that fixing a hiring mistake costs much more than the added time of proper interviews.

I'd even argue that this manager-only interviewing process produces shortcomings in the technical area, too, because the manager works off of a superficial checklist that s/he has to get through quickly in the interview. Thus, if a candidate is asked "do you know web framework X," and the answer is "no, but I know frameworks A through F, and I wrote a framework I call G," that candidate is treated the same as someone straight out of school who doesn't know any frameworks. This narrow-mindedness leads to hires that know framework X, but they store passwords as plain-text because no one told them not to, and none of them knows how to use MD5 to store a password (another story from my friend). These are what Erik Sink calls programmers, not developers - you want (well-rounded) developers.

So, with apologies to Georges Clemenceau, interviewing is too important to be left exclusively to the managers.