Wednesday, January 07, 2015
I interviewed James Turnbull on the Software Engineering Radio podcast to discuss Docker.
Charles.
Thursday, August 07, 2014
Mitchell Hashimoto on the Vagrant Project
I almost forgot to mention it here, but my first podcast episode with SE Radio went live the last week of July. In it, I interviewed Mitchell Hashimoto about the Vagrant project. Stay tuned for more...
Charles.
Tuesday, April 15, 2014
Living Dangerously with MySQL Fatal Error 1236
This past weekend, the data center where our MySQL master resides suffered some issues. At first I thought it was just a connectivity problem, but it was a power outage, and our nodes were all rebooted. While cleaning up various messes from that, I discovered that our D/R slave in another data center was stuck with the error message:
Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'Client requested master to start replication from impossible position; the first event 'mysql-bin.014593' at 52888137, the last event read from '/var/log/mysql/mysql-bin.014593' at 4, the last byte read from '/var/log/mysql/mysql-bin.014593' at 4.'
Googling around, I found that this error means the slave had received more data from the master than the master managed to write to its own binary log before the crash. As a result, the slave was asking for a position beyond the end of that log file - and the master had started a new log when it restarted.
I wondered if it would be possible to just move on to the next log, and looking at more postings, I found that it is possible.
NOTE: this procedure is dangerous and may lead to data loss or corruption. I would never do it on a truly critical system like a financial system.
I figured it was worth a try. In the worst case, I would hose the slave and have to rebuild from a fresh dump, which was the only other alternative. I also realized that when the slave restarted, there might be some replication issues around that area in the "log transition."
As the blonde said, "do you want to live forever?"
So, I stopped the slave, moved it to the beginning of the next log file and started it again.
STOP SLAVE;
CHANGE MASTER TO MASTER_LOG_FILE = 'mysql-bin.014594', MASTER_LOG_POS = 4;
START SLAVE;
As anticipated, there were issues. There were a number of UPDATE statements that couldn't be applied because of a missing row. I steered around them, one at a time with:
STOP SLAVE;
SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1;
START SLAVE;
It was a hassle, and it took the interventions I expected, but it was quicker than shutting down my production applications to take a consistent dump, transferring it, and restoring it. And, while I was babysitting it, I could write a blog post.
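If you find yourself doing the same thing, the thing to watch while babysitting is SHOW SLAVE STATUS - standard MySQL, nothing specific to this incident:
-- Watch Slave_IO_Running / Slave_SQL_Running, Seconds_Behind_Master,
-- and Last_SQL_Error (the next statement you may need to skip).
SHOW SLAVE STATUS\G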
Your mileage may vary,
Charles.
Monday, November 25, 2013
Active Record - joins + include Methods Causing an Unintended Join
I recently fixed a problem in my Rails 3.2 app where I was using both the joins and includes methods in an Active Record query, and it was triggering a join that I didn't want. WTF? Why are you using includes and joins if you don't want a join?
I needed to run a query on table A, and I needed to apply criteria against another table B. Thus, I needed to (inner) join those two with the joins method. For the rows of A that met the search criteria, I wanted to eagerly load the corresponding rows from tables X, Y, and Z. Of course, I wanted to avoid a 3N+1 query situation, so I also used the includes method.
Typically, the includes method generates a query by IDs for the related objects. In my case, I was getting four INNER JOINs - one each for B, X, Y, and Z. Under "normal" circumstances, maybe this would have been OK, but my problem was that table Y is in a separate database, and you can't join across databases. (You can't really do transactions across databases, either.)
My original code used an array of named associations in the joins method - joins(:bs). On a lark, I decided to recode it to use a string - joins('INNER JOIN bs ON bs.a_id = as.id'), and it worked: I got the inner join for B and three individual queries for X, Y, and Z. Because Y is queried as a simple query with an array of IDs, the fact that Y is in another database isn't a problem - it just works.
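For illustration, here is roughly what the before and after looked like. The model and association names are invented; only the shape of the calls matters:
# Before: named association in joins - combined with includes, this
# produced one big query with INNER JOINs for B, X, Y, and Z,
# which fails because Y lives in another database.
A.joins(:bs)
 .includes(:xs, :ys, :zs)
 .where(bs: { active: true })

# After: a hand-written join string - B is still INNER JOINed for the
# search criteria, but X, Y, and Z come back via separate
# "WHERE id IN (...)" queries, so Y's separate database is no problem.
A.joins('INNER JOIN bs ON bs.a_id = as.id')
 .includes(:xs, :ys, :zs)
 .where('bs.active = ?', true)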
Anyway, if you've stumbled across this post while trying to solve the same problem, I hope this helps.
Charles.
Thursday, May 23, 2013
Ctags for Puppet - Three (previously missing) Pieces
Back in the day, when I was coding in C on the Unix kernel (before Linux even existed), I used vi's tags functionality extensively. We had a patched version of vi (before vim existed) that supported tag stacks and a hacked version of ctags that picked up all kinds of things like #defines, and it used the -D flags you used when compiling to get you to the right definition of something that was defined many times for various architectures, etc. But, when I moved to C++ with function overloading, ctags broke down for me, and I quit using it.
Recently, I inherited a pretty big Puppet code base. For a long time, I was just navigating it by hand using lots of find and grep commands. Finally, I broke down and figured out how to get ctags working for my Puppet code on OS X. Actually, other people figured it out, but here were the three pieces I had to string together.
A modern version of ctags - aka exuberant ctags. This is pretty easy to install with homebrew, but there is a rub: OS X already has a version of it installed, and depending on how your PATH is configured, the stock version might trump homebrew's version. Matt Pollito has a nice, concise blog post explaining how to cope with that.
Tell ctags about Puppet's syntax: Paul Nasrat has a little post describing the definitions needed in the ~/.ctags file and the invocation of ctags (a rough sketch of mine is included below).
Tell vim about Puppet's syntax: Netdata's vimrc file has the last piece:
set iskeyword=-,:,@,48-57,_,192-255
The colon is the key there (no pun intended) - without that, vim wasn't dealing with scoped identifiers and was just hitting the top-level modules.
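For reference, the ~/.ctags definitions end up looking something like this. This is my own rough sketch of the idea rather than Nasrat's exact file, so treat the regexes as a starting point:
--langdef=puppet
--langmap=puppet:.pp
--regex-puppet=/^class[ \t]+([a-zA-Z0-9_:\-]+)/\1/c,class/
--regex-puppet=/^define[ \t]+([a-zA-Z0-9_:\-]+)/\1/d,define/
--regex-puppet=/^node[ \t]+([a-zA-Z0-9_:\-]+)/\1/n,node/
With that in place, running ctags -R from the top of the Puppet tree produces a tags file that vim's tag commands can use.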
The last bit is for me to re-learn the muscle memory for navigating with tags that has atrophied after 20 years give or take. BTW, if you don't have tags, a cool approximation within a single file is '*' in command mode - it searches for the word under the cursor.
enjoy,
Charles.
Tuesday, May 07, 2013
Hadoop Beginner's Guide
Hadoop Beginner's Guide by Garry Turkington
ISBN: 1849517304
Hadoop Beginner's Guide is, as the title suggests, a new introductory book to the Hadoop ecosystem. It provides an introduction to how to get up and running with the core components of Hadoop (Map-Reduce and HDFS), some higher level tools like Hive, integration tools like Sqoop and Flume, and it also provides some good starting information relating to operational issues with Hadoop. This is not an exhaustive reference like Hadoop: The Definitive Guide, and for a beginner, that's probably a good thing. (In my day, we only had The Definitive Guide, and we liked it!)
Most of the topics are covered in a "dive right in" format. After some brief introduction to the topic, the author provides a list of commands or a block of code and invites you to run it. This is followed by a "What just happened?" section that explains the details of the operation or code. Personally, I don't care for that too much because the explanation is sometimes separated from the code by multiple pages, which was a real hassle reading this as a PDF. But, maybe that's just me.
As I mentioned, the book includes a couple of chapters on operations, which I found to be a nice addition to a beginner's book. Some of these operational details were explained by hands-on experiments like shutting down processes or nodes, in which case "What just happened?" is more like "What just broke?" The operational scenarios are by no means exhaustive (that's what you learn from production), but they provide the reader with some "real life" experience gained in a low-risk environment. And, they introduce a powerful method to learn more operational details: set up an experiment and find out what happens. Learning to learn is the most valuable thing you can gain from any book, class, or seminar.
Another nice feature of this book that I haven't seen in others is that the author includes examples using Amazon EC2 and Elastic Map Reduce (EMR). There are examples of both Map Reduce and Hive jobs on EMR. He doesn't do everything in both "raw" Hadoop and EMR because, once you know the basics, the same principles apply to both.
I do have some complaints about the book, but many of them are nit-picking or personal style. That said, I think the biggest thing this book would benefit from would be some very detailed "technical editing." By that I mean there are technical details that got corrupted during the book production process. For example, the hadoop command is often rendered as Hadoop in examples. There are plenty of similar formatting and typographic errors. Of course, an experienced Hadoop user wouldn't be tripped up by these, but this is a "beginner's guide," and such details can cause tremendous pain and suffering for newbies.
To wrap things up, Hadoop Beginner's Guide is a pretty good introduction to the Hadoop ecosystem. I'd recommend it to anyone just starting out with Hadoop before moving on to something more reference-oriented like The Definitive Guide.
enjoy,
Charles.
FTC disclaimer: I received a free review copy of this book from DZone. The links to Amazon above contain my Amazon Associates tag.
Friday, October 14, 2011
Why is my Rails app calling Solr so often?
I work on the back-end of a Rails app that uses Solr via Sunspot. Looking at the solr logs, I could see the same item being added/indexed repeatedly, sometimes right before it was deleted from solr. I didn't write the code, but I was tasked with figuring it out.
Glancing at the main path of the code didn't show anything obvious. I figured the superfluous solr calls were happening via callbacks somewhere in the graph of objects related to my object in solr, but which one(s)? Again, I didn't write the code; I just had to make it perform.
I hit on the idea of monkey-patching (for good, not evil) the Sunspot module. Fortunately, most/all of the methods on the Sunspot module just forward the call onto the session object. So, it's really easy to replace the original call with anything you want and still call the real Sunspot code, if that's what you want to do.
This is so easy to do that I even did it the first time in the rails console. In that case, I was happy to abort the index operation when it first happened. So, I whipped this up in a text file and pasted it into the console:
module Sunspot
  class << self
    def index(*objects)
      raise "not gonna do it!"
    end
  end
end
Then, I invoked the destroy operation that was triggering the solr adds, got the stack trace, and could clearly see which dependent object was causing the index operation.
For another case, I needed to run a complex workflow in a script to trigger the offending solr operations. In that case, I wanted something automatically installed when the script started up, and I wanted something that didn't abort - all I wanted was a stack trace. So, I installed the monkey-patch in config/initializers/sunspot.rb and had a more involved index function:
module Sunspot
  class << self
    def index(*objects)
      puts "Indexing the following objects:"
      objects.each { |o| puts "#{o.class} - #{o.id}" }
      puts "From: =============="
      raise rescue puts $!.backtrace
      puts "==============\n"
      session.index(*objects)
    end
  end
end
The session.index(*objects) call at the end is the body of the real version of the index method - like I said, trivial to re-implement; no alias chaining required.
Maybe there's some cooler way to figure this out, but this worked for me.
enjoy,
Charles.
Thursday, August 18, 2011
Rails/Rspec does not clean up model instances on MySQL
I recently solved a thorn in my side relating to some RSpec tests in our code base when running on my development machine using MySQL. For some reason, some instances that were created using Factory Girl weren't getting cleaned up, which in turn would cause subsequent test runs to fail because of duplicate data. So, I'd DELETE everything from the affected tables at the MySQL prompt. I looked in the test.log file, and I could see the save points being issued before the objects were created, but the objects weren't getting removed at the end of the test.
I didn't have a lot of time to look into it, and I didn't know where to look - RSpec, Factory Girl, Rails? So, in the short term, I just added after(:each) blocks to destroy the objects. And, I moved on.
Then, I was dumping schemas in MySQL using SHOW CREATE TABLE in order to analyze some tables and indexes, and I noticed the storage ENGINE flag on the tables. I went back and looked at the tables in my test database that were giving me trouble, and, of course(?), they were MyISAM rather than InnoDB. So, transaction rollback (used to clean up after tests) didn't work.
I changed the storage engine on those tables (ALTER TABLE t1 ENGINE = InnoDB), commented out the manual clean-up code, and voila! It works right now. Pretty obvious in retrospect, but I didn't even know where to start looking in our stack.
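If you want to check your own test database for the same issue, a query along these lines will list the offending tables (the schema name 'myapp_test' is made up - substitute your test database's name):
-- List tables in the test database that aren't using InnoDB
SELECT TABLE_NAME, ENGINE
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = 'myapp_test'
  AND ENGINE <> 'InnoDB';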
I hope this helps some other poor souls, too.
Charles.
Tuesday, April 26, 2011
Freeing up phone space on Android
For the last couple of months my Motorola Droid running Android 2.2.2 has been complaining about being "low on space" for the phone, not the SD card. I pruned some apps, but that didn't help much. Things really came to a head this morning when my phone was so low on memory that it was no longer downloading email.
I found this article to be quite helpful -
http://www.androidcentral.com/monthly-maintenance-keeping-things-speedy
For me, the two big ones were Messaging and the Browser cache. I had a couple of threads in Messaging containing a number of pictures. Once I saved the pictures off to the SD card, I purged the threads, which freed up ~20MB. Clearing the browser cache freed another ~20MB, but that will probably evaporate again as the browser caches things.
Here's a minor whine about Android: the SD card and phone storage settings page tells you how big your SD card is and how much space is remaining, but the phone storage just says how much is left. Without knowing how much I had to start with, it's hard to know if, say, 20MB is a lot or not. As near as I can tell, Android seems to complain when the space is less than 25MB.
Update: I ran out of space again, and clearing the browser cache didn't help. After bumping around some more, first I discovered that in "Manage Applications" the one and only menu option is to sort by size. Doing that revealed that the new pig was the (post pay-wall version) New York Times application. It was using over 60MB of data space in the Phone Storage area. The app doesn't have a "clear cache" function, so I used the "Clear Data" button from within Manage Applications, and I was back in action.
enjoy,
Charles.
Friday, November 05, 2010
A Fix for "Exceeded MAX_FAILED_UNIQUE_FETCHES" in Hadoop
In a project I'm currently working on, we're moving a bunch of our back-end processing to Hadoop. We started with a two-node cluster: one master, one slave. That seemed to work fine. Then, we went to four nodes, and about the same time I was testing out a new Hadoop job. The (single) reducer was hanging with this somewhat cryptic message:
Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out
I went out to the slave node and looked through the job logs, and I could see that it was timing out trying to transfer data from one of the other slave nodes - an upstream mapper node. Upon closer scrutiny of the log file, I realized that Hadoop was trying to transfer from the other slave's public IP address, which is behind a firewall that blocks public access.
Key take-away number one: when you're just starting out with Hadoop, if you only have one slave, you've only demonstrated one real communication path: master-to-slave. Your cluster isn't doing any slave-to-slave transfers because everything was on the one slave. Also, our initial job had no reducer, so it ran fine on the new, 4-node cluster because it was still only master-slave communication.
For some reason, the mapper slave was advertising the location of the map output data via its public IP address. My first attempt at fixing this problem involved the dfs.datanode.dns.interface configuration parameter (and its mapred equivalent). This tells Hadoop that when a process (mapred or dfs) wants to figure out its host name, it should use the IP address associated with the given interface. (You could even have dfs and mapred using separate interfaces for additional throughput.)
This failed for me because I had one interface with two addresses, not two interfaces. I dug through the Hadoop DNS code (org.apache.hadoop.net.DNS - God, I love open-source: you can just look for yourself) and saw that if there is one interface, the code loops through the IP addresses and performs reverse DNS lookups and takes the first successful result. I was fortunate in that the private IP address was coming up first in that enumeration of the IPs on the interface, but it still wasn't working. I talked to our system admin/configuration guru. It turns out that our hosting provider doesn't provide reverse DNS for those private IP addresses. We could have set up our own DNS server for just these reverse lookups, but there was a brute-force option available to us.
You can bypass all of Hadoop's efforts to automatically figure out the slave's host name by specifying the slave.host.name parameter in the configuration files. If that is set, Hadoop will just take your word for it and use the name you provide. Now, in theory, this might be onerous - it means you have a different configuration file per-slave. However, our cluster is configured and maintained via Puppet. So, our puppet master just tweaked his Puppet script, and we never looked back.
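For reference, the property itself is just an ordinary entry in the node's Hadoop configuration files (hdfs-site.xml and mapred-site.xml in that era). The hostname below is made up - it's whatever private name the other nodes can actually reach:
<!-- Force this node to advertise itself by its private, reachable name -->
<property>
  <name>slave.host.name</name>
  <value>slave2.internal.example.com</value>
</property>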
Take-away number two: Exceeded MAX_FAILED_UNIQUE_FETCHES could mean a simple connectivity problem. I'm sure there are other possible causes, but an inability to connect between slaves is comparatively simple to troubleshoot.
enjoy,
Charles.