Tuesday, April 15, 2014

Living Dangerously with MySQL Fatal Error 1236

This past weekend, the data center where our MySQL master resides suffered some issues.  At first I thought it was just some connectivity issues, but it was a power outage and our nodes were all rebooted.  While cleaning up various messes from that, I discovered that our D/R slave in another data center was stuck with the error message:
Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'Client requested master to start replication from impossible position; the first event 'mysql-bin.014593' at 52888137, the last event read from '/var/log/mysql/mysql-bin.014593' at 4, the last byte read from '/var/log/mysql/mysql-bin.014593' at 4.'

Googling around, I found that this error means that the slave got more data from the master than what the master wrote to its log before the crash.  The result was that the slave was asking for data beyond the end of the log file - the master started a new log when it restarted.

I wondered if it would be possible to just move on to the next log, and looking at more postings, I found that it is possible.

NOTE: this procedure is dangerous and may lead to data loss or corruption.  I would never do it on a truly critical system like a financial system.

I figured it was worth a try.  In the worst case, I would hose the slave and have to rebuild from a fresh dump, which was the only other alternative.  I also realized that when the slave restarted, there might be some replication issues around that area in the "log transition."

As the blonde said, "do you want to live forever?"

So, I stopped the slave, moved it to the beginning of the next log file and started it again.
STOP SLAVE;
CHANGE MASTER TO MASTER_LOG_POS = 4;
CHANGE MASTER TO MASTER_LOG_FILE = 'mysql-bin. 014593';
START SLAVE;

As anticipated, there were issues.  There were a number of UPDATE statements that couldn't be applied because of a missing row.  I steered around them, one at a time with:
STOP SLAVE ; 
SET GLOBAL SQL_SLAVE_SKIP_COUNTER =1; 
START SLAVE ;

It was a hassle, and it took interventions that I expected, but it was quicker than shutting down my production applications to take a consistent dump, transferring, and restoring it.  And, while I was babysitting it, I could write a blog post.

Your milage may vary,
Charles.