resource-agents icon indicating copy to clipboard operation
resource-agents copied to clipboard

Prevent outdated/broken mysql slave being promoted as master

Open vaLski opened this issue 7 years ago • 3 comments

This merge request greatly enhance mysql resource stability. It tries to address cases that happened in live environments:

  • broken/outdated slave promoted as mysql master due to insufficient error checking in mysql monitor for slaves

  • broken/outdated slave promoted as mysql master due to specific (not-so-smart default) configuration requirements unrelated to check_slave state (test_table set and OCF_CHECK_LEVEL > 0)

  • broken/outdated slave promoted as mysql master due to master scores not being reclaimed/removed from slaves that have serious replication errors or are in state unsuitable for master promotion.

This merge request is a result of multiple hours spent during on-call and out-of-call firefighting, manual out-of-sync master-merging, postmortem writing, debugging crm and its default mysql resource agent.

IMPORTANT changes

  • check_slave is now ALWAYS called on slave in ms mode - Previously it was called only if test_table was set, which I assume to be due to an error. To prevent stale/broken slave from being promoted as master I strongly recommend calling check_slave by default on slaves.

  • in check_slave $Slave_SQL_Running/$Slave_IO_Running == 'NO' verify slave vs cluster master before START SLAVE is now performed only if ensure_follow_correct_master=true (new var. default is false)

  • Create empty file mysql_master_lock_file=/var/lock/replication_master (new var.) on master promote, erase it on master demote. Useful for for external scripts.

Introducing new function quit_on_error that is used across monitor:

  • uses common function to terminate with success or error - quit_on_error
  • quit_on_error always clear reader vip so erroneous slave does not receive reads
  • quit_on_error always delete master score so erroneous does not get promoted as master
  • quit_on_error log the error
  • quit_on_error can terminate with the desired error code

Enhanced check_slave to detect failed or stale slaves:

  • Consider slave as failed if Last_SQL_Error/Last_SQL_Errno
  • Consider slave as failed if Last_IO_Error/Last_IO_Errno
  • Consider slave as failed if Last_Error/Last_Errno
  • Attempt to start slave $Slave_SQL_Running/$Slave_IO_Running NO
  • Code simplification for evict_outdated_slaves handling
  • Code simplification for slave master score handling

Code readability greatly simplified.

vaLski avatar Jul 26 '17 14:07 vaLski

Is there any chance you could clean up this series of commits a bit? It's difficult to review when there are so many commits to look at. Thanks!

krig avatar Aug 08 '17 06:08 krig

@krig I intentionally split each change in different commits.

  • Each commit addresses single problem at a time.
  • This should allow the one who review the code to follow the idea of the author and see why exactly is he doing particular modification
  • I assume that it should be easier for one to review changes of 5-10 lines per commit instead of getting 300 lines diff where one can easily lose track of small but important details.
  • If there are multiple commits that should be merged like in my request one can easily see the entire diff.

However if you insist I can make the modifications requested. Just explain what do you mean by clean-up. Doing more than 1 change per commit?

vaLski avatar Aug 08 '17 06:08 vaLski

I guess maybe it's just a lot of changes! It looked more like an evolution over time that could perhaps be condensed. I will take another look.

krig avatar Aug 08 '17 09:08 krig