resource-agents
resource-agents copied to clipboard
Prevent outdated/broken mysql slave being promoted as master
This merge request greatly enhance mysql resource stability. It tries to address cases that happened in live environments:
-
broken/outdated slave promoted as mysql master due to insufficient error checking in mysql monitor for slaves
-
broken/outdated slave promoted as mysql master due to specific (not-so-smart default) configuration requirements unrelated to check_slave state (test_table set and OCF_CHECK_LEVEL > 0)
-
broken/outdated slave promoted as mysql master due to master scores not being reclaimed/removed from slaves that have serious replication errors or are in state unsuitable for master promotion.
This merge request is a result of multiple hours spent during on-call and out-of-call firefighting, manual out-of-sync master-merging, postmortem writing, debugging crm and its default mysql resource agent.
IMPORTANT changes
-
check_slave is now ALWAYS called on slave in ms mode - Previously it was called only if test_table was set, which I assume to be due to an error. To prevent stale/broken slave from being promoted as master I strongly recommend calling check_slave by default on slaves.
-
in check_slave $Slave_SQL_Running/$Slave_IO_Running == 'NO' verify slave vs cluster master before START SLAVE is now performed only if ensure_follow_correct_master=true (new var. default is false)
-
Create empty file mysql_master_lock_file=/var/lock/replication_master (new var.) on master promote, erase it on master demote. Useful for for external scripts.
Introducing new function quit_on_error that is used across monitor:
- uses common function to terminate with success or error - quit_on_error
- quit_on_error always clear reader vip so erroneous slave does not receive reads
- quit_on_error always delete master score so erroneous does not get promoted as master
- quit_on_error log the error
- quit_on_error can terminate with the desired error code
Enhanced check_slave to detect failed or stale slaves:
- Consider slave as failed if Last_SQL_Error/Last_SQL_Errno
- Consider slave as failed if Last_IO_Error/Last_IO_Errno
- Consider slave as failed if Last_Error/Last_Errno
- Attempt to start slave $Slave_SQL_Running/$Slave_IO_Running NO
- Code simplification for evict_outdated_slaves handling
- Code simplification for slave master score handling
Code readability greatly simplified.
Is there any chance you could clean up this series of commits a bit? It's difficult to review when there are so many commits to look at. Thanks!
@krig I intentionally split each change in different commits.
- Each commit addresses single problem at a time.
- This should allow the one who review the code to follow the idea of the author and see why exactly is he doing particular modification
- I assume that it should be easier for one to review changes of 5-10 lines per commit instead of getting 300 lines diff where one can easily lose track of small but important details.
- If there are multiple commits that should be merged like in my request one can easily see the entire diff.
However if you insist I can make the modifications requested. Just explain what do you mean by clean-up. Doing more than 1 change per commit?
I guess maybe it's just a lot of changes! It looked more like an evolution over time that could perhaps be condensed. I will take another look.