moco icon indicating copy to clipboard operation
moco copied to clipboard

When MTS is enabled, replication may not resume if a replica node crashes.

Open shunki-fujita opened this issue 3 years ago • 1 comments

Describe the bug If MTS is enabled, replication may not resume when a replica node crashes. Specifically, the following error will occur.

{"level":"error","ts":1634545113.4992788,"logger":"***","msg":"error","error":"failed to configure replica instance 2: failed to change primary: Error 1802: CHANGE MASTER cannot be executed when the slave was stopped with an error or killed in MTS mode. Consider using RESET SLAVE or START SLAVE UNTIL.","stacktrace":"github.com/cybozu-go/moco/clustering.(*clusterManager).update.func1\n\t/work/clustering/manager.go:89"}

The documentation says the following: https://dev.mysql.com/doc/refman/8.0/en/start-replica.html

When using a multithreaded replica (replica_parallel_workers or slave_parallel_workers > 0), there is a chance of gaps in the sequence of transactions that have been executed from the relay log in the following cases:

  • The coordinator thread was stopped
  • An error occurs in the applier threads
  • mysqld shuts down unexpectedly

If a gap occurs, an error will occur when CHANGE MASTER is executed.

Before MySQL 8.0.26, issuing START REPLICA on a multithreaded replica with gaps in the sequence of transactions executed from the relay log generates a warning. In such a situation, the solution is to use START REPLICA UNTIL SQL_AFTER_MTS_GAPS, then issue RESET REPLICA to remove any remaining relay logs.

Therefore, the following execution will fail https://github.com/cybozu-go/moco/blob/04d00bbee82abccc907e69bad763d8b303bc330e/pkg/dbop/replication.go#L14

Environment MySQL 8.0.18

To Reproduce

  1. create a table
  2. use a while true loop to INSERT the appropriate data
  3. enter replica pod and kill -6 $(pidof mysqld) command to kill mysqld
  4. Replica node's mysqld may not become ready.

In that case, Error 1802 is written in the log of moco-controller.

Expected behavior If a replica node crashes and a gap occurs, it will automatically close the gap and resume replication.

shunki-fujita avatar Oct 18 '21 09:10 shunki-fujita

Thank you for the report. This seems a limitation for MySQL prior to 8.0.26.

ymmt2005 avatar Oct 18 '21 09:10 ymmt2005