orchestrator
Moving SQL_Delay host connection issues
I ran into an issue while moving a delayed host under a peer that was not delayed.
Command:
orchestrator -c relocate -i delayed.host.com -d peer.host.com
The relocate command stops replication on peer.host.com and runs a START SLAVE UNTIL... command on delayed.host.com. That completed, and delayed.host.com (which is delayed by several hours) did wait and did move below peer.host.com. However, peer.host.com did not restart replication. I assume this is due to a connection timeout.
I would recommend setting up the connection to retry if the connection times out.
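To illustrate what I mean by retrying, here is a minimal sketch in Python. It is not orchestrator code: `retry` and the `operation` callable (which would stand in for "reconnect and restart replication on the peer") are hypothetical names, and the attempt count and backoff values are assumptions.

```python
import time


def retry(operation, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call operation() until it succeeds or the attempts run out.

    Hypothetical helper, not part of orchestrator. Only connection
    failures and timeouts are retried; other errors propagate at once.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return operation()
        except (ConnectionError, TimeoutError) as err:
            last_error = err
            # Back off exponentially, but do not sleep after the final attempt.
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise last_error
```

The point is only that a long wait on the delayed host should not leave the peer stopped: if the first attempt to restart replication fails because the connection went stale, a fresh attempt would likely succeed.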
If I wanted to make a feature request, it also might make sense to allow a relocate to change SQL_Delay temporarily so that the host can catch up faster and be moved before a timeout even happens. That might need to be an optional item, since the command really doesn't know why the host is delayed, and un-delaying it could cause issues.
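As a rough sketch of that feature request, the statements involved might look like the following. This is an assumption about how it could work, not something orchestrator does today; `temporary_delay_statements` is a hypothetical helper, and it assumes the delay was configured through MySQL's `CHANGE MASTER TO MASTER_DELAY` option (which is what SQL_Delay reports).

```python
def temporary_delay_statements(original_delay_seconds):
    """Return (remove, restore) lists of SQL statements.

    Hypothetical helper, not part of orchestrator. The first list drops
    the replication delay to zero so the replica can catch up; the second
    puts the original delay back once the relocate is done.
    """
    if original_delay_seconds < 0:
        raise ValueError("delay must be non-negative")
    remove = [
        "STOP SLAVE",
        "CHANGE MASTER TO MASTER_DELAY = 0",
        "START SLAVE",
    ]
    restore = [
        "STOP SLAVE",
        f"CHANGE MASTER TO MASTER_DELAY = {int(original_delay_seconds)}",
        "START SLAVE",
    ]
    return remove, restore
```

The restore step is the important part: if the delay exists as a safety net, it should come back as soon as the move completes, which is also why this would need to be opt-in.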
Unfortunately, I didn't have logging set up for this, so I can't be 100% sure my assessment is completely accurate. So read everything I said as, "This is how I saw it."
START SLAVE UNTIL suggests the server was relocated via standard "move", i.e. by coordinating binlog positions, whereas we would have expected it to relocate via pseudo-gtid.
When relocating via pseudo-gtid there is no problem at all with delayed replicas. That is, it takes longer to compute the coordinates from which they should replicate, because a more exhaustive search of binary logs is involved; but otherwise it isn't a big deal.
So the problem is: why did orchestrator choose to use "classic" move rather than pseudo-gtid?