
Moving SQL_Delay host connection issues

Open tomkrouper opened this issue 8 years ago • 1 comments

I ran into an issue while moving a delayed host under a peer that was not delayed.

Command:

orchestrator -c relocate -i delayed.host.com -d peer.host.com

The relocate command stops replication on peer.host.com and runs a START SLAVE UNTIL... command on delayed.host.com. That completed: delayed.host.com (which is delayed by several hours) did wait, and did move below peer.host.com. However, peer.host.com did not restart replication. I assume this is due to a connection timeout.

I would recommend having the connection retry if it times out.
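To make the retry suggestion concrete, here is a minimal Python sketch of retrying an operation when it times out. It is not orchestrator code. `retry_on_timeout`, its parameters, and the `operation` callable (standing in for something like "restart replication on the peer") are all hypothetical names chosen for illustration, and it assumes the failure surfaces as a `TimeoutError`.

```python
import time


def retry_on_timeout(operation, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call operation(); on TimeoutError, wait and try again.

    The wait doubles after each failed attempt. Once all attempts are
    used up, the last TimeoutError is re-raised to the caller.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TimeoutError:
            if attempt == attempts:
                raise
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

The `sleep` parameter exists only so the waiting can be swapped out (for example in tests). With a delay of several hours on the replica, the number of attempts and the backoff would need to be sized accordingly, or the connection re-established right before the final step instead.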

As a feature request, it might also make sense to allow a relocate to change SQL_Delay temporarily, so that the host can catch up faster and be moved before a timeout even happens. That would probably need to be optional, since the command doesn't know why the host is delayed, and un-delaying it could cause issues.
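A rough Python sketch of what "change the delay temporarily" could look like, again not orchestrator code. `temporary_sql_delay` and the `execute` callable (which is assumed to run one SQL statement on the delayed replica) are hypothetical. The statements assume the delay was configured with `CHANGE MASTER TO MASTER_DELAY`, and the sketch stops replication before changing it.

```python
from contextlib import contextmanager


@contextmanager
def temporary_sql_delay(execute, original_delay, temporary_delay=0):
    """Lower the replication delay for the duration of a with-block.

    `execute` is a hypothetical callable that runs one SQL statement on
    the delayed replica. `original_delay` is the delay, in seconds, to
    restore when the block exits.
    """
    def set_delay(seconds):
        # Stop replication before changing the delay, then resume it.
        execute("STOP SLAVE")
        execute(f"CHANGE MASTER TO MASTER_DELAY = {int(seconds)}")
        execute("START SLAVE")

    set_delay(temporary_delay)
    try:
        yield
    finally:
        # Restore the original delay even if the work in the block fails.
        set_delay(original_delay)
```

Usage would be along the lines of `with temporary_sql_delay(run_sql, original_delay=14400): ...`, where the body waits for the replica to catch up and performs the move. The restore in `finally` is the important part: the host should not be left un-delayed if the relocate fails partway through.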

Unfortunately, I didn't have logging set up for this, so I can't be 100% sure my assessment is accurate. So read everything I said as, "This is how I saw it."

tomkrouper, Oct 25 '16 01:10

START SLAVE UNTIL suggests the server was relocated via standard "move", i.e. by coordinating binlog positions -- whereas we would have expected it to relocate via pseudo-gtid.

When relocating via pseudo-gtid there is no problem at all with delayed replicas. It does take longer to compute the coordinates from which they should replicate, because a more exhaustive search of the binary logs is involved, but otherwise it isn't a big deal.

So the question is: why did orchestrator choose the "classic" move rather than pseudo-gtid?

shlomi-noach, Oct 26 '16 09:10