gh-ost
gh-ost copied to clipboard
gh-ost in a loop of not being able to cut-over.
Situation:
Migration was 100% complete except for the binlog "Backlog" was at 1000/1000 and I didn't realize several hours behind and I attempted to cut-over.
Cut-over:
--postpone-cut-over-flag-file
was set. When I attempted to cut-over the cutover timed out. We do a socket command to unpostpone our migrations. Here is the log.
Copy: 1504139030/1504139030 100.0%; Applied: 958335966; Backlog: 1000/1000; Time: 146h50m49s(total), 126h57m23s(copy); streamer: mysql-bin.151538:57981919; State: postponing cut-over; ETA: due
2019-09-29 14:43:15 INFO Grabbing voluntary lock: gh-ost.220158630.lock
2019-09-29 14:43:15 INFO Setting LOCK timeout as 6 seconds
2019-09-29 14:43:15 INFO Looking for magic cut-over table
2019-09-29 14:43:15 INFO Creating magic cut-over table `my_database`.`_my_table_20190923115225_del`
2019-09-29 14:43:15 INFO Magic cut-over table created
2019-09-29 14:43:15 INFO Locking `my_database`.`my_table`, `my_database`.`_my_table_20190923115225_del`
2019-09-29 14:43:15 INFO Tables locked
2019-09-29 14:43:15 INFO Session locking original & magic tables is 220158630
2019-09-29 14:43:15 INFO Writing changelog state: AllEventsUpToLockProcessed:1569793395598070384
2019-09-29 14:43:15 INFO Waiting for events up to lock
Copy: 1504139030/1504139030 100.0%; Applied: 958341156; Backlog: 1000/1000; Time: 146h50m50s(total), 126h57m23s(copy); streamer: mysql-bin.151538:65463369; State: migrating; ETA: due
2019-09-29 14:43:18 ERROR Timeout while waiting for events up to lock
2019-09-29 14:43:18 ERROR 2019-09-29 14:43:18 ERROR Timeout while waiting for events up to lock
2019-09-29 14:43:18 INFO Looking for magic cut-over table
2019-09-29 14:43:18 INFO Will now proceed to drop magic table and unlock tables
2019-09-29 14:43:18 INFO Dropping magic cut-over table
2019-09-29 14:43:18 INFO Releasing lock from `my_database`.`my_table`, `my_database`.`_my_table_20190923115225_del`
2019-09-29 14:43:18 INFO Tables unlocked
2019-09-29 14:43:19 INFO executing gh-ost-on-begin-postponed hook: /data/github/shell/bin/gh-ost-hooks/prod/gh-ost-on-begin-postponed-hook
Copy: 1504139030/1504139030 100.0%; Applied: 958355546; Backlog: 1000/1000; Time: 146h50m55s(total), 126h57m23s(copy); streamer: mysql-bin.151538:88596107; State: postponing cut-over; ETA: due
Where I probably screwed up to expose the bug:
In my hopes to get the cut-over to work, I went to the host that was running the migration and deleted the postpone flag file. This didn't seem to work, so after a short time I put the flag in place. Shortly after this @ggunson pointed out the binary lag to me.
So we waited. When things looked like they were close to caught up again, I attempted the socket cut-over again. It failed, but then we noticed something in the stream
. It stopped making any progress. It sat on the same binary log and position. We waited, it stayed the same. Then we tried to do a cut-over again. The stream
moved forward again. (Then got stuck again). At this point I tried to just delete the postpone file again, and I watched the migration attempt and fail to cut-over several times before the migration itself failed.
Is this a bug?
Probably yes. Is it going to be easy to reproduce? Probably no. If you have any thoughts, let me know.