gh-ost icon indicating copy to clipboard operation
gh-ost copied to clipboard

gh-ost in a loop of not being able to cut-over.

Open tomkrouper opened this issue 5 years ago • 6 comments

Situation:

Migration was 100% complete except for the binlog "Backlog" was at 1000/1000 and I didn't realize several hours behind and I attempted to cut-over.

Cut-over:

--postpone-cut-over-flag-file was set. When I attempted to cut-over the cutover timed out. We do a socket command to unpostpone our migrations. Here is the log.

Copy: 1504139030/1504139030 100.0%; Applied: 958335966; Backlog: 1000/1000; Time: 146h50m49s(total), 126h57m23s(copy); streamer: mysql-bin.151538:57981919; State: postponing cut-over; ETA: due
2019-09-29 14:43:15 INFO Grabbing voluntary lock: gh-ost.220158630.lock
2019-09-29 14:43:15 INFO Setting LOCK timeout as 6 seconds
2019-09-29 14:43:15 INFO Looking for magic cut-over table
2019-09-29 14:43:15 INFO Creating magic cut-over table `my_database`.`_my_table_20190923115225_del`
2019-09-29 14:43:15 INFO Magic cut-over table created
2019-09-29 14:43:15 INFO Locking `my_database`.`my_table`, `my_database`.`_my_table_20190923115225_del`
2019-09-29 14:43:15 INFO Tables locked
2019-09-29 14:43:15 INFO Session locking original & magic tables is 220158630
2019-09-29 14:43:15 INFO Writing changelog state: AllEventsUpToLockProcessed:1569793395598070384
2019-09-29 14:43:15 INFO Waiting for events up to lock
Copy: 1504139030/1504139030 100.0%; Applied: 958341156; Backlog: 1000/1000; Time: 146h50m50s(total), 126h57m23s(copy); streamer: mysql-bin.151538:65463369; State: migrating; ETA: due
2019-09-29 14:43:18 ERROR Timeout while waiting for events up to lock
2019-09-29 14:43:18 ERROR 2019-09-29 14:43:18 ERROR Timeout while waiting for events up to lock
2019-09-29 14:43:18 INFO Looking for magic cut-over table
2019-09-29 14:43:18 INFO Will now proceed to drop magic table and unlock tables
2019-09-29 14:43:18 INFO Dropping magic cut-over table
2019-09-29 14:43:18 INFO Releasing lock from `my_database`.`my_table`, `my_database`.`_my_table_20190923115225_del`
2019-09-29 14:43:18 INFO Tables unlocked
2019-09-29 14:43:19 INFO executing gh-ost-on-begin-postponed hook: /data/github/shell/bin/gh-ost-hooks/prod/gh-ost-on-begin-postponed-hook

Copy: 1504139030/1504139030 100.0%; Applied: 958355546; Backlog: 1000/1000; Time: 146h50m55s(total), 126h57m23s(copy); streamer: mysql-bin.151538:88596107; State: postponing cut-over; ETA: due

Where I probably screwed up to expose the bug:

In my hopes to get the cut-over to work, I went to the host that was running the migration and deleted the postpone flag file. This didn't seem to work, so after a short time I put the flag in place. Shortly after this @ggunson pointed out the binary lag to me.

So we waited. When things looked like they were close to caught up again, I attempted the socket cut-over again. It failed, but then we noticed something in the stream. It stopped making any progress. It sat on the same binary log and position. We waited, it stayed the same. Then we tried to do a cut-over again. The stream moved forward again. (Then got stuck again). At this point I tried to just delete the postpone file again, and I watched the migration attempt and fail to cut-over several times before the migration itself failed.

Is this a bug?

Probably yes. Is it going to be easy to reproduce? Probably no. If you have any thoughts, let me know.

tomkrouper avatar Sep 30 '19 15:09 tomkrouper