
cut-over not retrying?

Open babinomec opened this issue 3 years ago • 9 comments

Hi,

When running gh-ost, the cut-over phase often fails due to a timeout (unfortunately our slaves often lag, with occasional spikes). That is not a problem in itself, but gh-ost does not seem to wait and retry later, which I think it is supposed to do.

I am running version 1.1.1

(redacted command and log)

This is what I am running (the alter just adds some columns):

./gh-ost \
  --user="ghost" \
  --password="********" \
  --host="slave" \
  --allow-master-master \
  --assume-master-host="master" \
  --database="db" \
  --table="table" \
  --alter="alter" \
  --verbose \
  --skip-foreign-key-checks \
  --skip-strict-mode \
  --initially-drop-ghost-table \
  --initially-drop-old-table \
  --exact-rowcount \
  --ok-to-drop-table \
  --max-lag-millis=10000 \
  --execute
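
For context, the 3-second wait and the "Setting LOCK timeout as 6 seconds" line in the output below come from gh-ost's cut-over settings, which I left at their defaults. A sketch of the same invocation with those knobs raised, assuming --cut-over-lock-timeout-seconds and --default-retries behave as documented (the values here are only illustrative):

# --cut-over-lock-timeout-seconds: how long the cut-over may wait on its locks (default 3;
#   the session LOCK timeout reported in the log appears to be twice this value)
# --default-retries: how many times a failed operation, cut-over included, is retried (default 60)
./gh-ost \
  --user="ghost" \
  --password="********" \
  --host="slave" \
  --allow-master-master \
  --assume-master-host="master" \
  --database="db" \
  --table="table" \
  --alter="alter" \
  --verbose \
  --skip-foreign-key-checks \
  --skip-strict-mode \
  --initially-drop-ghost-table \
  --initially-drop-old-table \
  --exact-rowcount \
  --ok-to-drop-table \
  --max-lag-millis=10000 \
  --cut-over-lock-timeout-seconds=10 \
  --default-retries=120 \
  --execute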

Here is the output at the cut-over phase:

Copy: 14327000/14362256 99.8%; Applied: 12041; Backlog: 0/1000; Time: 1h16m10s(total), 1h15m58s(copy); streamer: bin-log.005490:906474118; Lag: 2.68s, State: migrating; ETA: 11s
Copy: 14334000/14362256 99.8%; Applied: 12047; Backlog: 2/1000; Time: 1h16m11s(total), 1h15m59s(copy); streamer: bin-log.005490:920606200; Lag: 2.38s, State: migrating; ETA: 8s
Copy: 14342000/14362256 99.9%; Applied: 12053; Backlog: 0/1000; Time: 1h16m12s(total), 1h16m0s(copy); streamer: bin-log.005490:934403718; Lag: 1.78s, State: migrating; ETA: 6s
Copy: 14350000/14362256 99.9%; Applied: 12055; Backlog: 0/1000; Time: 1h16m13s(total), 1h16m1s(copy); streamer: bin-log.005490:947459950; Lag: 1.48s, State: migrating; ETA: 3s
Copy: 14359000/14362256 100.0%; Applied: 12058; Backlog: 0/1000; Time: 1h16m14s(total), 1h16m2s(copy); streamer: bin-log.005490:961146997; Lag: 1.18s, State: migrating; ETA: 1s
2021-05-18 18:18:18 INFO Row copy complete
Copy: 14362256/14362256 100.0%; Applied: 12060; Backlog: 0/1000; Time: 1h16m14s(total), 1h16m2s(copy); streamer: bin-log.005490:966837215; Lag: 1.08s, State: migrating; ETA: due
2021-05-18 18:18:18 INFO Grabbing voluntary lock: gh-ost.1024388992.lock
2021-05-18 18:18:18 INFO Setting LOCK timeout as 6 seconds
2021-05-18 18:18:18 INFO Looking for magic cut-over table
2021-05-18 18:18:18 INFO Creating magic cut-over table `db`.`_table_del`
2021-05-18 18:18:18 INFO Magic cut-over table created
2021-05-18 18:18:18 INFO Locking `db`.`table`, `db`.`_table_del`
2021-05-18 18:18:18 INFO Tables locked
2021-05-18 18:18:18 INFO Session locking original & magic tables is 1024388992
2021-05-18 18:18:18 INFO Writing changelog state: AllEventsUpToLockProcessed:1621354698945390596
2021-05-18 18:18:18 INFO Waiting for events up to lock
Copy: 14362256/14362256 100.0%; Applied: 12060; Backlog: 0/1000; Time: 1h16m15s(total), 1h16m2s(copy); streamer: bin-log.005490:970667197; Lag: 1.28s, State: migrating; ETA: due
2021-05-18 18:18:21 ERROR Timeout while waiting for events up to lock
2021-05-18 18:18:21 ERROR 2021-05-18 18:18:21 ERROR Timeout while waiting for events up to lock
2021-05-18 18:18:21 INFO Looking for magic cut-over table
2021-05-18 18:18:21 INFO Will now proceed to drop magic table and unlock tables
2021-05-18 18:18:21 INFO Dropping magic cut-over table
2021-05-18 18:18:21 INFO Dropping magic cut-over table
2021-05-18 18:18:21 INFO Dropping table `db`.`_table_del`
2021-05-18 18:18:24 INFO Intercepted changelog state AllEventsUpToLockProcessed
2021-05-18 18:18:24 INFO Handled changelog state AllEventsUpToLockProcessed
Copy: 14362256/14362256 100.0%; Applied: 12061; Backlog: 1/1000; Time: 1h16m20s(total), 1h16m2s(copy); streamer: bin-log.005490:989474686; Lag: 3.18s, State: migrating; ETA: due
Copy: 14362256/14362256 100.0%; Applied: 12061; Backlog: 0/1000; Time: 1h16m25s(total), 1h16m2s(copy); streamer: bin-log.005490:1014763533; Lag: 0.08s, State: migrating; ETA: due
Copy: 14362256/14362256 100.0%; Applied: 12061; Backlog: 0/1000; Time: 1h16m30s(total), 1h16m2s(copy); streamer: bin-log.005490:1026013234; Lag: 0.08s, State: migrating; ETA: due
Copy: 14362256/14362256 100.0%; Applied: 12061; Backlog: 0/1000; Time: 1h16m35s(total), 1h16m2s(copy); streamer: bin-log.005490:1036196578; Lag: 0.08s, State: migrating; ETA: due
Copy: 14362256/14362256 100.0%; Applied: 12061; Backlog: 0/1000; Time: 1h16m40s(total), 1h16m2s(copy); streamer: bin-log.005490:1047207998; Lag: 0.08s, State: migrating; ETA: due
Copy: 14362256/14362256 100.0%; Applied: 12061; Backlog: 0/1000; Time: 1h16m45s(total), 1h16m2s(copy); streamer: bin-log.005490:1057140917; Lag: 0.08s, State: migrating; ETA: due
Copy: 14362256/14362256 100.0%; Applied: 12061; Backlog: 0/1000; Time: 1h16m50s(total), 1h16m2s(copy); streamer: bin-log.005490:1062986858; Lag: 0.08s, State: migrating; ETA: due
Copy: 14362256/14362256 100.0%; Applied: 12061; Backlog: 0/1000; Time: 1h16m55s(total), 1h16m2s(copy); streamer: bin-log.005490:1064318221; Lag: 4.48s, State: migrating; ETA: due
Copy: 14362256/14362256 100.0%; Applied: 12061; Backlog: 0/1000; Time: 1h17m0s(total), 1h16m2s(copy); streamer: bin-log.005490:1073646385; Lag: 4.78s, State: migrating; ETA: due
[2021/05/18 18:19:04] [info] binlogsyncer.go:723 rotate to (bin-log.005491, 4)
[2021/05/18 18:19:04] [info] binlogsyncer.go:723 rotate to (bin-log.005491, 4)
2021-05-18 18:19:04 INFO rotate to next log from bin-log.005491:1073742500 to bin-log.005491
2021-05-18 18:19:04 INFO rotate to next log from bin-log.005491:0 to bin-log.005491
Copy: 14362256/14362256 100.0%; Applied: 12061; Backlog: 0/1000; Time: 1h17m5s(total), 1h16m2s(copy); streamer: bin-log.005491:14353974; Lag: 0.08s, State: migrating; ETA: due
Copy: 14362256/14362256 100.0%; Applied: 12061; Backlog: 0/1000; Time: 1h17m10s(total), 1h16m2s(copy); streamer: bin-log.005491:26928263; Lag: 0.08s, State: migrating; ETA: due
Copy: 14362256/14362256 100.0%; Applied: 12061; Backlog: 0/1000; Time: 1h17m30s(total), 1h16m2s(copy); streamer: bin-log.005491:64743000; Lag: 0.08s, State: migrating; ETA: due
Copy: 14362256/14362256 100.0%; Applied: 12061; Backlog: 0/1000; Time: 1h18m0s(total), 1h16m2s(copy); streamer: bin-log.005491:118334528; Lag: 0.08s, State: migrating; ETA: due
Copy: 14362256/14362256 100.0%; Applied: 12061; Backlog: 0/1000; Time: 1h18m30s(total), 1h16m2s(copy); streamer: bin-log.005491:176850550; Lag: 0.08s, State: migrating; ETA: due
Copy: 14362256/14362256 100.0%; Applied: 12061; Backlog: 0/1000; Time: 1h19m0s(total), 1h16m2s(copy); streamer: bin-log.005491:236619046; Lag: 0.08s, State: migrating; ETA: due
Copy: 14362256/14362256 100.0%; Applied: 12061; Backlog: 0/1000; Time: 1h19m30s(total), 1h16m2s(copy); streamer: bin-log.005491:292654898; Lag: 0.08s, State: migrating; ETA: due
# Migrating `db`.`table`; Ghost table is `db`.`_table_gho`
# Migrating master:3306; inspecting slave:3306; executing on slave
# Migration started at Tue May 18 17:02:04 +0200 2021
# chunk-size: 1000; max-lag-millis: 10000ms; dml-batch-size: 10; max-load: ; critical-load: ; nice-ratio: 0.000000
# throttle-additional-flag-file: /tmp/gh-ost.throttle
# Serving on unix socket: /tmp/gh-ost.db.table.sock
Copy: 14362256/14362256 100.0%; Applied: 12061; Backlog: 0/1000; Time: 1h20m0s(total), 1h16m2s(copy); streamer: bin-log.005491:359697150; Lag: 0.08s, State: migrating; ETA: due
Copy: 14362256/14362256 100.0%; Applied: 12061; Backlog: 0/1000; Time: 1h20m30s(total), 1h16m2s(copy); streamer: bin-log.005491:427123115; Lag: 0.08s, State: migrating; ETA: due
Copy: 14362256/14362256 100.0%; Applied: 12061; Backlog: 0/1000; Time: 1h21m0s(total), 1h16m2s(copy); streamer: bin-log.005491:498372220; Lag: 0.08s, State: migrating; ETA: due
Copy: 14362256/14362256 100.0%; Applied: 12061; Backlog: 0/1000; Time: 1h21m30s(total), 1h16m2s(copy); streamer: bin-log.005491:564757429; Lag: 0.08s, State: migrating; ETA: due
Copy: 14362256/14362256 100.0%; Applied: 12061; Backlog: 0/1000; Time: 1h22m0s(total), 1h16m2s(copy); streamer: bin-log.005491:627971561; Lag: 0.08s, State: migrating; ETA: due
Copy: 14362256/14362256 100.0%; Applied: 12061; Backlog: 0/1000; Time: 1h22m30s(total), 1h16m2s(copy); streamer: bin-log.005491:700060664; Lag: 0.08s, State: migrating; ETA: due
Copy: 14362256/14362256 100.0%; Applied: 12061; Backlog: 0/1000; Time: 1h23m0s(total), 1h16m2s(copy); streamer: bin-log.005491:762556638; Lag: 0.08s, State: migrating; ETA: due
Copy: 14362256/14362256 100.0%; Applied: 12061; Backlog: 0/1000; Time: 1h23m30s(total), 1h16m2s(copy); streamer: bin-log.005491:830434663; Lag: 1.18s, State: migrating; ETA: due
Copy: 14362256/14362256 100.0%; Applied: 12061; Backlog: 0/1000; Time: 1h24m0s(total), 1h16m2s(copy); streamer: bin-log.005491:898773371; Lag: 0.08s, State: migrating; ETA: due
Copy: 14362256/14362256 100.0%; Applied: 12061; Backlog: 0/1000; Time: 1h24m30s(total), 1h16m2s(copy); streamer: bin-log.005491:972940697; Lag: 0.08s, State: migrating; ETA: due
Copy: 14362256/14362256 100.0%; Applied: 12061; Backlog: 0/1000; Time: 1h25m0s(total), 1h16m2s(copy); streamer: bin-log.005491:1033621038; Lag: 0.08s, State: migrating; ETA: due

And it goes on like that even when the lag is low and the cut-over could happen. I ctrl-c'ed it after 2h50m.

Any clue as to what may be happening and how to fix it?

Thanks

babinomec avatar May 19 '21 10:05 babinomec

This was fixed downstream in https://github.com/openark/gh-ost/pull/14 and submitted upstream, but it has not yet been merged there. Would you like to try the downstream release https://github.com/openark/gh-ost/releases/tag/v1.1.3?

shlomi-noach avatar May 19 '21 10:05 shlomi-noach

Awesome, thanks for the quick reply. I will try this version in a bit and let you know.

babinomec avatar May 19 '21 10:05 babinomec

It does seem to work correctly with this version. Perhaps it would have worked with the other one as well, but having the information about gh-ost's lag vs. replication lag at least makes it possible to understand why it is waiting.

Thanks!

babinomec avatar May 20 '21 08:05 babinomec

Maybe I spoke too soon; I got the same issue again today.

(The current version is 1.1.3 from the openark fork.)

Copy: 15911000/15911000 100.0%; Applied: 2233; Backlog: 0/1000; Time: 1h10m14s(total), 1h10m11s(copy); streamer: bin-log.006937:267547227; Lag: 6.63s, HeartbeatLag: 6.60s, State: migrating; ETA: due
Copy: 15911000/15911000 100.0%; Applied: 2233; Backlog: 0/1000; Time: 1h10m15s(total), 1h10m11s(copy); streamer: bin-log.006937:279534137; Lag: 6.43s, HeartbeatLag: 6.53s, State: migrating; ETA: due
Copy: 15911000/15911000 100.0%; Applied: 2237; Backlog: 0/1000; Time: 1h10m20s(total), 1h10m11s(copy); streamer: bin-log.006937:341945688; Lag: 4.93s, HeartbeatLag: 4.93s, State: migrating; ETA: due
Copy: 15911000/15911000 100.0%; Applied: 2237; Backlog: 0/1000; Time: 1h10m25s(total), 1h10m11s(copy); streamer: bin-log.006937:359880719; Lag: 4.03s, HeartbeatLag: 4.03s, State: migrating; ETA: due
2021-05-21 14:56:13 INFO Grabbing voluntary lock: gh-ost.1115392216.lock
2021-05-21 14:56:13 INFO Setting LOCK timeout as 6 seconds
2021-05-21 14:56:13 INFO Looking for magic cut-over table
2021-05-21 14:56:13 INFO Creating magic cut-over table `db`.`_table_del`
2021-05-21 14:56:13 INFO Magic cut-over table created
2021-05-21 14:56:13 INFO Locking `db`.`table`, `db`.`_table_del`
2021-05-21 14:56:13 INFO Tables locked
2021-05-21 14:56:13 INFO Session locking original & magic tables is 1115392216
2021-05-21 14:56:13 INFO Writing changelog state: AllEventsUpToLockProcessed:1621601773040517807
2021-05-21 14:56:13 INFO Waiting for events up to lock
Copy: 15911000/15911000 100.0%; Applied: 2243; Backlog: 0/1000; Time: 1h10m30s(total), 1h10m11s(copy); streamer: bin-log.006937:387809202; Lag: 4.53s, HeartbeatLag: 4.53s, State: migrating; ETA: due
2021-05-21 14:56:16 ERROR Timeout while waiting for events up to lock
2021-05-21 14:56:16 ERROR 2021-05-21 14:56:16 ERROR Timeout while waiting for events up to lock
2021-05-21 14:56:16 INFO Looking for magic cut-over table
2021-05-21 14:56:16 INFO Will now proceed to drop magic table and unlock tables
2021-05-21 14:56:16 INFO Dropping magic cut-over table
2021-05-21 14:56:16 INFO Dropping magic cut-over table
2021-05-21 14:56:16 INFO Dropping table `db`.`_table_del`
2021-05-21 14:56:19 INFO Intercepted changelog state AllEventsUpToLockProcessed
2021-05-21 14:56:19 INFO Handled changelog state AllEventsUpToLockProcessed
Copy: 15911000/15911000 100.0%; Applied: 2245; Backlog: 0/1000; Time: 1h10m35s(total), 1h10m11s(copy); streamer: bin-log.006937:413579459; Lag: 3.03s, HeartbeatLag: 3.03s, State: migrating; ETA: due
Copy: 15911000/15911000 100.0%; Applied: 2245; Backlog: 0/1000; Time: 1h10m40s(total), 1h10m11s(copy); streamer: bin-log.006937:435058078; Lag: 0.03s, HeartbeatLag: 0.03s, State: migrating; ETA: due
Copy: 15911000/15911000 100.0%; Applied: 2245; Backlog: 0/1000; Time: 1h10m45s(total), 1h10m11s(copy); streamer: bin-log.006937:447752227; Lag: 0.03s, HeartbeatLag: 0.03s, State: migrating; ETA: due
Copy: 15911000/15911000 100.0%; Applied: 2245; Backlog: 0/1000; Time: 1h10m50s(total), 1h10m11s(copy); streamer: bin-log.006937:450379630; Lag: 4.03s, HeartbeatLag: 4.03s, State: migrating; ETA: due
Copy: 15911000/15911000 100.0%; Applied: 2245; Backlog: 0/1000; Time: 1h10m55s(total), 1h10m11s(copy); streamer: bin-log.006937:471709329; Lag: 0.93s, HeartbeatLag: 0.93s, State: migrating; ETA: due
Copy: 15911000/15911000 100.0%; Applied: 2245; Backlog: 0/1000; Time: 1h11m0s(total), 1h10m11s(copy); streamer: bin-log.006937:486462162; Lag: 0.03s, HeartbeatLag: 0.03s, State: migrating; ETA: due
Copy: 15911000/15911000 100.0%; Applied: 2245; Backlog: 0/1000; Time: 1h11m5s(total), 1h10m11s(copy); streamer: bin-log.006937:499620799; Lag: 0.03s, HeartbeatLag: 0.03s, State: migrating; ETA: due
Copy: 15911000/15911000 100.0%; Applied: 2245; Backlog: 0/1000; Time: 1h11m10s(total), 1h10m11s(copy); streamer: bin-log.006937:511186597; Lag: 0.43s, HeartbeatLag: 0.43s, State: migrating; ETA: due
Copy: 15911000/15911000 100.0%; Applied: 2245; Backlog: 0/1000; Time: 1h11m15s(total), 1h10m11s(copy); streamer: bin-log.006937:516193406; Lag: 3.33s, HeartbeatLag: 3.33s, State: migrating; ETA: due
Copy: 15911000/15911000 100.0%; Applied: 2245; Backlog: 0/1000; Time: 1h11m20s(total), 1h10m11s(copy); streamer: bin-log.006937:516193406; Lag: 8.33s, HeartbeatLag: 8.33s, State: migrating; ETA: due
Copy: 15911000/15911000 100.0%; Applied: 2245; Backlog: 0/1000; Time: 1h11m25s(total), 1h10m11s(copy); streamer: bin-log.006937:533347595; Lag: 7.53s, HeartbeatLag: 7.53s, State: migrating; ETA: due
Copy: 15911000/15911000 100.0%; Applied: 2245; Backlog: 0/1000; Time: 1h11m30s(total), 1h10m11s(copy); streamer: bin-log.006937:533347595; Lag: 12.53s, HeartbeatLag: 12.53s, State: migrating; ETA: due
Copy: 15911000/15911000 100.0%; Applied: 2245; Backlog: 0/1000; Time: 1h12m0s(total), 1h10m11s(copy); streamer: bin-log.006937:659824614; Lag: 0.03s, HeartbeatLag: 0.03s, State: migrating; ETA: due
Copy: 15911000/15911000 100.0%; Applied: 2245; Backlog: 0/1000; Time: 1h12m30s(total), 1h10m11s(copy); streamer: bin-log.006937:736770776; Lag: 0.03s, HeartbeatLag: 0.03s, State: migrating; ETA: due
Copy: 15911000/15911000 100.0%; Applied: 2245; Backlog: 0/1000; Time: 1h13m0s(total), 1h10m11s(copy); streamer: bin-log.006937:797854118; Lag: 1.43s, HeartbeatLag: 1.43s, State: migrating; ETA: due
Copy: 15911000/15911000 100.0%; Applied: 2245; Backlog: 0/1000; Time: 1h13m30s(total), 1h10m11s(copy); streamer: bin-log.006937:891513490; Lag: 0.03s, HeartbeatLag: 0.03s, State: migrating; ETA: due
Copy: 15911000/15911000 100.0%; Applied: 2245; Backlog: 0/1000; Time: 1h14m0s(total), 1h10m11s(copy); streamer: bin-log.006937:977597349; Lag: 0.03s, HeartbeatLag: 0.03s, State: migrating; ETA: due
Copy: 15911000/15911000 100.0%; Applied: 2245; Backlog: 0/1000; Time: 1h14m30s(total), 1h10m11s(copy); streamer: bin-log.006937:1042223050; Lag: 0.03s, HeartbeatLag: 0.03s, State: migrating; ETA: due

What concerns me a bit is that the DROP of _table_del was still running in the background and the lock was apparently still active:

(redacted screenshot of the process list)

Unfortunately I don't have a text version of the process list, but basically:

  • gh-ost was trying to drop _table_del and was somehow waiting for a lock
  • another script (with multiple connections) was trying to update rows in a view that references the original table, and was probably blocked by the gh-ost cut-over lock (these were the only queries on these tables)

Any clue as to what may be happening? Is it possible that the lock does not work as expected with views (i.e. updating through a view that references the table being migrated)?

Thanks

babinomec avatar May 21 '21 14:05 babinomec

Got the same issue again yesterday:

Copy: 9207000/9237527 99.7%; Applied: 19; Backlog: 0/1000; Time: 35m53s(total), 35m52s(copy); streamer: bin-log.014994:469068082; Lag: 5.93s, HeartbeatLag: 5.93s, State: migrating; ETA: 7s                                                 
Copy: 9214000/9237527 99.7%; Applied: 19; Backlog: 0/1000; Time: 35m54s(total), 35m53s(copy); streamer: bin-log.014994:481593121; Lag: 5.34s, HeartbeatLag: 5.44s, State: migrating; ETA: 5s                                                 
Copy: 9222000/9237527 99.8%; Applied: 20; Backlog: 0/1000; Time: 35m55s(total), 35m54s(copy); streamer: bin-log.014994:497360102; Lag: 4.33s, HeartbeatLag: 4.43s, State: migrating; ETA: 3s                                                 
Copy: 9228000/9237527 99.9%; Applied: 20; Backlog: 0/1000; Time: 35m56s(total), 35m55s(copy); streamer: bin-log.014994:512804130; Lag: 3.33s, HeartbeatLag: 3.44s, State: migrating; ETA: 2s                                                 
Copy: 9235000/9237527 100.0%; Applied: 20; Backlog: 0/1000; Time: 35m57s(total), 35m56s(copy); streamer: bin-log.014994:527849931; Lag: 2.63s, HeartbeatLag: 2.63s, State: migrating; ETA: due                                               
2021-06-08 21:46:36 INFO Row copy complete                                                                                                                                                                                                   
Copy: 9237527/9237527 100.0%; Applied: 20; Backlog: 0/1000; Time: 35m57s(total), 35m56s(copy); streamer: bin-log.014994:531392873; Lag: 2.44s, HeartbeatLag: 2.47s, State: migrating; ETA: due                                               
2021-06-08 21:46:36 INFO Grabbing voluntary lock: gh-ost.1725224619.lock                                                                                                                                                                     
2021-06-08 21:46:36 INFO Setting LOCK timeout as 6 seconds                                                                                                                                                                                   
2021-06-08 21:46:36 INFO Looking for magic cut-over table
2021-06-08 21:46:36 INFO Creating magic cut-over table `<database>`.`_<table>_del`
2021-06-08 21:46:36 INFO Magic cut-over table created
2021-06-08 21:46:36 INFO Locking `<database>`.`<table>`, `<database>`.`_<table>_del`
2021-06-08 21:46:36 INFO Tables locked                                                                                                                                                                                                       
2021-06-08 21:46:36 INFO Session locking original & magic tables is 1725224619                                                                                                                                                               
2021-06-08 21:46:36 INFO Writing changelog state: AllEventsUpToLockProcessed:1623181596730578406                                                                                                                                             
2021-06-08 21:46:36 INFO Waiting for events up to lock                                                                                                                                                                                       
Copy: 9237527/9237527 100.0%; Applied: 20; Backlog: 0/1000; Time: 36m0s(total), 35m56s(copy); streamer: bin-log.014994:540743467; Lag: 3.93s, HeartbeatLag: 3.94s, State: migrating; ETA: due                                                
2021-06-08 21:46:39 ERROR Timeout while waiting for events up to lock                                                                                                                                                                        
2021-06-08 21:46:39 ERROR 2021-06-08 21:46:39 ERROR Timeout while waiting for events up to lock
2021-06-08 21:46:39 INFO Looking for magic cut-over table
2021-06-08 21:46:39 INFO Will now proceed to drop magic table and unlock tables
2021-06-08 21:46:39 INFO Dropping magic cut-over table                                                                                                                                                                                       
2021-06-08 21:46:39 INFO Dropping magic cut-over table
2021-06-08 21:46:39 INFO Dropping table `<database>`.`_<table>_del`
Copy: 9237527/9237527 100.0%; Applied: 20; Backlog: 0/1000; Time: 36m5s(total), 35m56s(copy); streamer: bin-log.014994:540743467; Lag: 8.93s, HeartbeatLag: 8.94s, State: migrating; ETA: due
2021-06-08 21:46:45 INFO Intercepted changelog state AllEventsUpToLockProcessed
2021-06-08 21:46:45 INFO Handled changelog state AllEventsUpToLockProcessed
Copy: 9237527/9237527 100.0%; Applied: 20; Backlog: 0/1000; Time: 36m10s(total), 35m56s(copy); streamer: bin-log.014994:583182266; Lag: 0.03s, HeartbeatLag: 0.03s, State: migrating; ETA: due
Copy: 9237527/9237527 100.0%; Applied: 20; Backlog: 0/1000; Time: 36m15s(total), 35m56s(copy); streamer: bin-log.014994:593481005; Lag: 0.53s, HeartbeatLag: 0.53s, State: migrating; ETA: due
Copy: 9237527/9237527 100.0%; Applied: 20; Backlog: 0/1000; Time: 36m20s(total), 35m56s(copy); streamer: bin-log.014994:593481005; Lag: 5.53s, HeartbeatLag: 5.53s, State: migrating; ETA: due
Copy: 9237527/9237527 100.0%; Applied: 20; Backlog: 0/1000; Time: 36m25s(total), 35m56s(copy); streamer: bin-log.014994:605978818; Lag: 4.63s, HeartbeatLag: 4.73s, State: migrating; ETA: due
Copy: 9237527/9237527 100.0%; Applied: 20; Backlog: 0/1000; Time: 36m30s(total), 35m56s(copy); streamer: bin-log.014994:605978818; Lag: 9.64s, HeartbeatLag: 9.73s, State: migrating; ETA: due

Same problem: the DROP waits for a lock. I'm not sure exactly what is happening, but it seems the locking somehow fails or times out:

2021-06-08 21:46:39 ERROR Timeout while waiting for events up to lock                                                                                                                                                                        
2021-06-08 21:46:39 ERROR 2021-06-08 21:46:39 ERROR Timeout while waiting for events up to lock

and when gh-ost tries to drop the table, the lock on it (and maybe on other tables) is still held, which blocks everything.

I have a few more migrations to run; I will try to capture a process list and the InnoDB engine status to see what is going on, something like the snippet below.
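
This is roughly what I plan to run while it is stuck (just a sketch: host and credentials are placeholders, and the sys.schema_table_lock_waits query assumes MySQL 5.7+ with the sys schema available):

HOST=master   # whichever host actually takes the cut-over locks in my setup
while true; do
  ts=$(date +%Y%m%dT%H%M%S)
  mysql -h "$HOST" -e "SHOW FULL PROCESSLIST\G"     > "processlist.$ts.txt"
  mysql -h "$HOST" -e "SHOW ENGINE INNODB STATUS\G" > "innodb_status.$ts.txt"
  # metadata-lock waits, to see who is blocking the DROP / RENAME
  mysql -h "$HOST" -e "SELECT * FROM sys.schema_table_lock_waits\G" > "mdl_waits.$ts.txt" 2>/dev/null
  sleep 5
done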

babinomec avatar Jun 09 '21 09:06 babinomec

I don't have insight right now; I'm just thinking that if you're willing to take the minor outage, you could use --cut-over=two-step.
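
That is, something like the following, keeping the connection flags from your invocation above and only switching the cut-over algorithm (a sketch; with two-step there is a short window between the two renames where the table does not exist, hence the minor outage):

./gh-ost \
  --user="ghost" --password="********" --host="slave" \
  --allow-master-master --assume-master-host="master" \
  --database="db" --table="table" --alter="alter" \
  --max-lag-millis=10000 \
  --cut-over=two-step \
  --execute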

shlomi-noach avatar Jun 09 '21 10:06 shlomi-noach

Thanks, I will have a look at how that works and the impact it may have.

What kind of information would be useful to gather if this happens again?

babinomec avatar Jun 09 '21 10:06 babinomec

What kind of information would be useful to gather if this happens again?

Definitely the log in --debug mode. Correlated database lag during the cut-over attempts? I'm not sure what locking information we might capture.
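
For example (a sketch, reusing the flags from your invocation above; --debug and --stack are gh-ost's standard verbose-logging flags, and the tee just keeps a copy of the full output):

./gh-ost \
  --user="ghost" --password="********" --host="slave" \
  --allow-master-master --assume-master-host="master" \
  --database="db" --table="table" --alter="alter" \
  --max-lag-millis=10000 \
  --verbose --debug --stack \
  --execute 2>&1 | tee gh-ost-debug.log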

shlomi-noach avatar Jun 09 '21 11:06 shlomi-noach

Here is the log with debug mode enabled:

Copy: 16343000/16345684 100.0%; Applied: 0; Backlog: 0/1000; Time: 1h47m20s(total), 1h47m13s(copy); streamer: bin-log.015425:210398388; Lag: 1.29s, HeartbeatLag: 1.29s, State: migrating; ETA: 1s
2021-06-09 18:01:06 DEBUG Issued INSERT on range: [16343001]..[16344001]; iteration: 16343; chunk-size: 1000
2021-06-09 18:01:07 DEBUG Issued INSERT on range: [16344001]..[16345001]; iteration: 16344; chunk-size: 1000
Copy: 16345000/16345684 100.0%; Applied: 0; Backlog: 0/1000; Time: 1h47m21s(total), 1h47m14s(copy); streamer: bin-log.015425:210398388; Lag: 2.29s, HeartbeatLag: 2.29s, State: migrating; ETA: due
2021-06-09 18:01:07 DEBUG Issued INSERT on range: [16345001]..[16345685]; iteration: 16345; chunk-size: 1000
2021-06-09 18:01:07 DEBUG Iteration complete: no further range to iterate
2021-06-09 18:01:07 DEBUG Getting nothing in the write queue. Sleeping...
2021-06-09 18:01:07 INFO Row copy complete
Copy: 16345684/16345684 100.0%; Applied: 0; Backlog: 0/1000; Time: 1h47m21s(total), 1h47m14s(copy); streamer: bin-log.015425:210398388; Lag: 2.49s, HeartbeatLag: 2.54s, State: migrating; ETA: due
2021-06-09 18:01:07 DEBUG checking for cut-over postpone
2021-06-09 18:01:07 DEBUG checking for cut-over postpone: complete
2021-06-09 18:01:07 INFO Grabbing voluntary lock: gh-ost.1756336414.lock
2021-06-09 18:01:07 INFO Setting LOCK timeout as 6 seconds
2021-06-09 18:01:07 INFO Looking for magic cut-over table
2021-06-09 18:01:07 INFO Creating magic cut-over table `<database>`.`_<table>_del`
2021-06-09 18:01:07 INFO Magic cut-over table created
2021-06-09 18:01:07 INFO Locking `<database>`.`<table>`, `<database>`.`_<table>_del`
2021-06-09 18:01:07 INFO Tables locked
2021-06-09 18:01:07 INFO Session locking original & magic tables is 1756336414
2021-06-09 18:01:07 INFO Writing changelog state: AllEventsUpToLockProcessed:1623254467702111840
2021-06-09 18:01:07 INFO Waiting for events up to lock
2021-06-09 18:01:08 DEBUG Getting nothing in the write queue. Sleeping...
2021-06-09 18:01:09 DEBUG Getting nothing in the write queue. Sleeping...
2021-06-09 18:01:10 DEBUG Getting nothing in the write queue. Sleeping...
2021-06-09 18:01:10 ERROR Timeout while waiting for events up to lock
2021-06-09 18:01:10 ERROR 2021-06-09 18:01:10 ERROR Timeout while waiting for events up to lock
2021-06-09 18:01:10 INFO Looking for magic cut-over table
2021-06-09 18:01:10 INFO Will now proceed to drop magic table and unlock tables
2021-06-09 18:01:10 INFO Dropping magic cut-over table
2021-06-09 18:01:10 INFO Dropping magic cut-over table
2021-06-09 18:01:10 INFO Dropping table `<database>`.`_<table>_del`
Copy: 16345684/16345684 100.0%; Applied: 0; Backlog: 0/1000; Time: 1h47m25s(total), 1h47m14s(copy); streamer: bin-log.015425:210398388; Lag: 6.29s, HeartbeatLag: 6.29s, State: migrating; ETA: due
2021-06-09 18:01:11 DEBUG Getting nothing in the write queue. Sleeping...
2021-06-09 18:01:12 DEBUG Getting nothing in the write queue. Sleeping...
2021-06-09 18:01:13 DEBUG Getting nothing in the write queue. Sleeping...
2021-06-09 18:01:14 DEBUG Getting nothing in the write queue. Sleeping...
2021-06-09 18:01:15 INFO Intercepted changelog state AllEventsUpToLockProcessed
2021-06-09 18:01:15 INFO Handled changelog state AllEventsUpToLockProcessed
Copy: 16345684/16345684 100.0%; Applied: 0; Backlog: 0/1000; Time: 1h47m30s(total), 1h47m14s(copy); streamer: bin-log.015425:260330213; Lag: 5.09s, HeartbeatLag: 5.09s, State: migrating; ETA: due
Copy: 16345684/16345684 100.0%; Applied: 0; Backlog: 0/1000; Time: 1h47m35s(total), 1h47m14s(copy); streamer: bin-log.015425:283059092; Lag: 7.59s, HeartbeatLag: 7.59s, State: migrating; ETA: due
Copy: 16345684/16345684 100.0%; Applied: 0; Backlog: 0/1000; Time: 1h47m40s(total), 1h47m14s(copy); streamer: bin-log.015425:304844327; Lag: 9.39s, HeartbeatLag: 9.39s, State: migrating; ETA: due
Copy: 16345684/16345684 100.0%; Applied: 0; Backlog: 0/1000; Time: 1h47m45s(total), 1h47m14s(copy); streamer: bin-log.015425:310089416; Lag: 13.89s, HeartbeatLag: 13.89s, State: migrating; ETA: due
Copy: 16345684/16345684 100.0%; Applied: 0; Backlog: 0/1000; Time: 1h47m50s(total), 1h47m14s(copy); streamer: bin-log.015425:310089416; Lag: 18.89s, HeartbeatLag: 18.89s, State: migrating; ETA: due
Copy: 16345684/16345684 100.0%; Applied: 0; Backlog: 0/1000; Time: 1h47m55s(total), 1h47m14s(copy); streamer: bin-log.015425:332170806; Lag: 20.89s, HeartbeatLag: 20.89s, State: migrating; ETA: due
Copy: 16345684/16345684 100.0%; Applied: 0; Backlog: 0/1000; Time: 1h48m0s(total), 1h47m14s(copy); streamer: bin-log.015425:415969736; Lag: 13.69s, HeartbeatLag: 13.69s, State: migrating; ETA: due

It seems to happen when the server starts to lag during the cut-over. I'll try to get more info next time it happens (but the lag is very variable and spiky, so I'm not sure I can get an accurate value).

babinomec avatar Jun 10 '21 09:06 babinomec