gh-ost
Because the table lock is not released, dropping the cut-over sentry table hangs
reason
- When atomicCutOver() -> this.waitForEventsUpToLock() returns a non-nil error
https://github.com/github/gh-ost/blob/3f44e04365725e2d52ef05bdb997077a4d07abe1/go/logic/migrator.go#L668-L670 it will then drop the sentry table.
- There are two places that drop the sentry table:
- First place: atomicCutOver() -> defer func() -> dropCutOverSentryTableOnce.Do -> this.applier.DropAtomicCutOverSentryTableIfExists() https://github.com/github/gh-ost/blob/3f44e04365725e2d52ef05bdb997077a4d07abe1/go/logic/migrator.go#L644-L649
- Second place: AtomicCutOverMagicLock() -> dropCutOverSentryTableOnce.Do -> tx.Exec(query) https://github.com/github/gh-ost/blob/3f44e04365725e2d52ef05bdb997077a4d07abe1/go/logic/applier.go#L974-L984
- However, both places go through the same Once.Do(); which one runs first is purely a matter of timing!
- Unfortunately, when the first place (atomicCutOver() -> defer func() -> dropCutOverSentryTableOnce.Do -> this.applier.DropAtomicCutOverSentryTableIfExists()) runs first, the lock on the sentry table has not yet been released, and because the connection pool allows multiple connections (db.SetMaxOpenConns(3))
https://github.com/github/gh-ost/blob/3f44e04365725e2d52ef05bdb997077a4d07abe1/go/mysql/utils.go#L56-L57 the DROP is issued on a different MySQL connection than the one holding the lock, so it hangs until the DB connection times out (see the sketch after this list).
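A minimal, runnable sketch (plain Go, not gh-ost code; the sleep stands in for the MySQL lock wait) of why the losing Once.Do() caller cannot proceed: sync.Once.Do(f) makes every other caller wait until the first invocation of f has returned, so if the winner's DROP is stuck behind the table lock, the code path that could release that lock is stuck too.

```go
// Sketch only: illustrates the sync.Once behaviour behind the hang; names are illustrative.
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var dropSentryOnce sync.Once
	var wg sync.WaitGroup

	// Winner: stands in for atomicCutOver()'s deferred drop, issued on a *different*
	// connection than the one holding LOCK TABLES, so it blocks (simulated by a sleep).
	wg.Add(1)
	go func() {
		defer wg.Done()
		dropSentryOnce.Do(func() {
			fmt.Println("migrator: DROP TABLE sent on a new connection; blocked by the table lock...")
			time.Sleep(3 * time.Second) // stand-in for waiting on the DB connection timeout
			fmt.Println("migrator: DROP finally returned")
		})
	}()

	// Loser: stands in for AtomicCutOverMagicLock()'s drop on the lock-holding connection.
	// It cannot even start until the winner's function returns, so it cannot release the lock.
	wg.Add(1)
	go func() {
		defer wg.Done()
		time.Sleep(100 * time.Millisecond) // lose the race
		start := time.Now()
		dropSentryOnce.Do(func() {
			fmt.Println("applier: this never runs; Do() already has a first caller")
		})
		fmt.Printf("applier: Do() returned after %v without running its function\n",
			time.Since(start).Round(time.Second))
	}()

	wg.Wait()
}
```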
behavior
- Dropping the _del table hangs until the DB connection times out.
- Partial execution log:
Copy: 54059/54059 100.0%; Applied: 0; Backlog: 0/1000; Time: 9s(total), 8s(copy); streamer: mysql-bin.037492:232432138; Lag: 0.30s, HeartbeatLag: 0.80s, State: migrating; ETA: due
2022-04-07 19:25:31 INFO Row copy complete
Copy: 54059/54059 100.0%; Applied: 0; Backlog: 0/1000; Time: 9s(total), 8s(copy); streamer: mysql-bin.037492:232432138; Lag: 0.30s, HeartbeatLag: 0.80s, State: migrating; ETA: due
2022-04-07 19:25:31 INFO Grabbing voluntary lock: gh-ost.84571662.lock
2022-04-07 19:25:31 INFO Setting LOCK timeout as 6 seconds
2022-04-07 19:25:31 INFO Looking for magic cut-over table
2022-04-07 19:25:31 INFO Creating magic cut-over table `gallery`.`_time_info__tablet_1923_20220407192522_del`
2022-04-07 19:25:31 INFO Magic cut-over table created
2022-04-07 19:25:31 INFO Locking `gallery`.`time_info__tablet_1923`, `gallery`.`_time_info__tablet_1923_20220407192522_del`
2022-04-07 19:25:31 INFO Tables locked
2022-04-07 19:25:31 INFO Session locking original & magic tables is 84571662
2022-04-07 19:25:31 INFO Writing changelog state: AllEventsUpToLockProcessed:1649330731511157653
2022-04-07 19:25:31 INFO Waiting for events up to lock
Copy: 54059/54059 100.0%; Applied: 0; Backlog: 0/1000; Time: 10s(total), 8s(copy); streamer: mysql-bin.037492:232432138; Lag: 1.30s, HeartbeatLag: 1.80s, State: migrating; ETA: due
Copy: 54059/54059 100.0%; Applied: 0; Backlog: 0/1000; Time: 11s(total), 8s(copy); streamer: mysql-bin.037492:232432138; Lag: 2.30s, HeartbeatLag: 2.80s, State: migrating; ETA: due
Copy: 54059/54059 100.0%; Applied: 0; Backlog: 0/1000; Time: 12s(total), 8s(copy); streamer: mysql-bin.037492:232432138; Lag: 3.30s, HeartbeatLag: 3.80s, State: migrating; ETA: due
2022-04-07 19:25:34 ERROR Timeout while waiting for events up to lock
2022-04-07 19:25:34 ERROR 2022-04-07 19:25:34 ERROR Timeout while waiting for events up to lock
2022-04-07 19:25:34 INFO Looking for magic cut-over table
2022-04-07 19:25:34 INFO Will now proceed to drop magic table and unlock tables
2022-04-07 19:25:34 INFO Dropping magic cut-over table
2022-04-07 19:25:34 INFO Dropping magic cut-over table
2022-04-07 19:25:34 INFO Dropping table `gallery`.`_time_info__tablet_1923_20220407192522_del`
Copy: 54059/54059 100.0%; Applied: 0; Backlog: 0/1000; Time: 13s(total), 8s(copy); streamer: mysql-bin.037492:232432138; Lag: 4.30s, HeartbeatLag: 4.80s, State: migrating; ETA: due
Copy: 54059/54059 100.0%; Applied: 0; Backlog: 0/1000; Time: 14s(total), 8s(copy); streamer: mysql-bin.037492:236853172; Lag: 5.30s, HeartbeatLag: 5.80s, State: migrating; ETA: due
solution
The function that creates the sentry table should also be the one that drops it, right? See https://github.com/github/gh-ost/pull/1180
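To illustrate the idea only (a hedged sketch, not the actual code in the PR; the DSN, schema and table names are placeholders): the session that took the table lock drops the sentry table itself, before UNLOCK TABLES, so the drop can never queue behind its own lock.

```go
// Sketch only: keep create/lock/drop of the sentry table on one session.
package main

import (
	"context"
	"database/sql"
	"log"

	_ "github.com/go-sql-driver/mysql"
)

// dropSentryOnLockingSession drops the sentry table from the same session that
// holds LOCK TABLES, so the DROP never waits on that session's own lock.
func dropSentryOnLockingSession(ctx context.Context, db *sql.DB) error {
	conn, err := db.Conn(ctx) // pin one connection: LOCK TABLES is per-session
	if err != nil {
		return err
	}
	defer conn.Close()

	if _, err := conn.ExecContext(ctx,
		"LOCK TABLES `gallery`.`time_info` WRITE, `gallery`.`_time_info_del` WRITE"); err != nil {
		return err
	}
	// Cleanup path: drop the sentry table on the lock-holding session...
	if _, err := conn.ExecContext(ctx, "DROP TABLE IF EXISTS `gallery`.`_time_info_del`"); err != nil {
		return err
	}
	// ...and only then release the lock.
	_, err = conn.ExecContext(ctx, "UNLOCK TABLES")
	return err
}

func main() {
	db, err := sql.Open("mysql", "user:pass@tcp(127.0.0.1:3306)/gallery") // placeholder DSN
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	if err := dropSentryOnLockingSession(context.Background(), db); err != nil {
		log.Fatal(err)
	}
}
```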
@shlomi-noach Please review the PR when you have time: https://github.com/github/gh-ost/pull/1180
@shlomi-noach Please review when you have time, thank you very much!
@timvaillancourt @dm-2 @jsoref Please review when you have time, thank you very much!
:wave: @lmtwga, thanks for the issue and PR!
Nice debugging! Just to summarise what's happening:
1. If the `atomicCutOver` function exits because of an error (i.e. before `true` is sent to the `okToUnlockTable` channel), it executes its `defer` block whilst the `AtomicCutOverMagicLock` function still holds the table lock that includes the sentry table
2. The `defer` block for `atomicCutOver` sends `true` to the `okToUnlockTable` channel, causing a race to the `dropCutOverSentryTableOnce`-protected drop of the sentry table in both the `atomicCutOver` and `AtomicCutOverMagicLock` functions
3. If `atomicCutOver` wins the race, the drop table is blocked due to the lock held by `AtomicCutOverMagicLock`, and `AtomicCutOverMagicLock` is blocked because `sync.Once.Do(f)` blocks until the first execution of `f` completes successfully (which is the drop table that is blocked) - causing a deadlock until either the database connection times out or is disconnected
4. If `AtomicCutOverMagicLock` wins the race, the sentry table is dropped successfully and no deadlock occurs
The deadlock only happens after we've failed the cut-over due to an error, so it has no impact on successful gh-ost migrations.
The impact of this deadlock scenario is that it will delay the retry of failed cut-overs if `atomicCutOver` wins the race detailed in (2) above.
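For anyone who wants to reproduce the blocking half of step (3) outside gh-ost, here is a hedged sketch (it assumes a reachable MySQL server; the DSN and table name are placeholders): a DROP TABLE issued from a second pooled connection waits behind the table lock that another session still holds via LOCK TABLES ... WRITE.

```go
// Sketch only: "DROP TABLE from another connection blocks behind LOCK TABLES".
// Requires a reachable MySQL server; the DSN and table name are placeholders.
package main

import (
	"context"
	"database/sql"
	"log"
	"time"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	db, err := sql.Open("mysql", "user:pass@tcp(127.0.0.1:3306)/test") // placeholder DSN
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	ctx := context.Background()

	// Session 1: stands in for AtomicCutOverMagicLock's connection, which holds the lock.
	locker, err := db.Conn(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer locker.Close()
	if _, err := locker.ExecContext(ctx, "CREATE TABLE IF NOT EXISTS sentry_del (id INT)"); err != nil {
		log.Fatal(err)
	}
	if _, err := locker.ExecContext(ctx, "LOCK TABLES sentry_del WRITE"); err != nil {
		log.Fatal(err)
	}

	// Session 2: stands in for the migrator's deferred drop, which goes out on a different
	// pooled connection and therefore queues behind the lock held by session 1.
	dropCtx, cancel := context.WithTimeout(ctx, 5*time.Second)
	defer cancel()
	start := time.Now()
	_, err = db.ExecContext(dropCtx, "DROP TABLE IF EXISTS sentry_del")
	log.Printf("drop on second connection returned after %v: %v",
		time.Since(start).Round(time.Second), err)

	// Release the lock and clean up.
	if _, err := locker.ExecContext(ctx, "UNLOCK TABLES"); err != nil {
		log.Fatal(err)
	}
	locker.ExecContext(ctx, "DROP TABLE IF EXISTS sentry_del")
}
```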
@dm-2
Thank you very much for your reply! I totally agree with what you said above.
In fact, if this situation occurs, it may affect the online service, because AtomicCutOverMagicLock locks two tables at once:
https://github.com/github/gh-ost/blob/3f44e04365725e2d52ef05bdb997077a4d07abe1/go/logic/applier.go#L943-L947
The lock blocks all subsequent operations on those tables. For example, as shown in the figure below, the operations in the green boxes are blocked.
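While a cut-over is stuck like this, a small helper such as the hedged sketch below (placeholder DSN; it reads information_schema.PROCESSLIST, available on MySQL 5.7/8.0) can show which application queries are waiting behind the table lock.

```go
// Sketch only: list sessions waiting on table/metadata locks, which is what application
// queries against the original table look like while the cut-over lock is held.
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	db, err := sql.Open("mysql", "user:pass@tcp(127.0.0.1:3306)/information_schema") // placeholder DSN
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	rows, err := db.Query(
		`SELECT ID, USER, TIME, STATE, IFNULL(INFO, '')
		   FROM information_schema.PROCESSLIST
		  WHERE STATE LIKE '%lock%'
		  ORDER BY TIME DESC`)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var id, secs int64
		var user, state, query string
		if err := rows.Scan(&id, &user, &secs, &state, &query); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("thread %d (%s) waiting %ds in %q: %s\n", id, user, secs, state, query)
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
}
```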
Good point! I'll take a look at the PR again :+1:
Fixed by https://github.com/github/gh-ost/pull/1180