
Dropping the cut-over sentry table hangs because the lock is not released

lmtwga opened this issue 2 years ago • 5 comments

reason

  • when atomicCutOver() -> this.waitForEventsUpToLock() returns a non-nil error
    https://github.com/github/gh-ost/blob/3f44e04365725e2d52ef05bdb997077a4d07abe1/go/logic/migrator.go#L668-L670
  • the sentry table is then dropped
  • There are two places that drop the sentry table:
    • First place: atomicCutOver() -> defer func() -> dropCutOverSentryTableOnce.Do -> this.applier.DropAtomicCutOverSentryTableIfExists() https://github.com/github/gh-ost/blob/3f44e04365725e2d52ef05bdb997077a4d07abe1/go/logic/migrator.go#L644-L649
    • Second place: AtomicCutOverMagicLock() -> dropCutOverSentryTableOnce.Do -> tx.Exec(query) https://github.com/github/gh-ost/blob/3f44e04365725e2d52ef05bdb997077a4d07abe1/go/logic/applier.go#L974-L984
  • However, both places go through the same Once.Do(). Which one executes first is a matter of luck!
  • Unfortunately, when atomicCutOver() -> defer func() -> dropCutOverSentryTableOnce.Do -> this.applier.DropAtomicCutOverSentryTableIfExists() executes first, the lock on the sentry table has not yet been released. Because the connection pool allows multiple connections (db.SetMaxOpenConns(3),
    https://github.com/github/gh-ost/blob/3f44e04365725e2d52ef05bdb997077a4d07abe1/go/mysql/utils.go#L56-L57), that drop runs on a different MySQL thread and hangs until the db connection times out (see the sketch after this list).
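To make the interleaving concrete, here is a minimal, self-contained Go sketch of the race. All names are illustrative stand-ins rather than gh-ost code: a sync.Mutex stands in for the MySQL table lock held by the magic-lock session, and the two goroutines stand in for the two drop paths sharing one sync.Once.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var tableLock sync.Mutex // stands in for the table lock held by the magic-lock session
	var dropOnce sync.Once   // stands in for dropCutOverSentryTableOnce

	tableLock.Lock() // AtomicCutOverMagicLock still holds the lock that covers the sentry table

	// Drop path 1: the deferred cleanup in atomicCutOver() wins the race.
	// It issues the drop over a different connection, so it must wait for
	// the table lock -- which the other goroutine will never release.
	go func() {
		dropOnce.Do(func() {
			tableLock.Lock() // blocks: the lock is still held
			defer tableLock.Unlock()
			fmt.Println("dropped sentry table via a second connection")
		})
	}()

	time.Sleep(100 * time.Millisecond) // let drop path 1 enter Do() first

	// Drop path 2: AtomicCutOverMagicLock() would drop the sentry table on
	// the locking session itself and then unlock. But Once.Do blocks until
	// the first in-flight execution returns, so the unlock is never reached.
	unlocked := make(chan struct{})
	go func() {
		dropOnce.Do(func() {
			fmt.Println("dropped sentry table on the locking session") // never runs
		})
		tableLock.Unlock() // never reached
		close(unlocked)
	}()

	select {
	case <-unlocked:
		fmt.Println("no deadlock: the locking session dropped the table and unlocked")
	case <-time.After(2 * time.Second):
		fmt.Println("deadlock: both drop paths are stuck, like the hung DROP in the log below")
	}
}
```

Running it takes the timeout branch; in gh-ost the equivalent stall only clears when MySQL ends the blocked connection.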

behavior

  • dropping the _del table hangs until the db connection times out (screenshot omitted)

  • Partial execution log

Copy: 54059/54059 100.0%; Applied: 0; Backlog: 0/1000; Time: 9s(total), 8s(copy); streamer: mysql-bin.037492:232432138; Lag: 0.30s, HeartbeatLag: 0.80s, State: migrating; ETA: due
2022-04-07 19:25:31 INFO Row copy complete
Copy: 54059/54059 100.0%; Applied: 0; Backlog: 0/1000; Time: 9s(total), 8s(copy); streamer: mysql-bin.037492:232432138; Lag: 0.30s, HeartbeatLag: 0.80s, State: migrating; ETA: due
2022-04-07 19:25:31 INFO Grabbing voluntary lock: gh-ost.84571662.lock
2022-04-07 19:25:31 INFO Setting LOCK timeout as 6 seconds
2022-04-07 19:25:31 INFO Looking for magic cut-over table
2022-04-07 19:25:31 INFO Creating magic cut-over table `gallery`.`_time_info__tablet_1923_20220407192522_del`
2022-04-07 19:25:31 INFO Magic cut-over table created
2022-04-07 19:25:31 INFO Locking `gallery`.`time_info__tablet_1923`, `gallery`.`_time_info__tablet_1923_20220407192522_del`
2022-04-07 19:25:31 INFO Tables locked
2022-04-07 19:25:31 INFO Session locking original & magic tables is 84571662
2022-04-07 19:25:31 INFO Writing changelog state: AllEventsUpToLockProcessed:1649330731511157653
2022-04-07 19:25:31 INFO Waiting for events up to lock
Copy: 54059/54059 100.0%; Applied: 0; Backlog: 0/1000; Time: 10s(total), 8s(copy); streamer: mysql-bin.037492:232432138; Lag: 1.30s, HeartbeatLag: 1.80s, State: migrating; ETA: due
Copy: 54059/54059 100.0%; Applied: 0; Backlog: 0/1000; Time: 11s(total), 8s(copy); streamer: mysql-bin.037492:232432138; Lag: 2.30s, HeartbeatLag: 2.80s, State: migrating; ETA: due
Copy: 54059/54059 100.0%; Applied: 0; Backlog: 0/1000; Time: 12s(total), 8s(copy); streamer: mysql-bin.037492:232432138; Lag: 3.30s, HeartbeatLag: 3.80s, State: migrating; ETA: due
2022-04-07 19:25:34 ERROR Timeout while waiting for events up to lock
2022-04-07 19:25:34 ERROR 2022-04-07 19:25:34 ERROR Timeout while waiting for events up to lock
2022-04-07 19:25:34 INFO Looking for magic cut-over table
2022-04-07 19:25:34 INFO Will now proceed to drop magic table and unlock tables
2022-04-07 19:25:34 INFO Dropping magic cut-over table
2022-04-07 19:25:34 INFO Dropping magic cut-over table
2022-04-07 19:25:34 INFO Dropping table `gallery`.`_time_info__tablet_1923_20220407192522_del`
Copy: 54059/54059 100.0%; Applied: 0; Backlog: 0/1000; Time: 13s(total), 8s(copy); streamer: mysql-bin.037492:232432138; Lag: 4.30s, HeartbeatLag: 4.80s, State: migrating; ETA: due
Copy: 54059/54059 100.0%; Applied: 0; Backlog: 0/1000; Time: 14s(total), 8s(copy); streamer: mysql-bin.037492:236853172; Lag: 5.30s, HeartbeatLag: 5.80s, State: migrating; ETA: due

solution

The function that creates the sentry table should also be the one that drops it, shouldn't it? https://github.com/github/gh-ost/pull/1180
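For illustration only, here is a rough, self-contained sketch of the ownership pattern the PR argues for; the types and helpers below are made up, not gh-ost's API:

```go
package main

import "fmt"

// sentryTable and the helpers below are hypothetical stand-ins for the
// applier calls; only the ownership pattern matters.
type sentryTable struct{ name string }

func createSentry(name string) (*sentryTable, error) {
	fmt.Println("CREATE TABLE", name)
	return &sentryTable{name: name}, nil
}

func (s *sentryTable) drop() {
	fmt.Println("DROP TABLE IF EXISTS", s.name)
}

func cutOver() error {
	s, err := createSentry("_tbl_20220407_del")
	if err != nil {
		return err
	}
	// Created here, dropped here -- including on the error path -- instead
	// of letting two goroutines race through a shared sync.Once.
	defer s.drop()

	// ... lock tables, wait for events, swap tables ...
	return nil
}

func main() {
	if err := cutOver(); err != nil {
		fmt.Println("cut-over failed:", err)
	}
}
```

The point is that the create and the drop live in the same function and the drop runs on every exit path via defer, so no second code path needs to race through a shared sync.Once.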

lmtwga · Aug 26 '22 02:08

@shlomi-noach Please review the PR when you have time: https://github.com/github/gh-ost/pull/1180

lmtwga · Aug 26 '22 03:08

@shlomi-noach Please review when you have time, thank you very much!

lmtwga · Aug 29 '22 06:08

@timvaillancourt @dm-2 @jsoref Please review when you have time, thank you very much!

lmtwga · Sep 01 '22 09:09

:wave: @lmtwga, thanks for the issue and PR!

Nice debugging! Just to summarise what's happening:

  1. If the atomicCutOver function exits because of an error (i.e. before true is sent to the okToUnlockTable channel), it executes its defer block whilst the AtomicCutOverMagicLock function still holds the table lock that includes the sentry table
  2. The defer block for atomicCutOver sends true to the okToUnlockTable channel, causing a race to the dropCutOverSentryTableOnce-protected drop of the sentry table in both the atomicCutOver and AtomicCutOverMagicLock functions
  3. If atomicCutOver wins the race, the drop table is blocked due to the lock held by AtomicCutOverMagicLock, and AtomicCutOverMagicLock is blocked because sync.Once.Do(f) blocks until the first execution of f returns (which here is the drop table that is blocked) - causing a deadlock until the database connection either times out or is disconnected
  4. If AtomicCutOverMagicLock wins the race, the sentry table is dropped successfully and no deadlock occurs

The deadlock only happens after we've failed the cut-over due to an error, so it has no impact on successful gh-ost migrations.

The impact of this deadlock scenario is that it will delay the retry of failed cut-overs if atomicCutOver wins the race detailed in (2) above.
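Point (3) hinges on a property of sync.Once that is easy to miss, so here is a tiny, self-contained demonstration (not gh-ost code): a second Do() call blocks until the first call's function returns, even if that function is stuck.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var once sync.Once
	block := make(chan struct{})

	// First caller: its function is stuck, like the DROP blocked by the table lock.
	go once.Do(func() { <-block })

	time.Sleep(50 * time.Millisecond)

	// Second caller: its function will never run, and Do() itself blocks
	// until the first caller's function returns.
	done := make(chan struct{})
	go func() {
		once.Do(func() { fmt.Println("never runs") })
		close(done)
	}()

	select {
	case <-done:
		fmt.Println("second Do returned (first call must have finished)")
	case <-time.After(time.Second):
		fmt.Println("second Do is still blocked behind the stuck first call")
	}

	close(block) // unblock the first call, e.g. the connection finally times out
	<-done
	fmt.Println("second Do returned only after the first call finished")
}
```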

dm-2 · Sep 06 '22 17:09

@dm-2 Thank you very much for your reply! I totally agree with what you said above. In fact, if this situation occurs, it may affect the online service, because AtomicCutOverMagicLock locks two tables at once: https://github.com/github/gh-ost/blob/3f44e04365725e2d52ef05bdb997077a4d07abe1/go/logic/applier.go#L943-L947 The lock blocks all subsequent operations on those tables. For example, in the screenshot below (omitted here):

the operations in the green boxes are blocked.
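To show why that matters for online traffic, here is a hedged sketch (not gh-ost's actual code) of the kind of lock the magic-lock session takes, assuming the go-sql-driver/mysql driver and a placeholder DSN; the table names are taken from the log above. While this session holds the write locks, other sessions' reads and writes on the original table queue behind them, which is what stalls until the connection times out.

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/go-sql-driver/mysql" // assumed driver
)

func main() {
	// Placeholder DSN; adjust for a real environment.
	db, err := sql.Open("mysql", "user:pass@tcp(127.0.0.1:3306)/gallery")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// A transaction pins a single connection, which matters because table
	// locks in MySQL are scoped to the session that took them.
	tx, err := db.Begin()
	if err != nil {
		log.Fatal(err)
	}
	defer tx.Rollback()

	// Both the original table and the sentry ("magic") table are locked in
	// one statement. Until UNLOCK TABLES runs on this same session, every
	// other session's DML on `time_info__tablet_1923` queues up behind it.
	if _, err := tx.Exec(
		"LOCK TABLES `time_info__tablet_1923` WRITE, `_time_info__tablet_1923_20220407192522_del` WRITE",
	); err != nil {
		log.Fatal(err)
	}

	// In the deadlock described above, this unlock (or the drop on this
	// session) is what never happens until the connection times out.
	if _, err := tx.Exec("UNLOCK TABLES"); err != nil {
		log.Fatal(err)
	}
}
```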

lmtwga · Sep 16 '22 09:09

> @dm-2 Thank you very much for your reply! I totally agree with what you said above. In fact, if this situation occurs, it may affect the online service, because AtomicCutOverMagicLock locks two tables at once:
>
> https://github.com/github/gh-ost/blob/3f44e04365725e2d52ef05bdb997077a4d07abe1/go/logic/applier.go#L943-L947
>
> The lock blocks all subsequent operations on those tables; the operations in the green boxes of the screenshot are blocked.

Good point! I'll take a look at the PR again :+1:

dm-2 · Oct 21 '22 16:10

Fixed by https://github.com/github/gh-ost/pull/1180

meiji163 · Mar 16 '23 16:03