gh-ost
ERROR Timeout while waiting for events up to lock
We have a pretty large table that is busy at times, where we try to add a column (`xxx int null`). Everything seems fine until the cut-over phase: it gets an error and then retries for some time, but never succeeds. I have checked other issues here that mention this error, but haven't found anything that helps.
The binlog is not stuck; it keeps applying and rotating.
This is on RDS Aurora with a standard setup. We haven't changed any parameters, but we are starting to think that some DB parameter might need to be changed; we're just not sure which one.
We are running the latest master and have tried various command flags, for example `--cut-over-exponential-backoff`, and increasing `--cut-over-lock-timeout-seconds` and `--default-retries`.
This is the full command right now:
```
gh-ost \
  --user=$user \
  --password=$password \
  --host=$host \
  --database=$database \
  --table=$table \
  --alter="$alter" \
  --max-load=Threads_running=25 \
  --critical-load=Threads_running=1000 \
  --chunk-size=1000 \
  --max-lag-millis=1500 \
  --verbose \
  --alter="engine=innodb" \
  --allow-master-master \
  --exact-rowcount \
  --concurrent-rowcount \
  --default-retries=120 \
  --allow-on-master \
  --initially-drop-ghost-table \
  --initially-drop-old-table \
  --cut-over-lock-timeout-seconds=10 \
  --assume-rbr \
  --cut-over-exponential-backoff \
  --execute
```
Copy: 79876601/79876601 100.0%; Applied: 16344; Backlog: 0/1000; Time: 3h42m10s(total), 2h20m24s(copy); streamer: mysql-bin-changelog.013410:1363370; Lag: 0.01s, State: migrating; ETA: due
2019-11-06 14:04:27 ERROR Timeout while waiting for events up to lock
2019-11-06 14:04:27 ERROR 2019-11-06 14:04:27 ERROR Timeout while waiting for events up to lock
2019-11-06 14:04:27 INFO Looking for magic cut-over table
2019-11-06 14:04:27 INFO Will now proceed to drop magic table and unlock tables
2019-11-06 14:04:27 INFO Dropping magic cut-over table
2019-11-06 14:04:27 INFO Dropping magic cut-over table
2019-11-06 14:04:27 INFO Dropping table `xxx`.`_xxx_del`
2019-11-06 14:04:27 INFO Table dropped
2019-11-06 14:04:27 INFO Releasing lock from `xxx`.`xxx`, `xxx`.`_xxx_del`
2019-11-06 14:04:27 INFO Tables unlocked
2019-11-06 14:04:27 INFO Grabbing voluntary lock: gh-ost.428716346.lock
2019-11-06 14:04:27 INFO Setting LOCK timeout as 20 seconds
2019-11-06 14:04:27 INFO Looking for magic cut-over table
2019-11-06 14:04:27 INFO Creating magic cut-over table `xxx`.`_xxx_del`
2019-11-06 14:04:27 INFO Magic cut-over table created
2019-11-06 14:04:27 INFO Locking `xxx`.`xxx`, `xxx`.`_xxx_del`
2019-11-06 14:04:27 INFO Tables locked
2019-11-06 14:04:27 INFO Session locking original & magic tables is 428716346
2019-11-06 14:04:27 INFO Writing changelog state: AllEventsUpToLockProcessed:1573049067764852856
2019-11-06 14:04:27 INFO Waiting for events up to lock
Copy: 79876601/79876601 100.0%; Applied: 16344; Backlog: 0/1000; Time: 3h42m15s(total), 2h20m24s(copy); streamer: mysql-bin-changelog.013410:16394106; Lag: 0.01s, State: migrating; ETA: due
Copy: 79876601/79876601 100.0%; Applied: 16344; Backlog: 0/1000; Time: 3h42m20s(total), 2h20m24s(copy); streamer: mysql-bin-changelog.013410:32408062; Lag: 0.01s, State: migrating; ETA: due
2019-11-06 14:04:37 ERROR Timeout while waiting for events up to lock
2019-11-06 14:04:37 ERROR 2019-11-06 14:04:37 ERROR Timeout while waiting for events up to lock
2019-11-06 14:04:37 INFO Looking for magic cut-over table
2019-11-06 14:04:37 INFO Will now proceed to drop magic table and unlock tables
2019-11-06 14:04:37 INFO Dropping magic cut-over table
2019-11-06 14:04:37 INFO Dropping magic cut-over table
2019-11-06 14:04:37 INFO Dropping table `xxx`.`_xxx_del`
2019-11-06 14:04:37 INFO Table dropped
2019-11-06 14:04:37 INFO Removing socket file: /tmp/gh-ost.xxx.xxx.sock
2019-11-06 14:04:37 INFO Releasing lock from `xxx`.`xxx`, `xxx`.`_xxx_del`
2019-11-06 14:04:37 FATAL 2019-11-06 14:04:37 ERROR Timeout while waiting for events up to lock
This is what the `_ghc` table looks like, if that helps:
So if I'm reading the code correctly, it's waiting for that `AllEventsUpToLockProcessed` entry (the row with id 2) to appear in the binlogs, but it seems it never ends up there for some reason?
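To spell out what I think is happening (this is just my reading of the code; the function names, table name and channel wiring below are illustrative, not the real gh-ost internals):

```go
// Rough sketch of the cut-over wait as I understand it. All names here are
// illustrative only, not the actual gh-ost functions or schema.
package sketch

import (
	"database/sql"
	"fmt"
	"time"

	_ "github.com/go-sql-driver/mysql"
)

// waitForEventsUpToLock writes a changelog marker on the master and waits
// until the binlog applier reports having seen it come back through the
// binlog stream, or times out -- which is when "Timeout while waiting for
// events up to lock" would be logged.
func waitForEventsUpToLock(db *sql.DB, seenFromBinlog <-chan string, timeout time.Duration) error {
	marker := fmt.Sprintf("AllEventsUpToLockProcessed:%d", time.Now().UnixNano())
	// id=2 is the changelog row I can see in my _ghc table; table name is a placeholder.
	if _, err := db.Exec(
		"insert into `mydb`.`_mytable_ghc` (id, hint, value) values (2, 'state', ?) on duplicate key update value=values(value)",
		marker,
	); err != nil {
		return err
	}
	deadline := time.After(timeout)
	for {
		select {
		case v := <-seenFromBinlog: // values the binlog streamer read back from _ghc
			if v == marker {
				return nil // all events up to the lock were applied; safe to swap tables
			}
		case <-deadline:
			return fmt.Errorf("Timeout while waiting for events up to lock")
		}
	}
}
```

If that marker never arrives on the binlog stream, the timeout above is exactly what you'd see.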
any solution?
Nope, tried a LOT of different things, but always the same error :(
me too :(
Very sorry for the late response. Running with `--debug --stack` will provide additional useful information, if that is possible.
Okay, running this now. Will get back in ~2 hrs when it's at the cut-over phase.
This is the log when it tries to do the cut-over; this repeats some number of times (I think I set the retry count to 120). Does this help, or do you need logs from an earlier stage?
2019-12-09 12:19:51 DEBUG Getting nothing in the write queue. Sleeping...
2019-12-09 12:19:52 DEBUG Getting nothing in the write queue. Sleeping...
2019-12-09 12:19:53 DEBUG Getting nothing in the write queue. Sleeping...
2019-12-09 12:19:54 DEBUG Getting nothing in the write queue. Sleeping...
2019-12-09 12:19:55 DEBUG Getting nothing in the write queue. Sleeping...
2019-12-09 12:19:56 DEBUG Getting nothing in the write queue. Sleeping...
2019-12-09 12:19:57 DEBUG Getting nothing in the write queue. Sleeping...
2019-12-09 12:19:58 DEBUG Getting nothing in the write queue. Sleeping...
2019-12-09 12:19:59 DEBUG Getting nothing in the write queue. Sleeping...
2019-12-09 12:20:00 DEBUG Getting nothing in the write queue. Sleeping...
2019-12-09 12:20:01 DEBUG Getting nothing in the write queue. Sleeping...
2019-12-09 12:20:02 DEBUG Getting nothing in the write queue. Sleeping...
2019-12-09 12:20:03 DEBUG Getting nothing in the write queue. Sleeping...
2019-12-09 12:20:04 DEBUG Getting nothing in the write queue. Sleeping...
2019-12-09 12:20:05 DEBUG checking for cut-over postpone
2019-12-09 12:20:05 DEBUG checking for cut-over postpone: complete
2019-12-09 12:20:05 INFO Grabbing voluntary lock: gh-ost.451914278.lock
2019-12-09 12:20:05 INFO Setting LOCK timeout as 20 seconds
2019-12-09 12:20:05 INFO Looking for magic cut-over table
2019-12-09 12:20:05 INFO Creating magic cut-over table `xxx`.`_xxx_del`
2019-12-09 12:20:05 INFO Magic cut-over table created
2019-12-09 12:20:05 INFO Locking `xxx`.`xxx`, `xxx`.`_xxx_del`
2019-12-09 12:20:05 INFO Tables locked
2019-12-09 12:20:05 INFO Session locking original & magic tables is 451914278
2019-12-09 12:20:05 INFO Writing changelog state: AllEventsUpToLockProcessed:1575894005304855457
2019-12-09 12:20:05 INFO Waiting for events up to lock
2019-12-09 12:20:05 DEBUG Getting nothing in the write queue. Sleeping...
Copy: 80807022/80807022 100.0%; Applied: 3; Backlog: 0/1000; Time: 3h15m50s(total), 2h13m16s(copy); streamer: mysql-bin-changelog.015006:71801235; Lag: 0.01s, State: migrating; ETA: due
2019-12-09 12:20:06 DEBUG Getting nothing in the write queue. Sleeping...
2019-12-09 12:20:07 DEBUG Getting nothing in the write queue. Sleeping...
2019-12-09 12:20:08 DEBUG Getting nothing in the write queue. Sleeping...
2019-12-09 12:20:09 DEBUG Getting nothing in the write queue. Sleeping...
2019-12-09 12:20:10 DEBUG Getting nothing in the write queue. Sleeping...
Copy: 80807022/80807022 100.0%; Applied: 3; Backlog: 0/1000; Time: 3h15m55s(total), 2h13m16s(copy); streamer: mysql-bin-changelog.015006:81706458; Lag: 0.11s, State: migrating; ETA: due
2019-12-09 12:20:11 DEBUG Getting nothing in the write queue. Sleeping...
2019-12-09 12:20:12 DEBUG Getting nothing in the write queue. Sleeping...
2019-12-09 12:20:13 DEBUG Getting nothing in the write queue. Sleeping...
2019-12-09 12:20:14 DEBUG Getting nothing in the write queue. Sleeping...
2019-12-09 12:20:15 ERROR Timeout while waiting for events up to lock
2019-12-09 12:20:15 ERROR 2019-12-09 12:20:15 ERROR Timeout while waiting for events up to lock
goroutine 1 [running]:
runtime/debug.Stack(0x5f, 0x100, 0xc0000ce0a0)
/usr/local/go/src/runtime/debug/stack.go:24 +0x9d
runtime/debug.PrintStack()
/usr/local/go/src/runtime/debug/stack.go:16 +0x22
github.com/github/gh-ost/vendor/github.com/outbrain/golib/log.logErrorEntry(0x2, 0x8a5f00, 0xc00007d7e0, 0xc0000c79d8, 0x1)
/home/ec2-user/gh-ost/.gopath/src/github.com/github/gh-ost/vendor/github.com/outbrain/golib/log/log.go:178 +0xe7
github.com/github/gh-ost/vendor/github.com/outbrain/golib/log.Errore(...)
/home/ec2-user/gh-ost/.gopath/src/github.com/github/gh-ost/vendor/github.com/outbrain/golib/log/log.go:224
github.com/github/gh-ost/go/logic.(*Migrator).atomicCutOver(0xc000124000, 0x0, 0x0)
/home/ec2-user/gh-ost/.gopath/src/github.com/github/gh-ost/go/logic/migrator.go:634 +0x9d3
github.com/github/gh-ost/go/logic.(*Migrator).cutOver(0xc000124000, 0xd4bee0, 0xd4bee0)
/home/ec2-user/gh-ost/.gopath/src/github.com/github/gh-ost/go/logic/migrator.go:524 +0x26b
github.com/github/gh-ost/go/logic.(*Migrator).retryOperationWithExponentialBackoff(0xc000124000, 0xc0002173e0, 0x0, 0x0, 0x0, 0x77ff01, 0xc0002173e0)
/home/ec2-user/gh-ost/.gopath/src/github.com/github/gh-ost/go/logic/migrator.go:169 +0xde
github.com/github/gh-ost/go/logic.(*Migrator).Migrate(0xc000124000, 0x0, 0x0)
/home/ec2-user/gh-ost/.gopath/src/github.com/github/gh-ost/go/logic/migrator.go:413 +0x8f8
main.main()
/home/ec2-user/gh-ost/.gopath/src/github.com/github/gh-ost/go/cmd/gh-ost/main.go:278 +0x2725
2019-12-09 12:20:15 INFO Looking for magic cut-over table
2019-12-09 12:20:15 INFO Will now proceed to drop magic table and unlock tables
2019-12-09 12:20:15 INFO Dropping magic cut-over table
2019-12-09 12:20:15 INFO Releasing lock from `xxx`.`xxx`, `xxx`.`_xxx_del`
2019-12-09 12:20:15 INFO Tables unlocked
2019-12-09 12:20:15 DEBUG Getting nothing in the write queue. Sleeping...
Copy: 80807022/80807022 100.0%; Applied: 3; Backlog: 0/1000; Time: 3h16m0s(total), 2h13m16s(copy); streamer: mysql-bin-changelog.015006:92163787; Lag: 0.01s, State: migrating; ETA: due
Thank you for the logs, they seem to be enough. This is weird:
2019-12-09 12:20:05 INFO Writing changelog state: AllEventsUpToLockProcessed:1575894005304855457
2019-12-09 12:20:05 INFO Waiting for events up to lock
2019-12-09 12:20:05 DEBUG Getting nothing in the write queue. Sleeping...
Copy: 80807022/80807022 100.0%; Applied: 3; Backlog: 0/1000; Time: 3h15m50s(total), 2h13m16s(copy); streamer: mysql-bin-changelog.015006:71801235; Lag: 0.01s, State: migrating; ETA: due
2019-12-09 12:20:06 DEBUG Getting nothing in the write queue. Sleeping...
2019-12-09 12:20:07 DEBUG Getting nothing in the write queue. Sleeping...
2019-12-09 12:20:08 DEBUG Getting nothing in the write queue. Sleeping...
2019-12-09 12:20:09 DEBUG Getting nothing in the write queue. Sleeping...
2019-12-09 12:20:10 DEBUG Getting nothing in the write queue. Sleeping...
Copy: 80807022/80807022 100.0%; Applied: 3; Backlog: 0/1000; Time: 3h15m55s(total), 2h13m16s(copy); streamer: mysql-bin-changelog.015006:81706458; Lag: 0.11s, State: migrating; ETA: due
2019-12-09 12:20:11 DEBUG Getting nothing in the write queue. Sleeping...
2019-12-09 12:20:12 DEBUG Getting nothing in the write queue. Sleeping...
2019-12-09 12:20:13 DEBUG Getting nothing in the write queue. Sleeping...
2019-12-09 12:20:14 DEBUG Getting nothing in the write queue. Sleeping...
What this means is that gh-ost wrote an entry to the `*_ghc` table on the master, and never received the change on the binary log.
Are there any replication filters on this cluster?
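(If it helps to check quickly: classic replication filters would show up in `SHOW MASTER STATUS` / `SHOW SLAVE STATUS`. Here is a standalone diagnostic sketch, not part of gh-ost; the DSN is a placeholder.)

```go
// Standalone diagnostic: print any binlog/replication filter columns that are
// set. The DSN is a placeholder; point it at the host gh-ost connects to.
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	db, err := sql.Open("mysql", "user:password@tcp(host:3306)/")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Columns that indicate filtering on the master or on a replica.
	filterCols := map[string]bool{
		"Binlog_Do_DB": true, "Binlog_Ignore_DB": true,
		"Replicate_Do_DB": true, "Replicate_Ignore_DB": true,
		"Replicate_Do_Table": true, "Replicate_Ignore_Table": true,
		"Replicate_Wild_Do_Table": true, "Replicate_Wild_Ignore_Table": true,
	}
	for _, q := range []string{"SHOW MASTER STATUS", "SHOW SLAVE STATUS"} {
		rows, err := db.Query(q)
		if err != nil {
			log.Printf("%s: %v", q, err)
			continue
		}
		cols, _ := rows.Columns()
		vals := make([]sql.RawBytes, len(cols))
		ptrs := make([]interface{}, len(cols))
		for i := range vals {
			ptrs[i] = &vals[i]
		}
		for rows.Next() {
			if err := rows.Scan(ptrs...); err != nil {
				log.Fatal(err)
			}
			for i, c := range cols {
				if filterCols[c] && len(vals[i]) > 0 {
					fmt.Printf("%s -> %s = %s\n", q, c, vals[i])
				}
			}
		}
		rows.Close()
	}
}
```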
Hmm, I googled a bit on binlog filtering on AWS, and it seems it's enabled by default:
The binlog filtering feature automatically reduces network bandwidth for replication messages. Because the Aurora Replicas don't use the binlog information that is included in the replication messages, that data is omitted from the messages sent to those nodes. You control this feature by changing the aurora_enable_repl_bin_log_filtering parameter. This parameter is on by default. Because this optimization is intended to be transparent, you might turn off this setting only during diagnosis or troubleshooting for issues related to replication. For example, you can do so to match the behavior of an older Aurora MySQL cluster where this feature was not available.
https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/AuroraMySQL.Replication.html#AuroraMySQL.Replication.Performance
Will try tomorrow with this off and see what the result is.
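For anyone else checking their own cluster: here's a quick way to see what is actually in effect, assuming the Aurora settings are exposed as server variables on your version (if nothing shows up, inspect the DB cluster parameter group in the RDS console instead; the DSN below is a placeholder):

```go
// Standalone check: list aurora_% server variables, which may include
// aurora_enable_repl_bin_log_filtering depending on the Aurora version.
// If nothing is returned, check the DB cluster parameter group instead.
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/go-sql-driver/mysql"
)

func main() {
	db, err := sql.Open("mysql", "user:password@tcp(aurora-endpoint:3306)/") // placeholder DSN
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	rows, err := db.Query("SHOW GLOBAL VARIABLES LIKE 'aurora%'")
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var name, value string
		if err := rows.Scan(&name, &value); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("%s = %s\n", name, value)
	}
}
```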
Unfortunately, after setting `aurora_enable_repl_bin_log_filtering` to 0, it is still the exact same logs and error :( Any more ideas? :)
No wait, it actually worked :) I spoke too soon. After like 100 retries I got this:
2019-12-10 10:32:23 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:23 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:23 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:23 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:23 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:23 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:23 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:23 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:23 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:23 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:23 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:23 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:23 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:23 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:23 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:24 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:24 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:24 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:24 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:24 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:24 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:24 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:24 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:24 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:24 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:24 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:24 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:24 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:24 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:24 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:24 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:24 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:24 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:24 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:24 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:24 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:24 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:24 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:24 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:24 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:24 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:24 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:24 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:24 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:24 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:25 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:25 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:25 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:25 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:25 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:25 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:25 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:25 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:25 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:25 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:25 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:25 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:25 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:25 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:25 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:25 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:25 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:25 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:25 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:25 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:25 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:25 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:25 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:25 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:25 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:25 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:25 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:25 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:26 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:26 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:26 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:26 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:26 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:26 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:26 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:26 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:26 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:26 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:26 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:26 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:26 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:26 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:26 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:26 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:26 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:26 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:26 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:26 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:26 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:26 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:26 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:26 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:26 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:26 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:27 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:27 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:27 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:27 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:27 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:27 INFO Handled changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:27 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-10 10:32:25 INFO Handled changelog state AllEventsUpToLockProcessed
And it did the cut-over
Are you running directly on the master (the `--host` flag)? If not, can you do that?
Ha ok!
Haha :) Finally, nice catch about that binlog filtering thing @shlomi-noach :) Should maybe be added to the RDS docs?
Should maybe be added to the RDS docs?
@knepe sure, it should. Now, I am not using RDS/Aurora myself and have little understanding of the configuration involved. Would you be willing to create a pull request?
Yes, sure I'll make a PR :) Closing this and thanks for your help!
Re-opening this, because unfortunately I've had only random success with this, so I'm not sure this parameter actually helps. I've run the same migration on 3 clusters now with that parameter set to 0, and it worked on 2 of the 3 clusters. I have checked the parameters and they are the same for all clusters, so I'm not sure if I just got lucky on the 2 clusters where it worked. I also ran a much smaller migration on one of the clusters where the original migration worked, but it is failing now as well. The error and logs are the same as before.
What I've found is that when it DOES work, it goes into a state where it spams this:
2019-12-17 10:27:47 ERROR Timeout while waiting for events up to lock
2019-12-17 10:27:47 ERROR 2019-12-17 10:27:47 ERROR Timeout while waiting for events up to lock
2019-12-17 10:27:47 INFO Looking for magic cut-over table
2019-12-17 10:27:47 INFO Will now proceed to drop magic table and unlock tables
2019-12-17 10:27:47 INFO Dropping magic cut-over table
2019-12-17 10:27:47 INFO Dropping magic cut-over table
2019-12-17 10:27:47 INFO Dropping table `xxx`.`xxx_del`
2019-12-17 10:27:48 INFO Releasing lock from `xxx`.`xxx`, `xxx`.`xxx_del`
2019-12-17 10:27:48 INFO Table dropped
2019-12-17 10:27:48 INFO Tables unlocked
2019-12-17 10:27:48 INFO Grabbing voluntary lock: gh-ost.1339250.lock
2019-12-17 10:27:48 INFO Setting LOCK timeout as 6 seconds
2019-12-17 10:27:48 INFO Looking for magic cut-over table
2019-12-17 10:27:48 INFO Creating magic cut-over table `xxx`.`xxx_del`
2019-12-17 10:27:48 INFO Magic cut-over table created
2019-12-17 10:27:48 INFO Locking `xxx`.`xxx`, `xxx`.`xxx_del`
2019-12-17 10:27:48 INFO Tables locked
2019-12-17 10:27:48 INFO Session locking original & magic tables is 1339250
2019-12-17 10:27:48 INFO Writing changelog state: AllEventsUpToLockProcessed:1576578468842536707
2019-12-17 10:27:48 INFO Waiting for events up to lock
2019-12-17 10:27:51 ERROR Timeout while waiting for events up to lock
2019-12-17 10:27:51 ERROR 2019-12-17 10:27:51 ERROR Timeout while waiting for events up to lock
2019-12-17 10:27:51 INFO Looking for magic cut-over table
2019-12-17 10:27:51 INFO Will now proceed to drop magic table and unlock tables
2019-12-17 10:27:51 INFO Dropping magic cut-over table
2019-12-17 10:27:51 INFO Dropping magic cut-over table
2019-12-17 10:27:51 INFO Dropping table `xxx`.`xxx_del`
2019-12-17 10:27:52 INFO Table dropped
2019-12-17 10:27:52 INFO Releasing lock from `xxx`.`xxx`, `xxx`.`xxx_del`
2019-12-17 10:27:52 INFO Tables unlocked
2019-12-17 10:27:52 INFO Grabbing voluntary lock: gh-ost.1339247.lock
2019-12-17 10:27:52 INFO Setting LOCK timeout as 6 seconds
2019-12-17 10:27:52 INFO Looking for magic cut-over table
2019-12-17 10:27:52 INFO Creating magic cut-over table `xxx`.`xxx_del`
2019-12-17 10:27:52 INFO Magic cut-over table created
2019-12-17 10:27:52 INFO Locking `xxx`.`xxx`, `xxx`.`xxx_del`
2019-12-17 10:27:52 INFO Tables locked
2019-12-17 10:27:52 INFO Session locking original & magic tables is 1339247
2019-12-17 10:27:52 INFO Writing changelog state: AllEventsUpToLockProcessed:1576578472423229562
2019-12-17 10:27:52 INFO Waiting for events up to lock
If you check the timestamps, the retries become very aggressive, and after it has spammed this for a while, it eventually starts spamming these logs:
2019-12-17 10:28:37 INFO Intercepted changelog state AllEventsUpToLockProcessed
2019-12-17 10:28:37 INFO Handled changelog state AllEventsUpToLockProcessed
and it will complete the migration after this.
FYI: I managed to get the failing migration through by increasing retries to something like 5000. It eventually worked after 1000+ retries. It does not seem that stable, though.
Any news? I've got the same problem. It seems to happen on large tables; in my case the table is 58 GB. After the failed run I dropped the `_gho` table manually, which took nearly 20 seconds. Is that related to the cut-over phase?
I'm also seeing this on AWS Aurora (binlog filtering disabled, running on the master). It works sometimes and not others. I don't think it is a timeout issue related to cut-over, because I have it set at 10 seconds and when I do it manually it only takes ~5 secs. This was a 1-hour process; I don't really want to try it on a much longer one if I am going to get to the end and have it fail.
Update: Looks like my issue was related to the exact-rowcount option issue mentioned here https://github.com/github/gh-ost/issues/830
Any news? In my situation the table is 1.5 TB, and the cut-over phase fails the same way.
```
/opt/gh-ost --conf='/opt/ghost.conf' \
  --database=test \
  --table=oh_order_sales \
  --alter='CHANGE COLUMN ORDER_ID ORDER_ID VARCHAR(150)' \
  --max-load=Threads_running=100 \
  --critical-load=Threads_running=1000 \
  --critical-load-interval-millis=100 \
  --chunk-size=100000 \
  --host=172.20.134.3 \
  --port=3306 \
  --verbose \
  --cut-over=default \
  --default-retries=10 \
  --allow-on-master \
  --initially-drop-ghost-table \
  --dml-batch-size=100 \
  --assume-rbr \
  --assume-master-host=172.20.134.3:3306 \
  --timestamp-old-table \
  --panic-flag-file=/tmp/sbtest1.panic \
  --throttle-flag-file=/tmp/sbtest1.throttle \
  --postpone-cut-over-flag-file=/tmp/sbtest1.postpone \
  --serve-socket-file=/tmp/sbtest1.gh-ost.socket \
  --stack \
  --debug \
  --execute 2>&1 | tee gh-ost_execute.log
```
My company ran into the exact same problem while using gh-ost to run migrations against our Aurora clusters.
After a lot of deep diving, we finally put together this PR which makes the problem visible and protects against cutOver failures.
Now when we run migrations against Aurora, we can see that the value of `HeartbeatLag` grows larger and larger during a long-running migration on a busy table:
Copy: 23000/2172914 1.1%; Applied: 0; Backlog: 0/1000; Time: 1m10s(total), 3s(copy); streamer: mysql-bin-changelog.046700:117936636; Lag: 0.12s, HeartbeatLag: 0.12s, State: migrating; ETA: 4m41s
Copy: 75000/2172914 3.5%; Applied: 0; Backlog: 0/1000; Time: 1m15s(total), 8s(copy); streamer: mysql-bin-changelog.046700:128606180; Lag: 0.12s, HeartbeatLag: 0.12s, State: migrating; ETA: 3m44s
Copy: 128500/2172914 5.9%; Applied: 0; Backlog: 0/1000; Time: 1m20s(total), 13s(copy); streamer: mysql-bin-changelog.046700:134279989; Lag: 0.02s, HeartbeatLag: 2.52s, State: migrating; ETA: 3m26s
Copy: 179000/2172914 8.2%; Applied: 0; Backlog: 0/1000; Time: 1m25s(total), 18s(copy); streamer: mysql-bin-changelog.046700:134279989; Lag: 0.02s, HeartbeatLag: 7.52s, State: migrating; ETA: 3m20s
Copy: 228000/2172914 10.5%; Applied: 0; Backlog: 0/1000; Time: 1m30s(total), 23s(copy); streamer: mysql-bin-changelog.046700:134279989; Lag: 0.12s, HeartbeatLag: 12.52s, State: migrating; ETA: 3m16s
Copy: 279500/2172914 12.9%; Applied: 0; Backlog: 0/1000; Time: 1m35s(total), 28s(copy); streamer: mysql-bin-changelog.046700:134279989; Lag: 0.02s, HeartbeatLag: 17.52s, State: migrating; ETA: 3m9s
Copy: 330000/2172914 15.2%; Applied: 0; Backlog: 0/1000; Time: 1m40s(total), 33s(copy); streamer: mysql-bin-changelog.046700:134279989; Lag: 0.02s, HeartbeatLag: 22.52s, State: migrating; ETA: 3m4s
Copy: 378000/2172914 17.4%; Applied: 0; Backlog: 0/1000; Time: 1m45s(total), 38s(copy); streamer: mysql-bin-changelog.046700:134279989; Lag: 0.02s, HeartbeatLag: 27.52s, State: migrating; ETA: 3m0s
Copy: 427000/2172914 19.7%; Applied: 0; Backlog: 0/1000; Time: 1m50s(total), 43s(copy); streamer: mysql-bin-changelog.046700:134279989; Lag: 0.02s, HeartbeatLag: 32.52s, State: migrating; ETA: 2m55s
Copy: 477500/2172914 22.0%; Applied: 0; Backlog: 0/1000; Time: 1m55s(total), 48s(copy); streamer: mysql-bin-changelog.046700:134279989; Lag: 0.02s, HeartbeatLag: 37.52s, State: migrating; ETA: 2m50s
Copy: 530000/2172914 24.4%; Applied: 0; Backlog: 0/1000; Time: 2m0s(total), 53s(copy); streamer: mysql-bin-changelog.046700:134279989; Lag: 0.02s, HeartbeatLag: 42.52s, State: migrating; ETA: 2m44s
By the time we get to the cut-over stage, `HeartbeatLag` can be minutes. This means that there is absolutely no chance that cut-over could succeed; instead, it will time out after 3 seconds with `ERROR Timeout while waiting for events up to lock`. This will then retry up to 60 times.
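For clarity, `HeartbeatLag` here is (roughly) the age of the most recent heartbeat that has made it from the changelog table on the master back through the binlog stream to gh-ost. A simplified sketch of the idea (illustrative names; the actual implementation in the PR differs in details):

```go
// Illustrative sketch of how a heartbeat-lag metric can be derived; the real
// gh-ost implementation differs in details.
package heartbeat

import (
	"sync/atomic"
	"time"
)

// lastHeartbeatNanos holds the injection timestamp (UnixNano) carried by the
// most recent heartbeat row that was read back from the binlog stream.
var lastHeartbeatNanos int64

// OnHeartbeatEvent would be called by the binlog applier whenever it processes
// a heartbeat row from the changelog table; injectedAt is the timestamp the
// master wrote into that row.
func OnHeartbeatEvent(injectedAt time.Time) {
	atomic.StoreInt64(&lastHeartbeatNanos, injectedAt.UnixNano())
}

// HeartbeatLag is the time elapsed since that heartbeat was injected on the
// master. If delivery of binlog events is delayed, this keeps growing even
// while the reported replica Lag looks tiny.
func HeartbeatLag() time.Duration {
	return time.Since(time.Unix(0, atomic.LoadInt64(&lastHeartbeatNanos)))
}
```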
In the PR that we have submitted, we simply wait before cutOver for the `HeartbeatLag` to reduce to a safe level before continuing. This looks as follows:
Row copy complete
Waiting for heartbeat lag to be low enough to proceed
current HeartbeatLag (44.16s) is too high, it needs to be less than --max-lag-millis (1.50s) to continue
Eventually, `HeartbeatLag` reduces enough to proceed, and cutOver succeeds on the first try:
Heartbeat lag is low enough, proceeding
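Conceptually, the wait is just a loop in front of the cut-over attempt, something like this sketch (it assumes the `HeartbeatLag()` helper from the earlier sketch; the merged PR is the authoritative version):

```go
// Sketch only: block the cut-over until heartbeat lag drops below
// --max-lag-millis, or give up after an overall deadline. Names are illustrative.
package heartbeat

import (
	"fmt"
	"log"
	"time"
)

// WaitForLowHeartbeatLag polls HeartbeatLag() until it is below maxLagMillis,
// or returns an error once the overall deadline has passed.
func WaitForLowHeartbeatLag(maxLagMillis int64, deadline time.Duration) error {
	start := time.Now()
	maxLag := time.Duration(maxLagMillis) * time.Millisecond
	for {
		lag := HeartbeatLag()
		if lag <= maxLag {
			log.Printf("Heartbeat lag is low enough, proceeding")
			return nil
		}
		if time.Since(start) > deadline {
			return fmt.Errorf("heartbeat lag still %v after waiting %v", lag, deadline)
		}
		log.Printf("current HeartbeatLag (%.2fs) is too high, it needs to be less than --max-lag-millis (%.2fs) to continue",
			lag.Seconds(), maxLag.Seconds())
		time.Sleep(time.Second)
	}
}
```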
Update: Because of the visibility into `HeartbeatLag` that this PR enabled, we were able to experiment with a number of Aurora parameters and eventually figure out why `HeartbeatLag` was growing so quickly.
On our clusters, the parameter `aurora_binlog_replication_max_yield_seconds` defaulted to 60 seconds. By changing this value to 0 (on a test cluster), we were able to completely remove the `HeartbeatLag` that we were seeing.
Instead, we started to see the following while running migrations:
Copy: 2342000/2256823 103.8%; Applied: 0; Backlog: 0/1000; Time: 4m0s(total), 3m51s(copy); streamer: mysql-bin-changelog.046761:69254547; Lag: 0.11s, HeartbeatLag: 0.11s, State: migrating; ETA: due
2021-02-03 13:57:36 INFO Row copy complete
Copy: 2401160/2401160 100.0%; Applied: 0; Backlog: 0/1000; Time: 4m5s(total), 3m56s(copy); streamer: mysql-bin-changelog.046761:81901819; Lag: 0.01s, HeartbeatLag: 0.07s, State: migrating; ETA: due
2021-02-03 13:57:36 INFO Waiting for heartbeat lag to be low enough to proceed
2021-02-03 13:57:36 INFO Heartbeat lag is low enough, proceeding
This parameter is described in Aurora's Optimizing binary log replication doc.
Warning: We have not changed this parameter in our production environment, as it may not be safe for us to do so. We are actually leaning towards keeping this value at its default of 60 seconds and instead relying on the safety added to gh-ost by this PR.
@knepe https://github.com/github/gh-ost/pull/921 is now merged to master. Can you confirm this fixes the issue?
Should https://github.com/github/gh-ost/pull/805 be reverted?
Should #805 be reverted?
@bbuchalter what is the context for reverting #805? It's an Aurora documentation PR?
@shlomi-noach it's based on this comment:
Re-opening this, because I've had random success with this unfortunately, so I'm not sure this parameter actually helps.
FWIW, Gusto does not use this advice and has not encountered this issue. It seems this may not be a hard requirement?
All, I have no insight into this as I'm not using RDS myself.
As a data point, Gusto uses RDS and we do not have `aurora_enable_repl_bin_log_filtering` set to zero while using gh-ost.