WIP: Resurrect
Storyline: https://github.com/github/gh-ost/issues/205
WORK IN PROGRESS: resurrecting a migration after failure.

The idea is that `gh-ost` would routinely dump migration status/context. It would be possible for one `gh-ost` process to fail (e.g. having met `critical-load`) and for another `gh-ost` process to pick up from where the first left off.
Initial commits present exporting of migration context, with some shuffling & cleanup.
TODO:
- [x] must not export passwords
Export is to the changelog table. This ensures atomicity and durability of the write, assuming the changelog table is `InnoDB`. Notable is that if the migrated table is `MyISAM`, so is the changelog table. I'm fine stating that resurrection does not work on `MyISAM`, because `MyISAM`.
https://github.com/github/gh-ost/pull/343/commits/5f25f741ad22e0d234376333e8c7688b071f16b1 makes for something that works! I'll need to iterate to see what has been overlooked, but basically we're getting there fast.
A concern is to not rely on the streamer's last known position, because the streamer writes to a buffer (currently hard-coded to 100 events). Those events would be lost upon resurrection.
- [x] Instead, we should have the migration report the last applied event's coordinates.
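A hedged sketch of that idea, with assumed names throughout: record coordinates only once an event has actually been applied, so the buffered-but-unapplied events can never be silently skipped on resume.

```go
// Sketch with assumed names: track coordinates of the last *applied*
// event, not the streamer's read position, since up to the buffer size
// (100 events) may have been read but not yet applied.
package resurrect

import "sync"

// BinlogCoordinates identifies a position within the binary log stream.
type BinlogCoordinates struct {
	LogFile string
	LogPos  int64
}

// Applier applies streamed binlog events onto the ghost table.
type Applier struct {
	mu          sync.Mutex
	lastApplied BinlogCoordinates
}

// markApplied records an event's coordinates only after the event has
// been fully applied. On resurrection, resuming from these coordinates
// (or earlier) guarantees no buffered-but-unapplied event is skipped.
func (a *Applier) markApplied(c BinlogCoordinates) {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.lastApplied = c
}

// LastAppliedCoordinates is what gets exported with the migration context.
func (a *Applier) LastAppliedCoordinates() BinlogCoordinates {
	a.mu.Lock()
	defer a.mu.Unlock()
	return a.lastApplied
}
```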
Off-issue, you mentioned gh-ost was having checksum issues with resurrections. When you mentioned that I was thinking: could it be related to the fact that we have two things going on, the backlog and the iteration of inserts? I hope that makes sense. Nonetheless, it was something that popped into my head that I hoped might help when you get back to this. (Not fully understanding the code changes, this might already be something you're handling.)
@tomkrouper the conjecture is as follows:

- assuming `gh-ost` breaks while copying rows `5,000`-`5,100`
- and while reading `mysql-bin.000120` at position `123456`

it should be OK to resume execution:

- start with copying rows `4,300`-`4,400` (way before the point of breakage)
- start with reading binary log `mysql-bin.000120` at position `121234` (way before the point of breakage)

this is the conjecture's logic:

- re-copying same rows just overwrites existing rows (or adds rows that weren't there before!)
- re-applying binary logs is an idempotent action

I find it a bit difficult right now to substantiate these claims, but I believe them to be true.
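The idempotency claims lean on the shape of the statements involved: gh-ost copies rows with `insert ignore` and applies binlog row-inserts with `replace into`, both of which are safe to repeat. A minimal sketch of those query shapes, assuming an illustrative two-column table:

```go
// Sketch only: illustrative query shapes, not gh-ost's query builder.
// Assumes a ghost table with columns (id, data); names are made up.
package resurrect

import "fmt"

// rangeCopyQuery re-copies a chunk of rows. "insert ignore" leaves rows
// that already exist in the ghost table untouched, so re-copying rows
// 4,300-4,400 after a crash at 5,100 merely re-does harmless work.
func rangeCopyQuery(origTable, ghostTable string, minID, maxID int64) string {
	return fmt.Sprintf(
		"insert /* gh-ost */ ignore into %s (id, data) select id, data from %s where id between %d and %d",
		ghostTable, origTable, minID, maxID,
	)
}

// applyInsertEventQuery replays a binlog row-insert. "replace into"
// overwrites any existing row with the same primary key, so re-applying
// events from an earlier binlog position converges to the same state.
// (Real code would use bind parameters rather than string formatting.)
func applyInsertEventQuery(ghostTable string, id int64, data string) string {
	return fmt.Sprintf(
		"replace /* gh-ost */ into %s (id, data) values (%d, '%s')",
		ghostTable, id, data,
	)
}
```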
But then, of course, tests are failing...
@shlomi-noach are there any plans to revisit this feature? I'm looking at gh-ost again and one of the concerns my team has is that we have some very large tables that can take days, if not a week, to copy. If the process were to crash in the middle we'd have a lot of wasted effort, especially since we have to slowly drain the `_gho` table to prevent the dreaded global metadata lock when dropping it. This would be extremely helpful for us!
@Xopherus this isn't on the near future's roadmap. FWIW, we are likewise running week-long, or in one case even 22-day-long, migrations. We use `-critical-load-hibernate-seconds=3600` such that hitting critical load doesn't bail out.
I understand the stress involved with running a week long migration. Our history shows those migrations do not break, hence the Resurrection feature is not urgent for us to implement.
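For reference, a minimal illustrative invocation using that flag (database, table, alter, and threshold values are placeholders; required connection flags are omitted):

```sh
gh-ost \
  --database=mydb \
  --table=mytable \
  --alter="ENGINE=InnoDB" \
  --critical-load="Threads_running=100" \
  --critical-load-hibernate-seconds=3600 \
  --execute
```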
Thanks for the advice @shlomi-noach! Appreciate the wisdom - I've found that tuning gh-ost is one of the challenges because the feedback cycles are so long. I'll have to try that parameter and let you know how it goes.
> I've found that tuning gh-ost is one of the challenges because the feedback cycles are so long

@Xopherus Could you please elaborate on that? I'm not sure I understand.
Oh I just mean that if your migrations can take multiple hours / days, it can be tricky to tune parameters (e.g. critical load threshold or lock cutover timeouts, etc) because it takes longer to experiment. Fortunately we've gotten solid advice from you and others here to help guide us in the right direction.
Hi @shlomi-noach :-) I think "resurrect" is not the best term. It's not a standard technical term. Even doc/command-line-flags.md has to clarify: "It is possible to resurrect/resume a failed migration". When people think, "Can I resume an osc?", they'll look for and Google with that term. Imho, "resurrect" will never cross people's minds. By contrast, everyone knows what "resume" (and its reciprocals "suspend" or "pause") mean.

I'd also argue that it's not technically descriptive or intention-revealing. A dead body can be resurrected, and I get the joke with the app being called "ghost", but it begs the question: what does it mean to resurrect a program?

My last argument is: for non-native English speakers/readers, these issues are compounded by uncommon words in a technical context.

I'd vote for pause/resume or start/stop.
Thank you @daniel-nichter
bumping this feature request to check on if there were any changes to make this feature available?
This code is fairly old and there are a bunch of conflicting files at this point. We don't have any immediate plans to work on this, but I do agree this would be a good feature to have and if anyone would like to continue the work, we'd love the community contribution.