WIP: Resurrect
Storyline: https://github.com/github/gh-ost/issues/205
WORK IN PROGRESS: resurrecting a migration after failure.

The idea is that `gh-ost` would routinely dump migration status/context. It would be possible for one `gh-ost` process to fail (e.g. having met `critical-load`) and for another `gh-ost` process to pick up from where the first left off.
Initial commits present exporting of migration context, with some shuffling & cleanup.
TODO:
- [x] must not export passwords
Export is to the changelog table. This ensures atomicity and durability of the write, assuming the changelog table is `InnoDB`. Notable is that if the migrated table is `MyISAM`, so is the changelog table. I'm fine stating that resurrection does not work on `MyISAM`, because `MyISAM`.
https://github.com/github/gh-ost/pull/343/commits/5f25f741ad22e0d234376333e8c7688b071f16b1 makes for something that works! I'll need to iterate to see what has been overlooked, but basically we're getting there fast.
A concern is to not rely on the streamer's last known position, because the streamer writes to a buffer (currently hard-coded to 100 events). Those events would be lost upon resurrection.
- [x] Instead, we should have the migration report the last applied event's coordinates.
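A hedged sketch of that idea, with assumed names throughout: record coordinates only once an event has actually been applied, so the buffered-but-unapplied events can never be silently skipped on resume.

```go
// Sketch with assumed names: track coordinates of the last *applied*
// event, not the streamer's read position, since up to the buffer size
// (100 events) may have been read but not yet applied.
package resurrect

import "sync"

// BinlogCoordinates identifies a position within the binary log stream.
type BinlogCoordinates struct {
	LogFile string
	LogPos  int64
}

// Applier applies streamed binlog events onto the ghost table.
type Applier struct {
	mu          sync.Mutex
	lastApplied BinlogCoordinates
}

// markApplied records an event's coordinates only after the event has
// been fully applied. On resurrection, resuming from these coordinates
// (or earlier) guarantees no buffered-but-unapplied event is skipped.
func (a *Applier) markApplied(c BinlogCoordinates) {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.lastApplied = c
}

// LastAppliedCoordinates is what gets exported with the migration context.
func (a *Applier) LastAppliedCoordinates() BinlogCoordinates {
	a.mu.Lock()
	defer a.mu.Unlock()
	return a.lastApplied
}
```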
Off-issue, you mentioned gh-ost was having checksum issues with resurrections. When you mentioned that I was thinking: could it be related to the fact that we have two things going on, the backlog and the iteration of inserts? I hope that makes sense. Nonetheless, it was something that popped into my head that I hoped might help when you get back to this. (Not fully understanding the code changes, this might already be something you're handling.)
@tomkrouper the conjecture is as follows:

- assuming `gh-ost` breaks while copying rows `5,000`-`5,100`
- and while reading `mysql-bin.000120` at position `123456`

it should be OK to resume execution:

- start with copying rows `4,300`-`4,400` (way before the point of breakage)
- start with reading binary log `mysql-bin.000120` at position `121234` (way before the point of breakage)

this is the conjecture's logic:

- re-copying same rows just overwrites existing rows (or adds rows that weren't there before!)
- re-applying binary logs is an idempotent action

I find it a bit difficult right now to substantiate these claims, but I believe them to be true.
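The idempotency claims lean on the shape of the statements involved: gh-ost copies rows with `insert ignore` and applies binlog row-inserts with `replace into`, both of which are safe to repeat. A minimal sketch of those query shapes, assuming an illustrative two-column table:

```go
// Sketch only: illustrative query shapes, not gh-ost's query builder.
// Assumes a ghost table with columns (id, data); names are made up.
package resurrect

import "fmt"

// rangeCopyQuery re-copies a chunk of rows. "insert ignore" leaves rows
// that already exist in the ghost table untouched, so re-copying rows
// 4,300-4,400 after a crash at 5,100 merely re-does harmless work.
func rangeCopyQuery(origTable, ghostTable string, minID, maxID int64) string {
	return fmt.Sprintf(
		"insert /* gh-ost */ ignore into %s (id, data) select id, data from %s where id between %d and %d",
		ghostTable, origTable, minID, maxID,
	)
}

// applyInsertEventQuery replays a binlog row-insert. "replace into"
// overwrites any existing row with the same primary key, so re-applying
// events from an earlier binlog position converges to the same state.
// (Real code would use bind parameters rather than string formatting.)
func applyInsertEventQuery(ghostTable string, id int64, data string) string {
	return fmt.Sprintf(
		"replace /* gh-ost */ into %s (id, data) values (%d, '%s')",
		ghostTable, id, data,
	)
}
```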
But then, of course, tests are failing...
@shlomi-noach are there any plans to revisit this feature? I'm looking at gh-ost again and one of the concerns my team has is that we have some very large tables that can take days, if not a week, to copy. If the process were to crash in the middle we'd have a lot of wasted effort, especially since we have to slowly drain the `_gho` table to prevent the dreaded global metadata lock when dropping it. This would be extremely helpful for us!
@Xopherus this isn't on the near future's roadmap. FWIW, we are likewise running week-long, or in one case even 22-day-long, migrations. We use `-critical-load-hibernate-seconds=3600` such that hitting critical load doesn't bail out.
I understand the stress involved with running a week long migration. Our history shows those migrations do not break, hence the Resurrection feature is not urgent for us to implement.
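For reference, a minimal illustrative invocation using that flag (database, table, alter, and threshold values are placeholders; required connection flags are omitted):

```sh
gh-ost \
  --database=mydb \
  --table=mytable \
  --alter="ENGINE=InnoDB" \
  --critical-load="Threads_running=100" \
  --critical-load-hibernate-seconds=3600 \
  --execute
```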
Thanks for the advice @shlomi-noach! Appreciate the wisdom - I've found that tuning gh-ost is one of the challenges because the feedback cycles are so long. I'll have to try that parameter and let you know how it goes.
> I've found that tuning gh-ost is one of the challenges because the feedback cycles are so long

@Xopherus Could you please elaborate on that? I'm not sure I understand.
Oh I just mean that if your migrations can take multiple hours / days, it can be tricky to tune parameters (e.g. critical load threshold or lock cutover timeouts, etc) because it takes longer to experiment. Fortunately we've gotten solid advice from you and others here to help guide us in the right direction.
Hi @shlomi-noach :-) I think "resurrect" is not the best term. It's not a standard technical term. Even doc/command-line-flags.md has to clarify: "It is possible to resurrect/resume a failed migration". When people think, "Can I resume an osc?", they'll look for and Google with that term. Imho, "resurrect" will never cross people's minds. By contrast, everyone knows what "resume" (and its reciprocals "suspend" or "pause") mean.

I'd also argue that it's not technically descriptive or intention-revealing. A dead body can be resurrected, and I get the joke with the app being called "ghost", but it begs the question: what does it mean to resurrect a program?

My last argument is: for non-native English speakers/readers, these issues are compounded by uncommon words in a technical context.

I'd vote for pause/resume or start/stop.
Thank you @daniel-nichter
bumping this feature request to check on if there were any changes to make this feature available?
This code is fairly old and there are a bunch of conflicting files at this point. We don't have any immediate plans to work on this, but I do agree this would be a good feature to have and if anyone would like to continue the work, we'd love the community contribution.