gh-ost icon indicating copy to clipboard operation
gh-ost copied to clipboard

Document true maximum cutover timeout length

Open hakusaro opened this issue 5 years ago • 4 comments

This is the commit message, soz for formatting in the PR:

When attempting a gh-ost migration, I was observed that gh-ost sets the lock timeout to 6 seconds, not 3 seconds, when attempting to lock tables. This is potentially dangerous, because a user may set this flag to 10 seconds and expect that the max table unavailability is a total of 10 seconds, when it's really 20 seconds. I observed this happening quite clearly, as gh-ost was unable to obtain the first stage lock several times. I didn't realize the docs didn't account for this total lock time.

(Now, they do!)

I totally get why you would want issues to accompany PRs, but I feel like this is an easily solved problem. I can make an issue too, but I don't think it will realistically matter too much other than potentially pushing the default value down

hakusaro avatar Dec 11 '20 01:12 hakusaro

@hakusaro thanks for this PR! I agree it doesn't really need an issue 👍

Were you able to explain why the value is doubled by gh-ost and where in the code this is happening? I'm curious if this behaviour is unintentional and/or a bug. Thanks!

timvaillancourt avatar Aug 18 '22 20:08 timvaillancourt

@timvaillancourt I don't quite remember how I arrived at this calculation, but it only happened because I was in the exact state described here: setting the timeout to e.g., 10 seconds doubled it to 20 seconds because if it fully locks up for the duration of time you specify and it backs off and has to try again, you end up paying the cost twice. I think it's somewhere like here but the context is kinda lost now.

For some context, at @polleverywhere we switched to gh-ost as an alternative to LHM specifically because we were having lock contention at the exact cutover point. When using LHM, this resulted in a 60-second table outage and eventually retrying would defeat the lock, but it was way too dangerous. gh-ost gave us a way to define the exact outage period. I definitely know this caused 6-second table outages with the smallest duration.

All I can say is that while gh-ost is graceful here, it's just slightly less graceful as unwinding the cutover takes longer than expected. I'd be happy to dive further in but I'm not an expert in the codebase, I just know what we had happen. It may have changed in the time since I sent this PR and now, but I'm not certain.

hakusaro avatar Aug 18 '22 21:08 hakusaro

It might be easier to find the cause by debugging and specifically inducing a lock contention problem (lock the table up, try to migrate, hit the cutover timeout, and see which code path is taken) and seeing how gh-ost behaves recovering. It works perfectly in our experience, it's just twice as long as we expected to get the table back into a non-outage state.

hakusaro avatar Aug 18 '22 21:08 hakusaro

@hakusaro ok thanks!

Will investigate to see if a bug fix could make this behave as expected 👍

timvaillancourt avatar Aug 21 '22 23:08 timvaillancourt