
Question: Primary instance restarts

Open jlindenbaum opened this issue 2 years ago • 3 comments

Context

We're running a large migration of a data type, int -> bigint, on a table (1.4bn rows / 2TB in size). It runs on a CloudSQL MySQL 5.7 primary (--allow-on-master, with one replica). The replica has parallel replication enabled. The instance is now scheduled for maintenance by Google.

Command we're running

gh-ost --max-load=Threads_running=25 \
    --critical-load=Threads_running=300 \
    --chunk-size=1000 \
    --throttle-control-replicas=replica_ip \
    --user="ghost-replica" \
    --password=xxx \
    --host=primary_ip \
    --allow-on-master \
    --database="integrations" \
    --table="sales" \
    --verbose \
    --throttle-flag-file="/tmp/ghost-throttle-pos" \
    --postpone-cut-over-flag-file="/tmp/ghost-postpone-pos" \
    --panic-flag-file="/tmp/ghost-panic-pos" \
    --alter="MODIFY COLUMN id bigint(20) UNSIGNED NOT NULL AUTO_INCREMENT" \
    --execute

Problem

Initial testing of this scenario showed that if the primary restarted while a migration was running, gh-ost would not reconnect: it would show errors about its host connection going away and would no longer migrate. The program never exited; it would just sit there with no more updates occurring.

Questions

  • Is this avoidable?
  • Can I change something in my interactive shell that would re-start the connection to the host?
  • Is the cut-over dependent on replica lag? Is a viable workaround (to speed this up) to ignore our replica lag and let the copy "run wild" with no consideration of the replica, reducing our days to completion? We know that would mean a multi-day replication lag on the replica, which we can tolerate for a short time. (A sketch of how this could be adjusted mid-migration follows below.)
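
For reference, a minimal sketch of how the lag tolerance could be relaxed mid-migration via gh-ost's interactive socket, assuming the default socket path of /tmp/gh-ost.<schema>.<table>.sock for this database/table (the values are illustrative only):

    # raise the tolerated replica lag (milliseconds; 1 hour here is only an example)
    echo "max-lag-millis=3600000" | nc -U /tmp/gh-ost.integrations.sales.sock
    # or, if your build accepts an empty list, stop consulting the replica's lag entirely
    echo "throttle-control-replicas=" | nc -U /tmp/gh-ost.integrations.sales.sock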

While we can postpone the maintenance window of the database, we are doubtful the migration will finish in time: it is ~45% complete, with an ETA of ~30 days.

In the meantime we are trying any optimization that reduces writes to our DB, to relieve the binlog and hopefully have the migration throttled less.

Thanks in advance!

jlindenbaum avatar Aug 25 '22 17:08 jlindenbaum

@jlindenbaum while there are a few cases where queries are retried (in case of timeout, etc.), gh-ost usually errs on the side of caution when there is a failure in the migration, as there are so many scenarios to recover from.

Some known scenarios where the migration will fail:

  • Loss of the "applier" (replica) or "inspector" (primary)
  • Orchestrator failover (read_only=ON on the primary)
  • Certain server-level errors from MySQL
  • Certain network-level errors, usually after retries

I feel the approach of erring on the side of caution is a fair trade-off for being sure. That said, I think a lot can be improved to narrow this list of scenarios. Unfortunately, a major blocker for that is adding GTID support to gh-ost (currently it uses file-based binlog positions only), which I have only partially completed. This would allow gh-ost to be sure it is getting the exact events/point-in-time it needs in a number of failure scenarios (network problems, failover events, etc.).
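
To illustrate the difference (this is not gh-ost code, just the two kinds of coordinates involved): gh-ost today resumes the binlog stream from a file name + position, which is only meaningful on the exact server it was read from, whereas a GTID set is globally unique across the replication topology. On MySQL 5.7, for example:

    # shows both kinds of coordinates on the primary
    mysql -h primary_ip -u ghost-replica -p -e "SHOW MASTER STATUS\G"
    # File / Position   -> file-based coordinates (what gh-ost uses today)
    # Executed_Gtid_Set -> GTID coordinates (empty unless gtid_mode=ON)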

I'm a bit surprised to hear the MySQL connections don't reconnect when the host comes up, however. I wouldn't be surprised if there were failures AFTER it was able to reconnect, but not that 🤔. It could be a bug/misconfiguration in our driver/logic or possibly a DNS caching problem. Does the IP of the DNS entry change on CloudSQL restart/maintenance?

timvaillancourt avatar Aug 25 '22 19:08 timvaillancourt

Thanks for the reply. That makes a tonne of sense, erring on the side of caution.

In CloudSQL, a restart or maintenance release (minor MySQL version bump) does not change the IP/DNS. This is usually a < 2 minute operation.

When we tested the reboot cycle, we did test a stop, wait, startup. We'll have to check if that just went beyond the retry of gh-ost. (I couldn't find any documentation around the number of retries. Does it back off, does it try x times in a minute, or something?)

If there is a retry mechanism (and I'm only proposing this for the restart of certain instances), would it be a viable feature request to issue a retry over the socket? Something to get gh-ost back to where it was before the problem or the restart, like echo "reconnect-and-resume" | nc -U socket_path? Or are you saying that until GTID support is in gh-ost this would not be a possibility at all, because gh-ost loses track of the binlog position?

Just curious if there may be some kind of workaround mechanism that lets me "resume" a failed operation due to a network failure.
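
For context, here is a rough sketch of what the interactive socket already supports today (as far as I can tell there is no reconnect-and-resume command); the path again assumes the default /tmp/gh-ost.<schema>.<table>.sock:

    echo status      | nc -U /tmp/gh-ost.integrations.sales.sock   # progress, binlog coordinates, throttle state
    echo throttle    | nc -U /tmp/gh-ost.integrations.sales.sock   # force throttling
    echo no-throttle | nc -U /tmp/gh-ost.integrations.sales.sock   # resume copying
    echo unpostpone  | nc -U /tmp/gh-ost.integrations.sales.sock   # proceed to cut-over once ready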

jlindenbaum avatar Aug 25 '22 20:08 jlindenbaum

@jlindenbaum no problem!

In CloudSQL, a restart or maintenance release (minor MySQL version bump) does not change the IP/DNS. This is usually a < 2 minute operation.

Ok great, that rules out staleness in DNS 👍

When we tested the reboot cycle, we did test a stop, wait, startup. We'll have to check if that just went beyond the retry of gh-ost. (I couldn't find any documentation around the number of retries. Does it back off, does it try x times in a minute, or something?)

There are two main retry helpers in gh-ost, one without exponential backoff and one with exponential backoff, so it depends on which area of the code the migration is in. It looks like we retry on any error whatsoever (i.e. err != nil), which would include an error such as "cannot connect to host" or similar. The default max retries is 60. Perhaps you're hitting an area of the code that is not using a retry 🤔
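
As a hedged aside (check gh-ost --help on your build for the exact flag names and defaults), the retry ceiling and the backoff cap are launch-time settings rather than socket commands:

    # confirm the flags exist on your build; usage text goes to stderr
    gh-ost --help 2>&1 | grep -E 'default-retries|exponential-backoff'
    # --default-retries=60                    max attempts before gh-ost gives up
    # --exponential-backoff-max-interval=64   max seconds between attempts for the backoff-based retrier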

One more CloudSQL question: are the binlog files preserved during maintenance? Ie: if we reconnect and ask for the last binlog file + position, will the server have it?
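
One way to answer that from the CloudSQL side after a maintenance window (this assumes the ghost-replica user has the REPLICATION CLIENT privilege):

    mysql -h primary_ip -u ghost-replica -p -e "SHOW BINARY LOGS;"   # binlog files the server still retains
    mysql -h primary_ip -u ghost-replica -p -e "SHOW MASTER STATUS;" # current file + position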

If there is a retry mechanism, and I'm only proposing the restart of certain instances, would this be a viable feature request, to issue a retry over the socket?

EDIT: probably yes. It sounds like gh-ost may be able to recover from this scenario without too much trouble, assuming that doesn't introduce new risks we haven't thought of.

timvaillancourt avatar Aug 27 '22 18:08 timvaillancourt