sql: retry IMPORTs longer in cases where drain is encountered

Open ajstorm opened this issue 1 year ago • 0 comments

On the DRT cluster, when we drain nodes to perform a rolling upgrade (or for general chaos testing), IMPORT jobs periodically end up paused. The DRT cluster runs IMPORTs at ~1.5hr intervals, and each one takes 30 minutes to complete, so there's a high probability that an IMPORT will be running during a rolling upgrade of the 15-node cluster (which takes around an hour to complete). Once the IMPORTs enter the paused state, the tables are taken offline, which causes cascading effects for the rest of the workload.

More specifically, the error we're hitting is:

pausing due to error; use RESUME JOB to try to proceed once the issue is resolved, or CANCEL JOB to rollback: exhausted retries: could not register flow because the registry is draining

Since the drain state is envisioned to be temporary (and since the recommendation is for customers to RESUME the job on their own), we should instead keep retrying, on the assumption that the drain operation will eventually complete.

More context for this can be found in this thread.

Feb 23 '24 18:02 ajstorm