rukpak icon indicating copy to clipboard operation
rukpak copied to clipboard

fix: bundledeployment install failures should eventually settle

Open joelanford opened this issue 3 years ago • 8 comments

If a BD managed by the plain provisioner fails to install or upgrade, the plain provisioner performs a rollback to restore the original state before the install/upgrade was tried.

When these errors occur, the controller enters an exponential backoff, and subsequent install/upgrade attempts are made. Initial backoff attempts can start quickly enough that the rollback from the previous failed attempt has not yet fully completed in the background. For example, a namespace that was deleted in a previous rollback may still be terminating by the time the next reconcile attempt happens.

In these cases, we expect the exponential backoff to eventually wait long enough for these cleanup issues to have resolved themselves such that continued attempts stop interfering with the cleanup from the previous attempt.

We report these errors in the BD "Installed" condition. However this presents a problem. When the status changes (from real failure to ephemeral failure, a controller handler requeues the BD for immediate reconciliation. This partially foils the exponential backoff. What results is 3 reconcile attempts every backoff attempt (rather than the expected 1):

  • Attempt 1: fails for the original reason, provisioner rollsback the failure, and updates the BD status to report the failure reason.
  • Attempt 2: (happens immediately due to status update) fails again, this time due to transient issue with cleanup of attempt 1, provisioner rollsback the failure, and again updates the BD status, this time to report the transient error.
  • Attempt 3: (again happens immediately due to another status update) fails again for the same reason as attempt 2, but this time the failure message is the same, so no status update and no immediate reconcile. After attempt 3, the backoff timer is once again in effect.

The reason this is problematic is that the BD spends most of its time with the status set to the result of attempt 3, and this gives users no actionable information about how to resolve the failure.

To solve this, this commit introduces waits (with 30 second timeouts) to enable to the failed install or upgrade to be cleaned up prior to the reconcile function returning, which means that subsequent reconcile attempts will generally (unless the cleanup timed out) be starting from a clean slate.

The introduction of these waits also means we need to run multiple controller threads so that waits are less likely to completely starve the controller of making progress against its reconcile queue.

joelanford avatar Nov 04 '22 21:11 joelanford

Changes look good to me. Can I have the example bundle to see this issue?

akihikokuroda avatar Nov 07 '22 14:11 akihikokuroda

Talked about this in slack. See https://kubernetes.slack.com/archives/C038B7MF75M/p1667597668853099 for more information.

timflannagan avatar Nov 07 '22 14:11 timflannagan

/hold

I think I'm going to refactor this to use https://github.com/operator-framework/helm-operator-plugins/pull/196 when it merges.

joelanford avatar Nov 11 '22 14:11 joelanford

/hold cancel

I've refactored this to make use of the updates that landed in https://github.com/operator-framework/helm-operator-plugins/pull/196

joelanford avatar Nov 23 '22 14:11 joelanford

Is it safe to let multiple controller thread? I don't have any specific points but should we go through code to make sure it's OK? We probably should add some statements in the doc for contributors to the BD controller must be thread safe.

akihikokuroda avatar Nov 23 '22 18:11 akihikokuroda

This PR has become stale because it has been open for 30 days with no activity. Please update this PR or remove the lifecycle/stale label before it is automatically closed in 30 days. Adding the lifecycle/frozen label will cause this PR to ignore lifecycle events.

github-actions[bot] avatar Dec 24 '22 00:12 github-actions[bot]

This PR has been closed as no updates were detected after 30 days of being stale. Please feel free to reopen this PR if necessary.

github-actions[bot] avatar Jan 23 '23 00:01 github-actions[bot]

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-merge-robot avatar Mar 29 '23 03:03 openshift-merge-robot

In a discussion recently, I suggested that we may not want to attempt uninstall/rollback when install/upgrade fails. If we stopped doing that, I think this PR would become obsolete, because it is caused by the uninstall/rollback/retry loop.

joelanford avatar May 21 '24 17:05 joelanford