fix: bundledeployment install failures should eventually settle
If a BD managed by the plain provisioner fails to install or upgrade, the plain provisioner performs a rollback to restore the original state before the install/upgrade was tried.
When these errors occur, the controller enters an exponential backoff, and subsequent install/upgrade attempts are made. Initial backoff attempts can start quickly enough that the rollback from the previous failed attempt has not yet fully completed in the background. For example, a namespace that was deleted in a previous rollback may still be terminating by the time the next reconcile attempt happens.
In these cases, we expect the exponential backoff to eventually wait long enough for these cleanup issues to have resolved themselves such that continued attempts stop interfering with the cleanup from the previous attempt.
We report these errors in the BD "Installed" condition. However this presents a problem. When the status changes (from real failure to ephemeral failure, a controller handler requeues the BD for immediate reconciliation. This partially foils the exponential backoff. What results is 3 reconcile attempts every backoff attempt (rather than the expected 1):
- Attempt 1: fails for the original reason, provisioner rollsback the failure, and updates the BD status to report the failure reason.
- Attempt 2: (happens immediately due to status update) fails again, this time due to transient issue with cleanup of attempt 1, provisioner rollsback the failure, and again updates the BD status, this time to report the transient error.
- Attempt 3: (again happens immediately due to another status update) fails again for the same reason as attempt 2, but this time the failure message is the same, so no status update and no immediate reconcile. After attempt 3, the backoff timer is once again in effect.
The reason this is problematic is that the BD spends most of its time with the status set to the result of attempt 3, and this gives users no actionable information about how to resolve the failure.
To solve this, this commit introduces waits (with 30 second timeouts) to enable to the failed install or upgrade to be cleaned up prior to the reconcile function returning, which means that subsequent reconcile attempts will generally (unless the cleanup timed out) be starting from a clean slate.
The introduction of these waits also means we need to run multiple controller threads so that waits are less likely to completely starve the controller of making progress against its reconcile queue.
Changes look good to me. Can I have the example bundle to see this issue?
Talked about this in slack. See https://kubernetes.slack.com/archives/C038B7MF75M/p1667597668853099 for more information.
/hold
I think I'm going to refactor this to use https://github.com/operator-framework/helm-operator-plugins/pull/196 when it merges.
/hold cancel
I've refactored this to make use of the updates that landed in https://github.com/operator-framework/helm-operator-plugins/pull/196
Is it safe to let multiple controller thread? I don't have any specific points but should we go through code to make sure it's OK? We probably should add some statements in the doc for contributors to the BD controller must be thread safe.
This PR has become stale because it has been open for 30 days with no activity. Please update this PR or remove the lifecycle/stale label before it is automatically closed in 30 days. Adding the lifecycle/frozen label will cause this PR to ignore lifecycle events.
This PR has been closed as no updates were detected after 30 days of being stale. Please feel free to reopen this PR if necessary.
PR needs rebase.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
In a discussion recently, I suggested that we may not want to attempt uninstall/rollback when install/upgrade fails. If we stopped doing that, I think this PR would become obsolete, because it is caused by the uninstall/rollback/retry loop.