operator-controller
🐛 Handle interrupted helm releases in applier
Description
Introduces a workaround for 'interrupted' helm releases which enter a 'pending' (-install/uninstall/rollback) state. If that happens, for example because the application exits immediately while one of those operations is in flight, helm is not able to resolve it automatically, which means we end up in a permanent reconcile error state.
One of the workarounds for this that has been repeatedly suggested in the community is to remove the metadata of the pending release, which is the solution chosen here.
For full context, see the discussion in this PR and: https://github.com/helm/helm/issues/5595
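For illustration only (not this PR's actual diff), here is a minimal Go sketch of the "remove the pending release metadata" workaround. It assumes a configured helm action.Configuration whose release storage is backed by the operator's secret driver; the package and function names are hypothetical.

```go
package applier

import (
	"errors"

	"helm.sh/helm/v3/pkg/action"
	"helm.sh/helm/v3/pkg/storage/driver"
)

// clearPendingRelease drops the stored record of an interrupted ("pending-*")
// revision so the next reconcile can retry the helm action. Objects already
// applied to the cluster are intentionally left in place.
func clearPendingRelease(cfg *action.Configuration, name string) error {
	rel, err := cfg.Releases.Last(name)
	if errors.Is(err, driver.ErrReleaseNotFound) {
		return nil // nothing installed yet, nothing to clean up
	}
	if err != nil {
		return err
	}
	if !rel.Info.Status.IsPending() {
		return nil // release settled normally (deployed/failed/...)
	}
	_, err = cfg.Releases.Delete(rel.Name, rel.Version)
	return err
}
```

The next reconcile then retries the install/upgrade as if the interrupted revision had never been recorded.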
Reviewer Checklist
- [ ] API Go Documentation
- [ ] Tests: Unit Tests (and E2E Tests, if appropriate)
- [ ] Comprehensive Commit Messages
- [ ] Links to related GitHub Issue(s)
Deploy Preview for olmv1 ready!
| Name | Link |
|---|---|
| Latest commit | 4e0d40aca824cbc1f7d3dfa394e2f56c90cfec3a |
| Latest deploy log | https://app.netlify.com/sites/olmv1/deploys/67cefcae3f725b000803d384 |
| Deploy Preview | https://deploy-preview-1776--olmv1.netlify.app |
@joelanford WDYT? Are there any additional considerations with this approach?
@azych, my main concern is whether helm allows any objects from the release manifest to be applied while the release is still in the pending phase. If the answer is "yes", I'm concerned that rolling back could mean that some workload from the 'to' version may have run some migration that makes it impossible to successfully go back to the 'from' version.
Maybe the better option would be to:
- Leave the existing half-rolled-out release manifest objects that have already been applied in place on the cluster.
- Directly delete the release secret(s) that say the release is pending (not helm uninstall/rollback). (We have a custom helm release storage driver that shards a release across multiple secrets if necessary to overcome etcd size limits.)
- Try again, the idea being that helm is able to essentially pick up where it left off without rolling back to the previous revision.
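As a rough illustration of the "delete the release secret(s) directly" idea (not the applier's actual code), the sketch below assumes helm's default secrets storage driver and its owner/name/status labels; operator-controller's custom chunked driver shards and labels release records differently, so the selector is illustrative only.

```go
package applier

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// deletePendingReleaseSecrets removes only the bookkeeping secret(s) of a
// pending revision; already-applied release objects stay on the cluster so a
// retry can pick up where helm left off.
func deletePendingReleaseSecrets(ctx context.Context, c kubernetes.Interface, ns, releaseName string) error {
	// Label selector matching helm's default secrets driver; a custom storage
	// driver may label its records differently.
	selector := fmt.Sprintf(
		"owner=helm,name=%s,status in (pending-install,pending-upgrade,pending-rollback)",
		releaseName,
	)
	return c.CoreV1().Secrets(ns).DeleteCollection(ctx,
		metav1.DeleteOptions{},
		metav1.ListOptions{LabelSelector: selector},
	)
}
```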
@joelanford I agree that this might be a valid concern in non-installation scenarios, though I also wonder: if we'd ever end up in an interrupted state that would prevent a successful rollback, couldn't the same/similar state issue prevent us from actually progressing with a smooth upgrade when we remove release state info via secret(s) deletion instead? I honestly don't feel knowledgeable enough about helm internals to be able to tell one way or the other at this point.
However, I followed your idea, and so far, from some basic testing, it seems able to resolve the interruption regardless of the specific action that was interrupted (installation/upgrade).
> Couldn't the same/similar state issue prevent us from actually progressing with a smooth upgrade when we remove release state info via secret(s) deletion instead?
I think there is always the possibility of failing to move forward. The primary cases I can think of for that are:
- lack of permissions (something for the admin to fix)
- admission errors (generally not something OLMv1 can fix, but perhaps our current lack of using helm upgrade --force is an area where our behavior might affect the ability to move forward)
- (not a concern yet, since we don't implement health checking yet, but...) issues where we successfully apply changes, but then the resulting resources don't actually "work".
> Couldn't the same/similar state issue prevent us from actually progressing with a smooth upgrade when we remove release state info via secret(s) deletion instead?
>
> I think there is always the possibility of failing to move forward. The primary cases I can think of for that are:
True, but what I meant was that if an interruption leaves an in-flight action in such a broken state that we wouldn't be able to roll back because of it, there is a chance the same broken state would prevent moving forward when we use a different workaround to deal with the post-interruption situation. I'd differentiate between a 'broken state' caused by an interruption and a 'failed' state, which can definitely happen.
> True, but what I meant was that if an interruption leaves an in-flight action in such a broken state that we wouldn't be able to roll back because of it
I see what you mean. My unsubstantiated hunch is that failures during upgrades will largely fall into two buckets
- Failure because everything was going smoothly and the process was interrupted (e.g. context was cancelled, connection with API server was lost, etc.).
- Failure because of anything else
IMO, (1) isn't really a failure. It's an indefinite pause to the progression of the upgrade that the helm client doesn't deal with well. I'm primarily concerned with recovering gracefully where we are able to simply pick up where we left off, and what I'm hoping is that there's a way to convince helm to do that.
My original proposal of "if the most recent release secret is pending, delete it" would rely on helm's adoption logic. What I would envision happening is that the next time through (after deleting the latest pending release), we would:
- See the "installed" release
- Do a dry-run upgrade, and detect a diff because the previous upgrade attempt did not complete its updates/patches
- Start a real upgrade
- Any objects with the same name/namespace in existing/upgrade releases that we applied successfully during the first attempt would likely see a no-op update/patch
- Any objects that are new in the upgrade release that we applied successfully during the first attempt would have the correct labels that would allow helm to "adopt" the object into the release
- Any objects that we didn't get to before the interruption would be created/updated as usual.
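A hedged sketch of the flow above using the helm Go SDK; the actual diff detection the applier performs against the dry-run manifest is elided, and the function name is hypothetical.

```go
package applier

import (
	"fmt"

	"helm.sh/helm/v3/pkg/action"
	"helm.sh/helm/v3/pkg/chart"
	"helm.sh/helm/v3/pkg/release"
)

// resumeUpgrade re-runs the upgrade after the pending record has been deleted,
// relying on helm's three-way patching and adoption labels to reconcile objects
// that were already applied during the interrupted attempt.
func resumeUpgrade(cfg *action.Configuration, name string, chrt *chart.Chart, vals map[string]interface{}) (*release.Release, error) {
	// Dry run first, mirroring the "detect a diff" step (actual diffing elided).
	dry := action.NewUpgrade(cfg)
	dry.DryRun = true
	if _, err := dry.Run(name, chrt, vals); err != nil {
		return nil, fmt.Errorf("dry-run upgrade: %w", err)
	}

	// Real upgrade: previously applied objects should become no-op patches or be
	// adopted into the new revision; anything missing is created as usual.
	up := action.NewUpgrade(cfg)
	return up.Run(name, chrt, vals)
}
```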
One question that came to mind for me after we wrapped our discussion today: Will we be able to distinguish between "actual failure" and "resumable interruption"? I think that distinction matters because we may want to handle these scenarios differently.
Another thing that came to mind (going back to the assumption I was talking about in the discussion). If we see a pending upgrade, would it make sense to try to complete that particular upgrade to that particular bundle before we attempt to resolve a new bundle?
Consider this scenario:
- v1 is "deployed"
- v2 upgrade started, but didn't finish
- v3 is now available in the catalog; there is an upgrade edge from v1, but not v2
Since we partially rolled out v2 (or maybe we fully rolled it out, but our network connection failed right as we were about to set the release secret to "deployed"), it seems like we should not proceed to v3 because we have to assume that whatever it is that makes v2 not upgradeable to v3 was included in the partial rollout.
What if the scenario changes where v3 upgrades from both v1 and v2? Would it be safe then to "resume" the upgrade but instead go to v3? I feel like there are scenarios there where the answer is still no.
For example: suppose there are two datasets managed by the operator's manifests, where:
- v1 has two workloads that individually migrate their respective datasets to level 1
- v2 has two workloads that individually migrate their respective datasets to level 2
- v3 expects the datasets to be at the same migration level, either both datasets at level 1 or both datasets at level 2, but not a mix.
If v2 partially rolls out such that one dataset gets migrated, but the other doesn't, then we can't migrate to v3. It seems reasonable that a v2 -> v3 upgrade edge has the implicit pre-condition that v2 is healthy (by our best possible definition of healthy).
Therefore, it seems like the only way to do this safely is to complete the upgrade to v2 before we try to do anything else. In theory, the pending release secret has all of the bookkeeping necessary to actually try to resume v2, even if the v2 bundle is no longer available.
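As a sketch of what "resume v2 from the pending release record" could look like with the helm Go SDK (assumption: the full chart and values are persisted in the release record, as with helm's default storage format; the pending record has to be dropped first because helm refuses to start an upgrade while one exists). This is illustrative, not a tested implementation.

```go
package applier

import (
	"fmt"

	"helm.sh/helm/v3/pkg/action"
	"helm.sh/helm/v3/pkg/release"
)

// resumeFromPending retries the interrupted revision using the chart and values
// captured in the pending release record, so the original bundle's contents are
// reused even if that bundle is no longer available in the catalog.
func resumeFromPending(cfg *action.Configuration, name string) (*release.Release, error) {
	pending, err := cfg.Releases.Last(name)
	if err != nil {
		return nil, err
	}
	if !pending.Info.Status.IsPending() {
		return nil, fmt.Errorf("release %q has no pending revision to resume", name)
	}
	// Drop the interrupted revision's bookkeeping first; applied objects stay in place.
	if _, err := cfg.Releases.Delete(pending.Name, pending.Version); err != nil {
		return nil, err
	}
	up := action.NewUpgrade(cfg)
	// Reuse exactly the chart and values the interrupted upgrade was applying.
	return up.Run(name, pending.Chart, pending.Config)
}
```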
Thoughts?
> One question that came to mind for me after we wrapped our discussion today: Will we be able to distinguish between "actual failure" and "resumable interruption"? I think that distinction matters because we may want to handle these scenarios differently.
Looking at what actually happens within helm's pkg/action/upgrade.go:Upgrade(): once the in-flight release is created in storage with a 'pending' state (at the end of performUpgrade()) and the actual upgrade logic starts, an error at any step causes that 'pending' state to be replaced via failRelease(), and the error is also returned to the caller.
So while resuming the interrupted release does not mean it will succeed (or had already succeeded), it should mean that it had not failed up to the point of interruption. Theoretically, the interruption could still happen just before the 'pending' state is replaced and recorded, in which case a failure would be 'masked' by the 'pending' state.
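In code terms, and under the assumption just described, the last stored release status could be bucketed roughly like this (a sketch, not operator-controller's logic):

```go
package applier

import "helm.sh/helm/v3/pkg/release"

// interruptionKind buckets the last stored release status, assuming a pending-*
// status only survives when the process was interrupted before helm could record
// success or failure.
func interruptionKind(rel *release.Release) string {
	switch {
	case rel.Info.Status.IsPending():
		// pending-install / pending-upgrade / pending-rollback
		return "resumable-interruption"
	case rel.Info.Status == release.StatusFailed:
		return "actual-failure"
	default:
		return "settled" // deployed, superseded, uninstalled, ...
	}
}
```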
> Another thing that came to mind (going back to the assumption I was talking about in the discussion). If we see a pending upgrade, would it make sense to try to complete that particular upgrade to that particular bundle before we attempt to resolve a new bundle? ... Therefore, it seems like the only way to do this safely is to complete the upgrade to v2 before we try to do anything else. In theory, the pending release secret has all of the bookkeeping necessary to actually try to resume v2, even if the v2 bundle is no longer available.
Theoretically, if we can (and want to) detect those new upgrade paths, I could also see attempting a rollback as a valid option. That way, if there were upgrade path changes, we can do a fresh start that takes them into consideration. If not, and assuming a rollback will be successful and is properly set up (not sure how concerned we should be about whether it is), we simply retry.
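For reference, the rollback alternative is essentially a one-liner with the helm Go SDK; whether it is safe to roll back after an interruption is exactly the open question above, so this is illustrative only (names hypothetical).

```go
package applier

import "helm.sh/helm/v3/pkg/action"

// rollbackInterrupted rolls back to the previous revision instead of resuming the
// pending one, allowing a later reconcile to re-resolve the upgrade path from a
// known-good starting point.
func rollbackInterrupted(cfg *action.Configuration, name string) error {
	rb := action.NewRollback(cfg)
	rb.Version = 0 // 0 means "previous revision" for helm's rollback action
	return rb.Run(name)
}
```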
> Thoughts?
My general thoughts about this go back to the "available" workarounds for this issue that are present in the community (e.g. the threads I linked to in the doc comment). What's repeatedly mentioned there is:
- uninstall/rollback (my initial approach)
- pending release secret deletion

Some people also bring up manually editing the 'pending' status to 'failure' or 'success' (if we can detect it).
At this point I don't really have a 'clear winner' that I could say is the absolute best way to approach this. All of these are really 'hacks' (maybe uninstall/rollback less so) that just work around the problem, and the more we discuss it, the more scenarios we seem to come up with where each of those workarounds might not play out perfectly and we might not be able to resolve the situation to a 'sane' state.
My question would be: do we really need to ensure that? Should we dig deeper and hope to find a bulletproof solution (which I don't think exists, given that the main problem isn't really on our side)? Or, with the effort so far, should we just consider picking one of those "established" ideas to provide a "best effort" workaround on our side? To be fair, I thought we kind of did that already with secret deletion: https://github.com/operator-framework/operator-controller/pull/1776#discussion_r1961163821
BTW. There has been some activity around resolving this on the helm side (either in 3.x or 4), see: https://github.com/helm/helm/issues/11863 and https://github.com/helm/community/pull/354
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 68.57%. Comparing base (dbd2bd9) to head (4e0d40a). Report is 83 commits behind head on main.
Additional details and impacted files
@@ Coverage Diff @@
## main #1776 +/- ##
==========================================
+ Coverage 68.45% 68.57% +0.11%
==========================================
Files 63 63
Lines 5136 5158 +22
==========================================
+ Hits 3516 3537 +21
- Misses 1390 1391 +1
Partials 230 230
| Flag | Coverage Δ | |
|---|---|---|
| e2e | 51.67% <64.00%> (+0.06%) | :arrow_up: |
| unit | 56.18% <100.00%> (+0.16%) | :arrow_up: |
Thanks for the review @camilamacedo86. I'd still like to hear from @joelanford before potentially moving forward with those changes.
/hold
@camilamacedo86 Thanks for the review! For now I'm going to back out the approval so that another maintainer doesn't come and merge this before we have a chance to have a conversation about how to proceed. Our decisions here have broad impact on the safety guarantees and automation that OLM can provide, so I just want to make sure that we all agree with the ultimate direction.
The outcome of the discussion should be enshrined in an RFC, to make sure we can go back and find our reference point for the ultimate decision we make, since this is so central to OLM behavior.
Thank you @joelanford
Yep, I would update this one again with
/lgtm cancel
/approve cancel
Based on the discussion:
I was thinking that if the chart always ends up in 'Pending' for an unknown reason, we might have a problem if we don't have the ability to simply retry X times and after that set Failed. :-(
> I was thinking that if the chart always ends up in 'Pending' for an unknown reason, we might have a problem if we don't have the ability to simply retry X times and after that set Failed. :-(
For something like that to happen (i.e. the release constantly ends up in a 'pending-x' state), the process (generally our pod) would have to keep being interrupted (e.g. by deletion) during the same very specific, narrow window: after we started a helm action (installation/upgrade) but before it finished, or more precisely before helm managed to set a new status on the release other than pending (see https://github.com/operator-framework/operator-controller/pull/1776#issuecomment-2671051469).
Also, per that linked comment and the traced helm library code, whenever an error occurs during an action, the pending state is replaced with that error state and returned to the caller as a failed action, with the release status being set accordingly. There is a theoretical possibility of a failure happening followed by an immediate interruption, which could cause the release to remain in the 'pending' state even though there was a failure. With the approach of deleting the in-flight release, we'd run into the same failed state during the next reconciliation and wouldn't lose anything.
I think it's also important to differentiate between helm release state (a k8s secret in our case) and clusterextension state, which are two different things. Whenever a clusterextension ends up with an error during reconciliation, it will be retried and the error will be placed in the 'Progressing' condition type in '.Status.Conditions'. This would be the same for an error that comes from a pending helm action.
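As a sketch of how such an error surfaces on the ClusterExtension status (assuming the standard metav1 condition helpers; the reason string and function name are illustrative, not operator-controller's exact API):

```go
package applier

import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// reportRetryableError records a reconcile error (e.g. one caused by a pending
// helm action) on the Progressing condition; the reconcile itself is simply retried.
func reportRetryableError(conditions *[]metav1.Condition, generation int64, err error) {
	meta.SetStatusCondition(conditions, metav1.Condition{
		Type:               "Progressing",
		Status:             metav1.ConditionTrue,
		Reason:             "Retrying", // illustrative reason string
		Message:            err.Error(),
		ObservedGeneration: generation,
	})
}
```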
PR needs rebase.
closing in favor of rebased version #1987 (I couldn't push the changes directly here =S)