operator-controller

🐛 Handle interrupted helm releases in applier

azych opened this pull request 9 months ago • 15 comments

Description

Introduces a workaround for 'interrupted' helm releases that get stuck in a 'pending' (-install/uninstall/rollback) state. This can happen, for example, when the application exits abruptly while one of those operations is in flight. Helm is not able to resolve that state automatically, which means we end up in a permanent reconcile error state.

One of the workarounds repeatedly suggested in the community is to remove the metadata of the pending release, which is the solution chosen here.
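For illustration only, a minimal sketch of that cleanup using helm's storage API (the helper name and placement are assumptions, not this PR's actual code):

```go
package applier

import (
	"errors"

	"helm.sh/helm/v3/pkg/action"
	"helm.sh/helm/v3/pkg/storage/driver"
)

// deletePendingRelease removes the stored metadata of the latest release
// revision if it is stuck in a pending state. Hypothetical helper; name and
// placement are assumptions.
func deletePendingRelease(cfg *action.Configuration, name string) error {
	rel, err := cfg.Releases.Last(name)
	if err != nil {
		if errors.Is(err, driver.ErrReleaseNotFound) {
			return nil // nothing stored yet, nothing to clean up
		}
		return err
	}
	// IsPending() covers pending-install, pending-upgrade and pending-rollback.
	if !rel.Info.Status.IsPending() {
		return nil
	}
	// Delete only the stored release record (the release secret); objects
	// already applied to the cluster are left in place.
	_, err = cfg.Releases.Delete(rel.Name, rel.Version)
	return err
}
```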

For full context, see the discussion in this PR and https://github.com/helm/helm/issues/5595

Reviewer Checklist

  • [ ] API Go Documentation
  • [ ] Tests: Unit Tests (and E2E Tests, if appropriate)
  • [ ] Comprehensive Commit Messages
  • [ ] Links to related GitHub Issue(s)

azych avatar Feb 14 '25 12:02 azych

Deploy Preview for olmv1 ready!

Latest commit: 4e0d40aca824cbc1f7d3dfa394e2f56c90cfec3a
Latest deploy log: https://app.netlify.com/sites/olmv1/deploys/67cefcae3f725b000803d384
Deploy Preview: https://deploy-preview-1776--olmv1.netlify.app

netlify[bot] avatar Feb 14 '25 12:02 netlify[bot]

@joelanford WDYT? are there any additional considerations with this approach?

azych avatar Feb 14 '25 12:02 azych

@azych, my main concern is whether helm allows any objects from the release manifest to be applied while still in the pending phase. If the answer is "yes", I'm concerned that rolling back could mean that some workload from the "to" version may have run a migration that makes it impossible to successfully go back to the "from" version.

Maybe the better option would be to:

  1. Leave the existing half-rolled-out release manifest objects that have already been applied in place on the cluster.
  2. Directly delete the release secret(s) that say the release is pending (not a helm uninstall/rollback). (We have a custom helm release storage driver that shards a release across multiple secrets if necessary to overcome etcd size limits.) A rough sketch of this step follows the list.
  3. Try again, the idea being that helm can essentially pick up where it left off without rolling back to the previous revision.
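A rough client-go sketch of step 2, assuming the label scheme of helm's default secrets driver (owner=helm, name=&lt;release&gt;, status=pending-*); the custom sharded storage driver mentioned above may label its secrets differently, so treat the selector as illustrative only:

```go
package applier

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// deletePendingReleaseSecrets deletes release secrets that record a pending
// state, leaving the applied manifest objects untouched. Hypothetical helper;
// the label selector assumes helm's default secrets driver.
func deletePendingReleaseSecrets(ctx context.Context, cs kubernetes.Interface, namespace, releaseName string) error {
	for _, status := range []string{"pending-install", "pending-upgrade", "pending-rollback"} {
		selector := "owner=helm,name=" + releaseName + ",status=" + status
		if err := cs.CoreV1().Secrets(namespace).DeleteCollection(ctx,
			metav1.DeleteOptions{},
			metav1.ListOptions{LabelSelector: selector},
		); err != nil {
			return err
		}
	}
	return nil
}
```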

joelanford avatar Feb 14 '25 20:02 joelanford

@joelanford I agree that this might be a valid concern in non-installation scenarios, though I also wonder: if we'd ever end up in an interrupted state that would prevent a successful rollback, couldn't the same/similar state issue prevent us from actually progressing with a smooth upgrade when we remove release state info via secret(s) deletion instead? I honestly don't feel knowledgeable enough about helm internals to tell one way or the other at this point.

However, I followed your idea and, based on some basic testing so far, it seems able to resolve the interruption regardless of which specific action was interrupted (installation/upgrade).

azych avatar Feb 17 '25 11:02 azych

Couldn't the same/similar state issue prevent us from actually progressing with a smooth upgrade when we remove release state info via secret(s) deletion instead?

I think there is always the possibility of failing to move forward. The primary cases I can think of for that are:

  • lack of permissions (something for the admin to fix)
  • admission errors (generally not something OLMv1 can fix, though the fact that we don't currently use helm upgrade --force is one area where our behavior might affect the ability to move forward)
  • (not a concern yet, since we don't implement health checking yet, but...) issues where we successfully apply changes, but then the resulting resources don't actually "work".

joelanford avatar Feb 17 '25 16:02 joelanford

Couldn't the same/similar state issue prevent us from actually progressing with a smooth upgrade when we remove release state info via secret(s) deletion instead?

I think there is always the possibility of failing to move forward. The primary cases I can think of for that are:

True, but what I meant was that if an interruption leaves an in-flight action in such a broken state that we wouldn't be able to roll back because of it, there is a chance the same broken state would also prevent moving forward when we use a different workaround to deal with the post-interruption situation. I'd differentiate a 'broken' state caused by an interruption from a 'failed' state, which can definitely happen.

azych avatar Feb 18 '25 09:02 azych

true, but what I meant was that if an interruption leaves an in-flight action in such a broken state that we wouldn't be able to roll back because of it

I see what you mean. My unsubstantiated hunch is that failures during upgrades will largely fall into two buckets:

  1. Failure because everything was going smoothly and the process was interrupted (e.g. context was cancelled, connection with API server was lost, etc.).
  2. Failure because of anything else

IMO, (1) isn't really a failure. It's an indefinite pause in the progression of the upgrade that the helm client doesn't deal with well. I'm primarily concerned with recovering gracefully, where we are able to simply pick up where we left off, and what I'm hoping is that there's a way to convince helm to do that.

My original proposal of "if the most recent release secret is pending, delete it" would rely on helm's adoption logic. What I would envision happening is that the next time through (after deleting the latest pending release), we would:

  1. See the "installed" release
  2. Do a dry-run upgrade, and detect a diff because the previous upgrade attempt did not complete its updates/patches
  3. Start a real upgrade
    • Any objects with the same name/namespace in existing/upgrade releases that we applied successfully during the first attempt would likely see a no-op update/patch
    • Any objects that are new in the upgrade release that we applied successfully during the first attempt would have the correct labels/annotations that allow helm to "adopt" the object into the release (see the sketch after this list)
    • Any objects that we didn't get to before the interruption would be created/updated as usual.
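A hedged illustration of that adoption step, assuming helm's standard release-tracking metadata (the helper name adoptableBy is made up for this example):

```go
package applier

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// adoptableBy reports whether an existing object carries the release-tracking
// metadata that lets helm take ownership of it during an upgrade. The helper
// name is illustrative; the label/annotation keys are the ones helm checks.
func adoptableBy(obj metav1.Object, releaseName, releaseNamespace string) bool {
	labels := obj.GetLabels()
	annotations := obj.GetAnnotations()
	return labels["app.kubernetes.io/managed-by"] == "Helm" &&
		annotations["meta.helm.sh/release-name"] == releaseName &&
		annotations["meta.helm.sh/release-namespace"] == releaseNamespace
}
```

In other words, objects created during the interrupted attempt already carry this metadata, so the retried upgrade can take ownership of them without conflict.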

joelanford avatar Feb 18 '25 22:02 joelanford

One question that came to mind for me after we wrapped our discussion today: Will we be able to distinguish between "actual failure" and "resumable interruption"? I think that distinction matters because we may want to handle these scenarios differently.

Another thing that came to mind (going back to the assumption I was talking about in the discussion). If we see a pending upgrade, would it make sense to try to complete that particular upgrade to that particular bundle before we attempt to resolve a new bundle?

Consider this scenario:

  • v1 is "deployed"
  • v2 upgrade started, but didn't finish
  • v3 is now available in the catalog; there is an upgrade edge from v1, but not from v2

Since we partially rolled out v2 (or maybe we fully rolled it out, but our network connection failed right as we were about to set the release secret to "deployed"), it seems like we should not proceed to v3 because we have to assume that whatever it is that makes v2 not upgradeable to v3 was included in the partial rollout.

What if the scenario changes where v3 upgrades from both v1 and v2? Would it be safe then to "resume" the upgrade but instead go to v3? I feel like there are scenarios there where the answer is still no.

For example: suppose there are two datasets managed by the operator's manifests, where:

  1. v1 has two workloads that individually migrate their respective datasets to level 1
  2. v2 has two workloads that individually migrate their respective datasets to level 2
  3. v3 expects the datasets to be at the same migration level, either both datasets at level 1 or both datasets at level 2, but not a mix.

If v2 partially rolls out such that one dataset gets migrated, but the other doesn't, then we can't migrate to v3. It seems reasonable that a v2 -> v3 upgrade edge has the implicit pre-condition that v2 is healthy (by our best possible definition of healthy).

Therefore, it seems like the only way to do this safely is to complete the upgrade to v2 before we try to do anything else. In theory, the pending release secret has all of the bookkeeping necessary to actually try to resume v2, even if the v2 bundle is no longer available.
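To make that last point concrete, here is a hedged sketch of what "resume from the pending release's bookkeeping" could look like; resumePendingUpgrade is a hypothetical helper, not an existing implementation, and the pending revision is deleted first because helm refuses to start an upgrade while one exists:

```go
package applier

import (
	"fmt"

	"helm.sh/helm/v3/pkg/action"
	"helm.sh/helm/v3/pkg/release"
)

// resumePendingUpgrade retries the upgrade recorded in the latest pending
// revision, using the chart and values stored there, so the original bundle
// does not need to be resolvable anymore.
func resumePendingUpgrade(cfg *action.Configuration, name string) (*release.Release, error) {
	pending, err := cfg.Releases.Last(name)
	if err != nil {
		return nil, err
	}
	if !pending.Info.Status.IsPending() {
		return nil, fmt.Errorf("release %q is not pending, nothing to resume", name)
	}
	// Drop the pending revision first; helm refuses to upgrade while one exists.
	if _, err := cfg.Releases.Delete(pending.Name, pending.Version); err != nil {
		return nil, err
	}
	// Re-apply the chart and values captured in the pending revision.
	upgrade := action.NewUpgrade(cfg)
	upgrade.Namespace = pending.Namespace
	return upgrade.Run(name, pending.Chart, pending.Config)
}
```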

Thoughts?

joelanford avatar Feb 19 '25 18:02 joelanford

One question that came to mind for me after we wrapped our discussion today: Will we be able to distinguish between "actual failure" and "resumable interruption"? I think that distinction matters because we may want to handle these scenarios differently.

Looking at what actually happens within helm/pkg/action/upgrade.go:Upgrade(): once the in-flight release is created in storage with a 'pending' state (end of performUpgrade()) and the actual upgrade logic starts, if an error occurs at any step, that 'pending' state is replaced by failRelease() and the error is also returned to the caller.

So while resuming the interrupted release does not mean it will succeed (or had already succeeded), it should mean that it did not fail up to the point of interruption. Theoretically, the interruption could still happen just before the 'pending' state is replaced and recorded, in which case a failure would be 'masked' by the 'pending' state.
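Put as a (hypothetical) helper, the distinction described above might look like this; the bucket names are illustrative:

```go
package applier

import "helm.sh/helm/v3/pkg/release"

// classifyRelease labels the latest stored release based on the behavior
// described above: helm replaces the pending status on error, so a release
// still marked pending is treated as an interruption rather than a failure.
func classifyRelease(rel *release.Release) string {
	switch {
	case rel == nil:
		return "none"
	case rel.Info.Status.IsPending():
		// pending-install / pending-upgrade / pending-rollback: the action was
		// interrupted before helm recorded an outcome.
		return "interrupted"
	case rel.Info.Status == release.StatusFailed:
		return "failed"
	default:
		return "ok"
	}
}
```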

Another thing that came to mind (going back to the assumption I was talking about in the discussion). If we see a pending upgrade, would it make sense to try to complete that particular upgrade to that particular bundle before we attempt to resolve a new bundle? ... Therefore, it seems like the only way to do this safely is to complete the upgrade to v2 before we try to do anything else. In theory, the pending release secret has all of the bookkeeping necessary to actually try to resume v2, even if the v2 bundle is no longer available.

Theoretically, and if we can (and want to) detect those new upgrade paths, I could also see attempting a rollback as a valid option. That way, if there were upgrade path changes, we can make a fresh start that takes them into consideration. If not, and assuming a rollback succeeds and is properly set up (I'm not sure how concerned we should be about whether it is), we simply retry.

Thoughts?

My general thoughts about this go back to the workarounds for this issue that are "available" in the community (e.g. the threads I linked to in the doc comment). What's repeatedly mentioned there is:

  • uninstall/rollback (my initial approach)
  • pending release secret deletion

Some people also bring up manually editing the 'pending' status to 'failure' or 'success' (if we can detect it).

At this point I don't really have a 'clear winner' that I could call the absolute best approach. All of these are really 'hacks' (maybe uninstall/rollback less so) that just work around the problem, and the more we discuss it, the more scenarios we seem to come up with in which each of these workarounds might not play out perfectly and we might not be able to resolve the situation to a 'sane' state.

My question would be: do we really need to guarantee that? Should we dig deeper and hope we find a bulletproof solution (which I don't think exists, given that the main problem isn't really on our side)? Or, given the effort so far, should we just pick one of those "established" ideas and provide a "best effort" workaround on our side? To be fair, I thought we kind of did that already with secret deletion: https://github.com/operator-framework/operator-controller/pull/1776#discussion_r1961163821

BTW, there has been some activity around resolving this on the helm side (either in 3.x or 4); see https://github.com/helm/helm/issues/11863 and https://github.com/helm/community/pull/354

azych avatar Feb 20 '25 10:02 azych

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 68.57%. Comparing base (dbd2bd9) to head (4e0d40a). Report is 83 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1776      +/-   ##
==========================================
+ Coverage   68.45%   68.57%   +0.11%     
==========================================
  Files          63       63              
  Lines        5136     5158      +22     
==========================================
+ Hits         3516     3537      +21     
- Misses       1390     1391       +1     
  Partials      230      230              
Flag   Coverage Δ
e2e    51.67% <64.00%> (+0.06%) :arrow_up:
unit   56.18% <100.00%> (+0.16%) :arrow_up:

Flags with carried forward coverage won't be shown.


codecov[bot] avatar Feb 24 '25 13:02 codecov[bot]

Thanks for the review @camilamacedo86. I'd still like to hear from @joelanford before potentially moving forward with those changes.

/hold

azych avatar Mar 07 '25 09:03 azych

@camilamacedo86 Thanks for the review! For now I'm going to back out the approval so that another maintainer doesn't come and merge this before we have a chance to have a conversation about how to proceed. Our decisions here have broad impact on the safety guarantees and automation that OLM can provide, so I just want to make sure that we all agree with the ultimate direction.

The outcome of the discussion should be enshrined in an RFC so that we can go back and find the reference point for the ultimate decision we make, since this is so central to OLM behavior.

joelanford avatar Mar 10 '25 18:03 joelanford

Thank you @joelanford

Yep, I would update this one again with:

/lgtm cancel
/approve cancel

Based on the discussion:

I was thinking that if the chart always ends up in 'Pending' for an unknown reason, we might have a problem if we don't have the ability to simply retry X times and after that set Failed. :-(

camilamacedo86 avatar Mar 12 '25 13:03 camilamacedo86

I was thinking that if the chart always ends up in 'Pending' for an unknown reason, we might have a problem if we don't have the ability to simply retry X times and after that set Failed. :-(

For something like that to happen (i.e. the release constantly ending up in a 'pending-x' state), the process (generally our pod) would have to keep being interrupted (e.g. by deletion) within the same very specific, narrow window: after we start the helm action (installation/upgrade) but before it finishes, or more precisely before helm manages to set a status other than pending on the release (see https://github.com/operator-framework/operator-controller/pull/1776#issuecomment-2671051469).

Also, per that linked comment and the traced helm library code, whenever an error occurs during an action, the pending state is replaced and the action is returned to the caller as failed, with the release status set accordingly. There is a theoretical possibility of a failure happening followed by an immediate interruption, which could cause the release to remain in a 'pending' state even though there was a failure. With the approach of deleting the in-flight release, we'd run into the same failed state during the next reconciliation, so we wouldn't lose anything.

I think it's also important to differentiate between the helm release state (a k8s secret in our case) and the ClusterExtension state, which are two different things. Whenever a ClusterExtension hits an error during reconciliation, it will be retried and the error will be recorded under the 'Progressing' condition type in '.Status.Conditions'. The same would happen with an error that comes from a pending helm action.
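As a minimal sketch of that distinction (the condition type comes from the API, but the reason/message wording here is illustrative rather than operator-controller's exact strings):

```go
package controller

import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// setProgressingRetrying records a reconcile error (including one caused by a
// pending helm action) as a Progressing condition on the ClusterExtension
// status; the release secret itself is untouched by this.
func setProgressingRetrying(conditions *[]metav1.Condition, observedGeneration int64, err error) {
	meta.SetStatusCondition(conditions, metav1.Condition{
		Type:               "Progressing",
		Status:             metav1.ConditionTrue,
		Reason:             "Retrying",
		Message:            err.Error(),
		ObservedGeneration: observedGeneration,
	})
}
```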

azych avatar Mar 14 '25 13:03 azych

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

openshift-merge-robot avatar Apr 10 '25 01:04 openshift-merge-robot

Closing in favor of the rebased version #1987 (I couldn't push the changes directly here =S).

perdasilva avatar May 22 '25 07:05 perdasilva