helm-controller

Helm Release does not reset itself after any error - shows "reconciliation failed: upgrade retries exhausted revision"

Open scubakiz opened this issue 1 year ago • 3 comments

Describe the bug

When a HelmRelease has a problem, the problem stays forever, even after it's been fixed at the source.

If there are any issues at all with a HelmRelease, there is no way to recover it other than deleting it and letting the reconciler recreate it.

Once the new one is created, it retries the upgrade and sometimes succeeds.

Steps to reproduce

Have a helm release with any problem in it. Fix the problem, reconcile the release. It won't try it again until the release is deleted.

Expected behavior

Every reconcile of a HelmRelease should be independent and should work if there are no issues with the chart.

Screenshots and recordings

{"level":"info","ts":"2022-07-06T06:25:39.086Z","logger":"controller.helmrelease","msg":"reconcilation finished in 39.416357ms, next run in 4m0s","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"atlas-helm-release","namespace":"flux-system"}
{"level":"error","ts":"2022-07-06T06:25:39.086Z","logger":"controller.helmrelease","msg":"Reconciler error","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"atlas-helm-release","namespace":"flux-system","error":"upgrade retries exhausted"}
{"level":"info","ts":"2022-07-06T06:25:45.166Z","logger":"controller.helmrelease","msg":"reconcilation finished in 78.830679ms, next run in 4m0s","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"atlas-helm-release","namespace":"flux-system"}
{"level":"error","ts":"2022-07-06T06:25:45.166Z","logger":"controller.helmrelease","msg":"Reconciler error","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"atlas-helm-release","namespace":"flux-system","error":"upgrade retries exhausted"}
{"level":"info","ts":"2022-07-06T06:25:57.279Z","logger":"controller.helmrelease","msg":"reconcilation finished in 111.76402ms, next run in 4m0s","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"atlas-helm-release","namespace":"flux-system"}
{"level":"error","ts":"2022-07-06T06:25:57.279Z","logger":"controller.helmrelease","msg":"Reconciler error","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"atlas-helm-release","namespace":"flux-system","error":"upgrade retries exhausted"}
{"level":"info","ts":"2022-07-06T06:26:21.324Z","logger":"controller.helmrelease","msg":"reconcilation finished in 44.912545ms, next run in 4m0s","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"atlas-helm-release","namespace":"flux-system"}
{"level":"error","ts":"2022-07-06T06:26:21.325Z","logger":"controller.helmrelease","msg":"Reconciler error","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"atlas-helm-release","namespace":"flux-system","error":"upgrade retries exhausted"}
{"level":"info","ts":"2022-07-06T06:27:09.371Z","logger":"controller.helmrelease","msg":"reconcilation finished in 44.789963ms, next run in 4m0s","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"atlas-helm-release","namespace":"flux-system"}
{"level":"error","ts":"2022-07-06T06:27:09.371Z","logger":"controller.helmrelease","msg":"Reconciler error","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"atlas-helm-release","namespace":"flux-system","error":"upgrade retries exhausted"}

OS / Distro

N/A

Flux version

v0.31.1

Flux check

► checking prerequisites
✗ flux 0.31.1 <0.31.3 (new version is available, please upgrade)
✔ Kubernetes 1.21.9 >=1.20.6-0
► checking controllers
✔ helm-controller: deployment ready
► ghcr.io/fluxcd/helm-controller:v0.22.1
✔ image-automation-controller: deployment ready
► ghcr.io/fluxcd/image-automation-controller:v0.23.2
✔ image-reflector-controller: deployment ready
► ghcr.io/fluxcd/image-reflector-controller:v0.19.1
✔ kustomize-controller: deployment ready
► ghcr.io/fluxcd/kustomize-controller:v0.26.1
✔ notification-controller: deployment ready
► ghcr.io/fluxcd/notification-controller:v0.24.0
✔ source-controller: deployment ready
► ghcr.io/fluxcd/source-controller:v0.25.5
✔ all checks passed

Git provider

GitHub

Container Registry provider

Azure Container Registry

Additional context

No response

Code of Conduct

  • [X] I agree to follow this project's Code of Conduct

scubakiz avatar Jul 06 '22 06:07 scubakiz

This is a problem for me also. It would be good to have a simple command that resets the HelmRelease so it starts retrying again, rather than failing with no way to recover other than removing the HelmRelease.

MarkLFT avatar Jul 12 '22 03:07 MarkLFT

Upon further testing, I see this problem happen all the time, not just when errors occur. Basically, after the initial release, the HelmRelease goes into a permanent sleep. If you update the source repo a few days later, the HelmRelease never picks it up and applies the changes (even though the GitRepository gets the changes, as documented by the alert it fires).

If you delete the HelmRelease, its replacement works fine.

In short: HelmRelease goes dormant after its initial run or two and never wakes up.

scubakiz avatar Jul 12 '22 23:07 scubakiz

It is a huge problem for me. This issue seems to be a duplicate of #454. Instead of deleting the HelmRelease, running flux suspend followed by flux resume on the HelmRelease works around the problem for me.

hedwig2013 avatar Jul 13 '22 13:07 hedwig2013
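
For readers who want to try the suspend/resume workaround described above, a minimal sketch follows. The release name and namespace are taken from the logs earlier in this issue and are only placeholders. The `run` helper is hypothetical and not part of Flux: it prints each command by default so the script can be reviewed first, and executes it only when `RUN=1` is set.

```shell
#!/bin/sh
# Placeholder values taken from the logs in this issue; substitute your own.
NAME="atlas-helm-release"
NAMESPACE="flux-system"

# Hypothetical helper: print the command unless RUN=1 is set,
# in which case it is executed against the current cluster.
run() {
  if [ "${RUN:-0}" = "1" ]; then
    "$@"
  else
    echo "$@"
  fi
}

# Suspend, then resume, the HelmRelease so the controller picks it up again.
run flux suspend helmrelease "$NAME" -n "$NAMESPACE"
run flux resume helmrelease "$NAME" -n "$NAMESPACE"
```

Run it once as-is to see the two commands, then again with `RUN=1` to apply them. Whether this clears the "upgrade retries exhausted" state may depend on the controller version.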

In the v0.37.0 release of the helm-controller, two new annotations were introduced to reset the failure counters to allow the controller to retry according to the configured remediation strategy, and to allow a one-off forced Helm install or upgrade.

You can read more about this in this blog post.

hiddeco avatar Dec 12 '23 17:12 hiddeco
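
As a rough illustration of the annotations mentioned in the last comment, the sketch below assumes the keys are `reconcile.fluxcd.io/resetAt` (reset the failure counters) and `reconcile.fluxcd.io/forceAt` (one-off forced install or upgrade), and that each takes effect only when its value matches `reconcile.fluxcd.io/requestedAt`. The thread itself does not name the keys, so verify them against the helm-controller documentation for your version. The name, namespace, and the `run` helper are placeholders, and the helper prints the command unless `RUN=1` is set.

```shell
#!/bin/sh
# Placeholder values taken from the logs in this issue; substitute your own.
NAME="atlas-helm-release"
NAMESPACE="flux-system"

# Which request to make: resetAt (reset failure counters) or
# forceAt (one-off forced install or upgrade). Assumed annotation names.
ACTION="${ACTION:-resetAt}"

# Any value that differs from the previous request works; a timestamp is convenient.
TOKEN="$(date +%s)"

# Hypothetical helper: print the command unless RUN=1 is set.
run() {
  if [ "${RUN:-0}" = "1" ]; then
    "$@"
  else
    echo "$@"
  fi
}

# Set requestedAt and the chosen action annotation to the same token.
run kubectl annotate --overwrite helmrelease "$NAME" -n "$NAMESPACE" \
  "reconcile.fluxcd.io/requestedAt=$TOKEN" \
  "reconcile.fluxcd.io/$ACTION=$TOKEN"
```

Use `ACTION=forceAt` for the forced variant. Newer flux CLI releases appear to expose the same behaviour through `--reset` and `--force` flags on `flux reconcile helmrelease`; check `flux reconcile helmrelease --help` before relying on them.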