helmrelease "upgrade retries exhausted" regression

Open HaveFun83 opened this issue 4 years ago • 39 comments

Describe the bug

When a HelmRelease is stuck in reconciliation failed: upgrade retries exhausted, only flux CLI v0.16.1 can trigger a successful reconciliation.

Steps to reproduce

When a HelmRelease is stuck in the helm-controller state reconciliation failed: upgrade retries exhausted, this can normally be fixed by running ./flux reconcile helmrelease from the command line, but only up to flux CLI v0.16.1.

Expected behavior

flux reconcile should trigger a Helm upgrade when the HelmRelease is stuck in upgrade retries exhausted.

Screenshots and recordings

This time I upgraded the kube-prometheus-stack HelmRelease.

I tried different versions (v0.17.2, v0.16.2), but only v0.16.1 triggered a successful Helm upgrade.

❯ flux -v
flux version 0.17.2
❯ flux reconcile helmrelease -n monitoring infra --with-source
► annotating HelmRepository prometheus-community in flux-system namespace
✔ HelmRepository annotated
◎ waiting for HelmRepository reconciliation
✔ HelmRepository reconciliation completed
✔ fetched revision 6b8293a6fda62b3318b3bbe18e9e4654b07b3c80
► annotating HelmRelease infra in monitoring namespace
✔ HelmRelease annotated
◎ waiting for HelmRelease reconciliation
✔ HelmRelease reconciliation completed
✗ HelmRelease reconciliation failed

❯ ./flux -v
flux version 0.16.2
❯ ./flux reconcile helmrelease -n monitoring infra --with-source
► annotating HelmRepository prometheus-community in flux-system namespace
✔ HelmRepository annotated
◎ waiting for HelmRepository reconciliation
✔ HelmRepository reconciliation completed
✔ fetched revision 6b8293a6fda62b3318b3bbe18e9e4654b07b3c80
► annotating HelmRelease infra in monitoring namespace
✔ HelmRelease annotated
◎ waiting for HelmRelease reconciliation
✔ HelmRelease reconciliation completed
✗ HelmRelease reconciliation failed

❯ ./flux -v
flux version 0.16.1
❯ ./flux reconcile helmrelease -n monitoring infra --with-source
► annotating HelmRepository prometheus-community in flux-system namespace
✔ HelmRepository annotated
◎ waiting for HelmRepository reconciliation
✔ HelmRepository reconciliation completed
✔ fetched revision 6b8293a6fda62b3318b3bbe18e9e4654b07b3c80
► annotating HelmRelease infra in monitoring namespace
✔ HelmRelease annotated
◎ waiting for HelmRelease reconciliation
✔ HelmRelease reconciliation completed
✔ applied revision 18.1.1
❯ k describe hr -n monitoring infra

Events:
  Type    Reason  Age                 From             Message
  ----    ------  ----                ----             -------
  Normal  info    19m (x403 over 8d)  helm-controller  HelmChart 'flux-system/monitoring-infra' is not ready
  Normal  info    18m (x9 over 8d)    helm-controller  Helm upgrade has started
  Normal  error   18m                 helm-controller  Helm upgrade failed: cannot patch "infra-kube-prometheus-stac-kube-apiserver-burnrate.rules" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post "https://infra-kube-prometheus-stac-operator.monitoring.svc:443/admission-prometheusrules/validate?timeout=10s": EOF && cannot patch "infra-kube-prometheus-stac-kube-apiserver-histogram.rules" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post "https://infra-kube-prometheus-stac-operator.monitoring.svc:443/admission-prometheusrules/validate?timeout=10s": dial tcp 10.240.28.152:443: connect: connection refused

Last Helm logs:

Looks like there are no changes for Service "infra-prometheus-node-exporter"
Looks like there are no changes for DaemonSet "infra-prometheus-node-exporter"
error updating the resource "infra-kube-prometheus-stac-kube-apiserver-burnrate.rules":
   cannot patch "infra-kube-prometheus-stac-kube-apiserver-burnrate.rules" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post "https://infra-kube-prometheus-stac-operator.monitoring.svc:443/admission-prometheusrules/validate?timeout=10s": EOF
error updating the resource "infra-kube-prometheus-stac-kube-apiserver-histogram.rules":
   cannot patch "infra-kube-prometheus-stac-kube-apiserver-histogram.rules" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post "https://infra-kube-prometheus-stac-operator.monitoring.svc:443/admission-prometheusrules/validate?timeout=10s": dial tcp 10.240.28.152:443: connect: connection refused
warning: Upgrade "infra" failed: cannot patch "infra-kube-prometheus-stac-kube-apiserver-burnrate.rules" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post "https://infra-kube-prometheus-stac-operator.monitoring.svc:443/admission-prometheusrules/validate?timeout=10s": EOF && cannot patch "infra-kube-prometheus-stac-kube-apiserver-histogram.rules" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post "https://infra-kube-prometheus-stac-operator.monitoring.svc:443/admission-prometheusrules/validate?timeout=10s": dial tcp 10.240.28.152:443: connect: connection refused
  Normal  error  18m                    helm-controller  reconciliation failed: Helm upgrade failed: cannot patch "infra-kube-prometheus-stac-kube-apiserver-burnrate.rules" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post "https://infra-kube-prometheus-stac-operator.monitoring.svc:443/admission-prometheusrules/validate?timeout=10s": EOF && cannot patch "infra-kube-prometheus-stac-kube-apiserver-histogram.rules" with kind PrometheusRule: Internal error occurred: failed calling webhook "prometheusrulemutate.monitoring.coreos.com": Post "https://infra-kube-prometheus-stac-operator.monitoring.svc:443/admission-prometheusrules/validate?timeout=10s": dial tcp 10.240.28.152:443: connect: connection refused
  Normal  error  17m (x6 over 8d)       helm-controller  reconciliation failed: Operation cannot be fulfilled on helmreleases.helm.toolkit.fluxcd.io "infra": the object has been modified; please apply your changes to the latest version and try again
  Normal  error  14m (x386 over 5d16h)  helm-controller  reconciliation failed: upgrade retries exhausted
  Normal  error  7m5s (x19 over 12m)    helm-controller  reconciliation failed: upgrade retries exhausted
  Normal  info   2m28s (x2 over 4m11s)  helm-controller  Helm upgrade has started
  Normal  info   118s (x2 over 3m39s)   helm-controller  Helm upgrade succeeded

helm-controller logs:

❯ klf -n flux-system helm-controller-dc6ffd55b-rg6qk
{"level":"info","ts":"2021-09-30T08:01:18.602Z","logger":"controller-runtime.metrics","msg":"metrics server is starting to listen","addr":":8080"}
{"level":"info","ts":"2021-09-30T08:01:18.603Z","logger":"setup","msg":"starting manager"}
I0930 08:01:18.603524       6 leaderelection.go:243] attempting to acquire leader lease flux-system/helm-controller-leader-election...
{"level":"info","ts":"2021-09-30T08:01:18.603Z","msg":"starting metrics server","path":"/metrics"}
I0930 08:01:18.630310       6 leaderelection.go:253] successfully acquired lease flux-system/helm-controller-leader-election
{"level":"info","ts":"2021-09-30T08:01:18.704Z","logger":"controller.helmrelease","msg":"Starting EventSource","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","source":"kind source: /, Kind="}
{"level":"info","ts":"2021-09-30T08:01:18.704Z","logger":"controller.helmrelease","msg":"Starting EventSource","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","source":"kind source: /, Kind="}
{"level":"info","ts":"2021-09-30T08:01:18.704Z","logger":"controller.helmrelease","msg":"Starting Controller","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease"}
{"level":"info","ts":"2021-09-30T08:01:18.805Z","logger":"controller.helmrelease","msg":"Starting workers","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","worker count":4}
{"level":"info","ts":"2021-09-30T08:01:21.488Z","logger":"controller.helmrelease","msg":"reconcilation finished in 2.458244966s, next run in 5m0s","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"infra","namespace":"monitoring"}
{"level":"error","ts":"2021-09-30T08:01:21.488Z","logger":"controller.helmrelease","msg":"Reconciler error","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"infra","namespace":"monitoring","error":"upgrade retries exhausted"}
{"level":"info","ts":"2021-09-30T08:01:23.376Z","logger":"controller.helmrelease","msg":"reconcilation finished in 1.882877287s, next run in 5m0s","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"infra","namespace":"monitoring"}
{"level":"error","ts":"2021-09-30T08:01:23.376Z","logger":"controller.helmrelease","msg":"Reconciler error","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"infra","namespace":"monitoring","error":"upgrade retries exhausted"}
{"level":"info","ts":"2021-09-30T08:01:25.090Z","logger":"controller.helmrelease","msg":"reconcilation finished in 1.703418398s, next run in 5m0s","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"infra","namespace":"monitoring"}
{"level":"error","ts":"2021-09-30T08:01:25.090Z","logger":"controller.helmrelease","msg":"Reconciler error","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"infra","namespace":"monitoring","error":"upgrade retries exhausted"}
{"level":"info","ts":"2021-09-30T08:01:26.792Z","logger":"controller.helmrelease","msg":"reconcilation finished in 1.680926502s, next run in 5m0s","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"infra","namespace":"monitoring"}
{"level":"error","ts":"2021-09-30T08:01:26.792Z","logger":"controller.helmrelease","msg":"Reconciler error","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"infra","namespace":"monitoring","error":"upgrade retries exhausted"}
{"level":"info","ts":"2021-09-30T08:01:28.450Z","logger":"controller.helmrelease","msg":"reconcilation finished in 1.618160674s, next run in 5m0s","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"infra","namespace":"monitoring"}
{"level":"error","ts":"2021-09-30T08:01:28.450Z","logger":"controller.helmrelease","msg":"Reconciler error","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"infra","namespace":"monitoring","error":"upgrade retries exhausted"}
{"level":"info","ts":"2021-09-30T08:01:30.191Z","logger":"controller.helmrelease","msg":"reconcilation finished in 1.660457185s, next run in 5m0s","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"infra","namespace":"monitoring"}
{"level":"error","ts":"2021-09-30T08:01:30.191Z","logger":"controller.helmrelease","msg":"Reconciler error","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"infra","namespace":"monitoring","error":"upgrade retries exhausted"}
{"level":"info","ts":"2021-09-30T08:01:32.050Z","logger":"controller.helmrelease","msg":"reconcilation finished in 1.697716689s, next run in 5m0s","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"infra","namespace":"monitoring"}
{"level":"error","ts":"2021-09-30T08:01:32.050Z","logger":"controller.helmrelease","msg":"Reconciler error","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"infra","namespace":"monitoring","error":"upgrade retries exhausted"}
{"level":"info","ts":"2021-09-30T08:01:34.059Z","logger":"controller.helmrelease","msg":"reconcilation finished in 1.687846249s, next run in 5m0s","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"infra","namespace":"monitoring"}
{"level":"error","ts":"2021-09-30T08:01:34.059Z","logger":"controller.helmrelease","msg":"Reconciler error","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"infra","namespace":"monitoring","error":"upgrade retries exhausted"}
{"level":"info","ts":"2021-09-30T08:01:36.377Z","logger":"controller.helmrelease","msg":"reconcilation finished in 1.677888858s, next run in 5m0s","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"infra","namespace":"monitoring"}
{"level":"error","ts":"2021-09-30T08:01:36.377Z","logger":"controller.helmrelease","msg":"Reconciler error","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"infra","namespace":"monitoring","error":"upgrade retries exhausted"}
{"level":"error","ts":"2021-09-30T08:01:39.382Z","logger":"controller.helmrelease","msg":"unable to update status after reconciliation","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"infra","namespace":"monitoring","error":"Operation cannot be fulfilled on helmreleases.helm.toolkit.fluxcd.io \"infra\": the object has been modified; please apply your changes to the latest version and try again"}
{"level":"error","ts":"2021-09-30T08:01:39.382Z","logger":"controller.helmrelease","msg":"Reconciler error","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"infra","namespace":"monitoring","error":"Operation cannot be fulfilled on helmreleases.helm.toolkit.fluxcd.io \"infra\": the object has been modified; please apply your changes to the latest version and try again"}
{"level":"info","ts":"2021-09-30T08:01:41.124Z","logger":"controller.helmrelease","msg":"reconcilation finished in 1.742011934s, next run in 5m0s","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"infra","namespace":"monitoring"}
{"level":"error","ts":"2021-09-30T08:01:41.124Z","logger":"controller.helmrelease","msg":"Reconciler error","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"infra","namespace":"monitoring","error":"upgrade retries exhausted"}
{"level":"info","ts":"2021-09-30T08:01:43.644Z","logger":"controller.helmrelease","msg":"reconcilation finished in 1.702195992s, next run in 5m0s","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"infra","namespace":"monitoring"}
{"level":"error","ts":"2021-09-30T08:01:43.644Z","logger":"controller.helmrelease","msg":"Reconciler error","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"infra","namespace":"monitoring","error":"upgrade retries exhausted"}
{"level":"info","ts":"2021-09-30T08:01:55.508Z","logger":"controller.helmrelease","msg":"reconcilation finished in 1.622736012s, next run in 5m0s","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"infra","namespace":"monitoring"}
{"level":"error","ts":"2021-09-30T08:01:55.508Z","logger":"controller.helmrelease","msg":"Reconciler error","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"infra","namespace":"monitoring","error":"upgrade retries exhausted"}
{"level":"info","ts":"2021-09-30T08:02:17.669Z","logger":"controller.helmrelease","msg":"reconcilation finished in 1.680522601s, next run in 5m0s","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"infra","namespace":"monitoring"}
{"level":"error","ts":"2021-09-30T08:02:17.669Z","logger":"controller.helmrelease","msg":"Reconciler error","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"infra","namespace":"monitoring","error":"upgrade retries exhausted"}
{"level":"info","ts":"2021-09-30T08:03:00.294Z","logger":"controller.helmrelease","msg":"reconcilation finished in 1.664480571s, next run in 5m0s","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"infra","namespace":"monitoring"}
{"level":"error","ts":"2021-09-30T08:03:00.294Z","logger":"controller.helmrelease","msg":"Reconciler error","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"infra","namespace":"monitoring","error":"upgrade retries exhausted"}
{"level":"info","ts":"2021-09-30T08:04:23.880Z","logger":"controller.helmrelease","msg":"reconcilation finished in 1.665473894s, next run in 5m0s","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"infra","namespace":"monitoring"}
{"level":"error","ts":"2021-09-30T08:04:23.880Z","logger":"controller.helmrelease","msg":"Reconciler error","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"infra","namespace":"monitoring","error":"upgrade retries exhausted"}
{"level":"info","ts":"2021-09-30T08:04:46.197Z","logger":"controller.helmrelease","msg":"reconcilation finished in 1.637545948s, next run in 5m0s","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"infra","namespace":"monitoring"}
{"level":"error","ts":"2021-09-30T08:04:46.197Z","logger":"controller.helmrelease","msg":"Reconciler error","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"infra","namespace":"monitoring","error":"upgrade retries exhausted"}
{"level":"info","ts":"2021-09-30T08:05:52.746Z","logger":"controller.helmrelease","msg":"reconcilation finished in 1.681377434s, next run in 5m0s","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"infra","namespace":"monitoring"}
{"level":"error","ts":"2021-09-30T08:05:52.746Z","logger":"controller.helmrelease","msg":"Reconciler error","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"infra","namespace":"monitoring","error":"upgrade retries exhausted"}
{"level":"info","ts":"2021-09-30T08:07:09.465Z","logger":"controller.helmrelease","msg":"reconcilation finished in 1.74306092s, next run in 5m0s","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"infra","namespace":"monitoring"}
{"level":"error","ts":"2021-09-30T08:07:09.465Z","logger":"controller.helmrelease","msg":"Reconciler error","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"infra","namespace":"monitoring","error":"upgrade retries exhausted"}
{"level":"info","ts":"2021-09-30T08:10:35.591Z","logger":"controller.helmrelease","msg":"reconcilation finished in 34.25555514s, next run in 5m0s","reconciler group":"helm.toolkit.fluxcd.io","reconciler kind":"HelmRelease","name":"infra","namespace":"monitoring"}

helmrelease spec:

---
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: infra
  namespace: monitoring
spec:
  interval: 5m
  chart:
    spec:
      # renovate: registryUrl=https://prometheus-community.github.io/helm-charts
      chart: kube-prometheus-stack
      version: 19.0.1
      sourceRef:
        kind: HelmRepository
        name: prometheus-community
        namespace: flux-system
      interval: 1m
  install:
    crds: Create
  upgrade:
    crds: CreateReplace
  # valuesFrom:
  # - kind: Secret
  #   name: kube-prometheus-values
  #   # valuesKey: values.yaml
  values:
  ...

OS / Distro

Ubuntu 20.04

Flux version

0.17.2

Flux check

❯ flux check
► checking prerequisites
✔ kubectl 1.20.11 >=1.18.0-0
✔ Kubernetes 1.19.15 >=1.16.0-0
► checking controllers
✔ helm-controller: deployment ready
► ghcr.io/fluxcd/helm-controller:v0.11.2
✔ kustomize-controller: deployment ready
► ghcr.io/fluxcd/kustomize-controller:v0.14.1
✔ notification-controller: deployment ready
► ghcr.io/fluxcd/notification-controller:v0.16.0
✔ source-controller: deployment ready
► ghcr.io/fluxcd/source-controller:v0.15.4
✔ all checks passed

Git provider

No response

Container Registry provider

No response

Additional context

Maybe we can collect some kind of documentation on how to get out of this "upgrade exhausted" situation?

Code of Conduct

  • [X] I agree to follow this project's Code of Conduct

HaveFun83 avatar Sep 30 '21 08:09 HaveFun83

This also happens in v0.18.1

HaveFun83 avatar Oct 12 '21 11:10 HaveFun83

I've noticed that helm list does not show the release in question in that situation, but the pods are actually running.

Also, I tried removing the latest Helm secret for the release and then reconciling the HelmRelease, and the reconciliation was successful.

tbondarchuk avatar Oct 19 '21 13:10 tbondarchuk

@aliusmiles you can use helm ls -a to list all Helm releases in every state, not only those with status deployed.

HaveFun83 avatar Oct 19 '21 15:10 HaveFun83
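
To make the Helm-side state discussed above easier to inspect, here is a minimal shell sketch. It assumes Helm 3 with its default Secret-based storage, where each release revision is kept in a Secret labeled owner=helm and name=<release>. The function name show_release_state and the HELM_BIN and KUBECTL_BIN overrides are hypothetical, added only so the commands can be previewed with echo before running them against a cluster.

```shell
# show_release_state RELEASE NAMESPACE
# Lists releases in every state (not only "deployed"), then the Secrets
# that hold the revisions of RELEASE.
# Assumption: Helm 3 with the default Secret storage driver.
show_release_state() {
  release="$1"
  namespace="$2"
  helm_bin="${HELM_BIN:-helm}"          # hypothetical override, e.g. HELM_BIN=echo
  kubectl_bin="${KUBECTL_BIN:-kubectl}" # hypothetical override, e.g. KUBECTL_BIN=echo

  "$helm_bin" ls -a --namespace "$namespace" &&
    "$kubectl_bin" get secrets --namespace "$namespace" \
      --selector "owner=helm,name=$release"
}
```

For example, show_release_state infra monitoring would list the releases in the monitoring namespace in any state, followed by the revision Secrets of the infra release. Inspecting these first seems prudent before deleting any of them as described above.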

@HaveFun83 Thanks, now I've learned something new about helm:) Guess my previous comment should say: "helm release in question is not in deployed or failed state"

tbondarchuk avatar Oct 19 '21 15:10 tbondarchuk

Sorry it took a while for me to get to this; these were busy weeks for other parts of Flux, and then KubeCon.

When you say it worked up till 0.16.1, is this the CLI only, or does it include the helm-controller version that matches this release?

hiddeco avatar Oct 20 '21 09:10 hiddeco

Hello @hiddeco, no problem, it is not urgent and I know how busy KubeCon weeks are :beers: Thanks for having a look into it. The helm-controller is updated constantly. I tested it with flux CLI 0.16.1 but with the helm-controller version from the fluxcd v0.17.2 release; currently we are running fluxcd v0.18.3. It looks like either the way the flux CLI triggers HelmRelease reconciliations changed from v0.16.1 to v0.16.2 onward, or the way the helm-controller reacts to reconciliations triggered by different flux CLI versions changed.

HaveFun83 avatar Oct 20 '21 14:10 HaveFun83

Hello!

I can confirm this issue is still relevant for the latest version of the helm-controller. The workaround for now is this:

flux suspend hr <release_name>
flux resume hr <release_name>

It will reconcile broken release states such as "exhausted" and "another rollback/release is in progress". Works for me.

Hopefully this helps people facing the same issue.

vladimirjk avatar Nov 05 '21 20:11 vladimirjk
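
The suspend/resume workaround above can be wrapped in a small shell function. This is only a sketch: the function name retry_helmrelease and the FLUX_BIN override are hypothetical, and the explicit --namespace flag is an addition for HelmReleases that do not live in the CLI's default namespace.

```shell
# retry_helmrelease NAME NAMESPACE
# Suspends the HelmRelease and resumes it again, which in the reports above
# made the helm-controller retry a release stuck in "retries exhausted".
retry_helmrelease() {
  name="$1"
  namespace="$2"
  flux_bin="${FLUX_BIN:-flux}" # hypothetical override, e.g. FLUX_BIN=echo

  "$flux_bin" suspend helmrelease "$name" --namespace "$namespace" &&
    "$flux_bin" resume helmrelease "$name" --namespace "$namespace"
}
```

For the release from the original report this would be retry_helmrelease infra monitoring. Other participants in this thread report that suspend/resume did not help in their case, so it should be treated as something to try rather than a guaranteed fix.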

Thanks a lot for this workaround.

HaveFun83 avatar Nov 06 '21 21:11 HaveFun83

Actually this seems to be even worse currently. I actually can't even get the suspend/resume workaround to free up the resource (even if the resource is updated in the source).

siegenthalerroger avatar Jan 03 '22 18:01 siegenthalerroger

In my experience, the important setting to control (or problematic setting if you miss it) is spec.timeout

If you haven't set a value for spec.timeout you might have trouble diagnosing problematic HelmReleases. Historically they would fail to post errors as events because the helmrelease never timing out means the error is never formally raised in the Helm package that Helm Controller uses as upstream logic for its Helm-related activities. I'm not sure if that's still the case, but I still recommend setting spec.timeout to everyone as soon as they report trouble with Helm Controller because it makes the failure mode and behavior more predictable.

I'm not sure what happened in Flux 0.17 that might be relevant to this issue, but if you set spec.timeout to some reasonable value like 2m0s and wait at least that long, you should start to see errors that will lead you towards a solution. (The errors would generally appear in the kubectl describe HelmRelease output, listed as an event.)

If this does not immediately resolve your issue @siegenthalerroger maybe post the content of your HelmRelease and we can have a look at the details? Without more information, I'm afraid we won't be able to tell if this is the same issue or help much with finding out the root cause.

kingdonb avatar Jan 04 '22 13:01 kingdonb
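
As an illustration of the spec.timeout suggestion above, the following shell sketch sets the field on a live HelmRelease with a JSON merge patch. The function name set_hr_timeout and the KUBECTL_BIN override are hypothetical. If the HelmRelease is managed from Git, a live patch may be overwritten on the next sync, so the same field would normally go into the manifest instead.

```shell
# set_hr_timeout NAME NAMESPACE TIMEOUT
# Sets spec.timeout (for example 2m0s) on a HelmRelease via a merge patch,
# leaving all other fields of the object untouched.
set_hr_timeout() {
  name="$1"
  namespace="$2"
  hr_timeout="$3"
  kubectl_bin="${KUBECTL_BIN:-kubectl}" # hypothetical override, e.g. KUBECTL_BIN=echo

  # Build the JSON merge patch: {"spec":{"timeout":"<value>"}}
  patch="$(printf '{"spec":{"timeout":"%s"}}' "$hr_timeout")"

  "$kubectl_bin" patch helmrelease "$name" --namespace "$namespace" \
    --type merge --patch "$patch"
}
```

With the release from the original report, set_hr_timeout infra monitoring 2m0s would apply the value suggested above; any errors should then show up as events in kubectl describe output once the timeout has elapsed.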

Hi @kingdonb, to be clear it was my HelmRelease that was broken, it wasn't an issue with the helm-controller in any way. My issue is that once the HelmRelease is in the "upgrade retries exhausted" state, I have no way of getting it to try again when changed in the source without deleting the HelmRelease. The workaround above which I personally have used before, didn't work in my case anymore ^^.

Thanks for the tip about spec.timeout though, that'll prove useful when I'm debugging a different HelmRelease.

siegenthalerroger avatar Jan 06 '22 11:01 siegenthalerroger

Here is another suggestion; although I don't like it as much, it may also work for you:

When you need to trigger a new HelmRelease reconciliation after "upgrade retries exhausted" and you aren't in a position to run helm rollback or helm uninstall, try editing spec.values. This is one place where untyped values come in handy: you can invent a new value that doesn't mean anything, say spec.values.nonce, and just update it.

Helm does not type values.yaml so it has no way of knowing that change to nonce doesn't actually update anything when it is substituted into the templates, and it cannot know because there's no mechanism in helm to detect what types of changes are made by any post-install or post-upgrade hooks there might have been in any given Helm chart. (Any hooks might care about the value of nonce as they can be running processes that manipulate the state of the release in post.)

Helm will be forced to run the upgrade again each time you update the nonce value. Hope this helps as well!

kingdonb avatar Jan 06 '22 14:01 kingdonb
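
The nonce idea above could look roughly like this when applied to a live object. It is a sketch under assumptions: the function name bump_hr_nonce and the KUBECTL_BIN override are hypothetical, and nonce is the made-up value name from the comment, not something any chart defines. When the HelmRelease comes from Git, the same change would be committed to the manifest rather than patched in the cluster.

```shell
# bump_hr_nonce NAME NAMESPACE [NONCE]
# Writes a new value into spec.values.nonce so that the HelmRelease spec
# changes. A JSON merge patch only touches the nonce key and keeps the
# other entries under spec.values. NONCE defaults to the current Unix time.
bump_hr_nonce() {
  name="$1"
  namespace="$2"
  nonce="${3:-$(date +%s)}"
  kubectl_bin="${KUBECTL_BIN:-kubectl}" # hypothetical override, e.g. KUBECTL_BIN=echo

  patch="$(printf '{"spec":{"values":{"nonce":"%s"}}}' "$nonce")"

  "$kubectl_bin" patch helmrelease "$name" --namespace "$namespace" \
    --type merge --patch "$patch"
}
```

Calling bump_hr_nonce infra monitoring again later produces a different value each time, which is the point of the trick.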

I could have sworn I tried that but it seems to work now so not sure. What I know for sure is that downgrading the chart version did not save it, I had to delete the helm release and have the source controller recreate it.

siegenthalerroger avatar Jan 16 '22 22:01 siegenthalerroger

We've been seeing this behavior as well. Occasionally we'll have an HR fail for some reason or another (e.g. the service took too long to start up and we didn't have retries set accordingly). All I want is to be able to retry the HR once I've resolved the issue. Flux 0.16.1 allowed me to do that with flux reconcile hr, but with later versions flux reconcile hr appears to do nothing other than tell me it has failed. I have a copy of Flux 0.16.1 that I keep around for retrying the HRs, as it's the only way I'm aware of to do it without making some superfluous commits to our repo.

jmriebold avatar Jan 19 '22 03:01 jmriebold

I've had success working around this by doing a suspend followed by a resume.

ammmze avatar Jan 21 '22 02:01 ammmze

Also happening with Flux CLI version 0.24.1: flux reconcile hr <name> --> HelmRelease reconciliation failed: install retries exhausted

The workaround suggested above works: flux suspend hr <name> followed by flux resume hr <name>.

snukone avatar Feb 04 '22 15:02 snukone

@snukone The Flux 0.26.1 release out this week has lots of Helm updates that will make Helm fail less often, according to reports we've received.

I have heard mixed reports about whether suspend/resume will actually retry a failed HelmRelease that exhausted retries or not, it may depend on how it failed. I'd be surprised if install retries exhausted was solved that way in fact, since a failed install leaves a secret behind, and I think the secret records it as failed? I guess I'm in the minority here if this doesn't work for me.

In any case I think you'd have to configure remediationStrategy settings for your preferred number of retries, and/or remediation method. It sounds like the days are long gone when your best option was running helm uninstall and trying again. 👍

kingdonb avatar Feb 04 '22 15:02 kingdonb

Hi @kingdonb, all right, thanks a lot for your explanation. You're right, fixing errors with just uninstall + install was a long time ago ;) In our case, we sometimes have services based on Helm charts on dev environments which aren't important and haven't been used for a long time. That's when we ignore installation errors, because in 99% of cases they only occur because of outdated Helm chart versions (e.g. the image isn't available any more). Just knowing that suspending and resuming does the same as reconcile did in the past is OK for me, because on important environments, where the services have to run all the time, we initially get informed by alerting, and an "install retries exhausted" error could hardly happen.

snukone avatar Feb 07 '22 09:02 snukone

flux version 0.26.3

On the first install I get install retries exhausted; after that I get upgrade retries exhausted. I have tried the solutions above and none of them worked for me. flux suspend and resume did not work. I had added

  upgrade:
    remediation:
      remediateLastFailure: true

but that did not work either. Thanks to @onedr0p, I fixed it with helm uninstall <app>, followed by making a commit or running a reconcile. It should work then :)

Y0ngg4n avatar Feb 13 '22 20:02 Y0ngg4n
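
The uninstall-then-reconcile sequence described above can be sketched as follows. The function name reinstall_release and the HELM_BIN and FLUX_BIN overrides are hypothetical, and the sketch assumes the Helm release has the same name and namespace as the HelmRelease object, which is not true for every setup. Uninstalling removes the release's resources, so this implies downtime.

```shell
# reinstall_release NAME NAMESPACE
# Uninstalls the Helm release and then asks Flux to reconcile the
# HelmRelease so that a fresh install is attempted.
# Assumption: Helm release name and namespace match the HelmRelease object.
reinstall_release() {
  name="$1"
  namespace="$2"
  helm_bin="${HELM_BIN:-helm}" # hypothetical override, e.g. HELM_BIN=echo
  flux_bin="${FLUX_BIN:-flux}" # hypothetical override, e.g. FLUX_BIN=echo

  "$helm_bin" uninstall "$name" --namespace "$namespace" &&
    "$flux_bin" reconcile helmrelease "$name" --namespace "$namespace"
}
```

Setting HELM_BIN=echo and FLUX_BIN=echo before calling the function prints the two commands instead of running them, which is a cheap way to double-check the names before anything is removed.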

I've run into situations where helm uninstall ... or flux delete hr ... was the only way to resolve this issue as well, suspend/resume had no effect. Next time it happens I'll try to have more information. Seems like Flux gets stuck on trying to install or upgrade and only a fresh install of the helm release fixes it.

onedr0p avatar Feb 13 '22 21:02 onedr0p

@kingdonb where do I set spec.timeout? For me at least, every time I run into install retries exhausted, one workaround is to manually edit the HelmRelease:

kubectl edit hr <helmreleasename>

and manually add spec.timeout. Where can I set it so that a timeout is always set? I am creating my HelmReleases like this:

flux create hr myhr \
    --target-namespace=mynamespace \
    --chart=mychart \
    --source=gitrepository/mysource \
    --timeout 15m

However, --timeout 15m is not propagated to the HelmRelease spec; instead, it seems the timeout is passed to the helm command itself. Is the only way to set spec.timeout to use a YAML file instead of the flux CLI for creating a HelmRelease?

scardena avatar Mar 18 '22 06:03 scardena

I'm fairly sure that's correct: --timeout is a global option of the flux CLI and is not passed through to the HelmRelease.

kingdonb avatar Mar 18 '22 16:03 kingdonb

In my case, the HelmRelease used the wrong RBAC permissions and got stuck in install retries exhausted. After I fixed the RBAC, it did not try to install again. Suspend and resume gave it a kick, and the retried install succeeded. Maybe a new command is needed instead of this workaround?

tomhuang12 avatar Mar 22 '22 21:03 tomhuang12

Hey, I got exactly the same issue while upgrading kube-prometheus-stack.

The status was HelmRelease reconciliation failed: upgrade retries exhausted, and there was no other way to make progress than the suspend/resume workaround.

flux: v0.28.2
helm-controller: v0.18.1
kustomize-controller: v0.22.1
notification-controller: v0.23.1
source-controller: v0.22.2

kirek007 avatar Mar 25 '22 10:03 kirek007

(deleted) The underlying issue here is the error messaging. Users who experience this error have misconfigured a Helm chart.

doctorpangloss avatar May 09 '22 00:05 doctorpangloss

Hey folks,

Just wanted to echo the same here as I did in https://github.com/fluxcd/helm-controller/issues/149#issuecomment-1111860111. That message is from a little more than a week ago, and now, after #477, I am at the point of starting to rewrite the release logic. While doing this, I will take this long-standing issue into account and ensure it's covered by a regression test.

hiddeco avatar May 09 '22 10:05 hiddeco

Thanks! At the moment, the suspend/resume workaround seems to be the only option for releases up to v0.21.0.

rabbice avatar May 12 '22 13:05 rabbice

@rabbice suspend and resume does not work for me. I always need to delete the HelmRelease and reconcile again.

Y0ngg4n avatar May 12 '22 16:05 Y0ngg4n

It didn't work for me either. I'm still using an old copy of Flux 0.16.1 to manually trigger reconciliation of failed HRs. Removing the HR is not an option, as that would result in downtime.

jmriebold avatar May 12 '22 16:05 jmriebold

In my case, I didn't see any error message indicating that I was trying to use a service account in a different namespace. This interacts with eksctl, since on AWS you have to declare the service accounts in a meaningful way outside of your manifests. The issue is really the error messaging.

doctorpangloss avatar May 12 '22 17:05 doctorpangloss