Flux resume - HelmChart is not ready

Open iacou opened this issue 3 years ago • 8 comments

Describe the bug

after updating HelmRelease spec.chart.spec.version, "flux resume helmrelease -n " fails with:

► resuming HelmRelease qa-2 in qa-2 namespace
✔ HelmRelease resumed
◎ waiting for HelmRelease reconciliation
✗ HelmChart 'flux-system/qa-2-qa-2' is not ready

I can see the HelmChart resource still being updated before the error message is returned. A second run of flux resume helmrelease -n succeeds, as the HelmChart has had time to become ready.

To Reproduce

Steps to reproduce the behaviour:

  1. deploy flux HelmRelease manifest with suspend: true
  2. run "flux resume helmrelease -n "
  3. modify helm chart version in HelmRelease, commit to repo
  4. run "flux resume helmrelease -n "

Expected behavior

flux resume should wait for the helm chart to be ready and not error

Additional context

Note: I am effectively using flux in 'manual' mode with suspend: true and using flux resume to deploy the releases.

flux --version

flux version 0.5.4

► checking prerequisites
✔ kubectl 1.18.12 >=1.18.0
✔ Kubernetes 1.17.13 >=1.16.0
► checking controllers
✔ source-controller is healthy
► ghcr.io/fluxcd/source-controller:v0.7.4
✔ kustomize-controller is healthy
► ghcr.io/fluxcd/kustomize-controller:v0.7.4
✔ helm-controller is healthy
► ghcr.io/fluxcd/helm-controller:v0.6.1
✔ notification-controller is healthy
► ghcr.io/fluxcd/notification-controller:v0.7.1
✔ all checks passed

iacou avatar Feb 17 '21 11:02 iacou

does "flux resume source chart qa-2-qa-2 -n flux-system" also need to be run before this command to ensure the HelmChart is resumed? I would have expected flux resume HelmRelease to do that...?

iacou avatar Feb 17 '21 11:02 iacou

The suspend or resume command on the HelmRelease does not touch the underlying chart.

You are also running an older version of flux, given we released 0.8.0 last Friday.

hiddeco avatar Feb 17 '21 11:02 hiddeco

how should I get flux to update the helmchart resource? flux resume source chart just fetches the old version of the chart again...

iacou avatar Feb 17 '21 12:02 iacou

The HelmChart is updated by the HelmRelease when the reconciliation for the resource is resumed. When this happens, the HelmChart temporarily becomes "not ready" until the new revision has been fetched, which is what you see happen.

I think your expectation of it not returning an error is valid, but will require some thinking about how to correctly detect and wait for the chart update.

hiddeco avatar Feb 17 '21 12:02 hiddeco

ok thanks, currently working around this by running the resume HelmRelease command twice.
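
The "run it twice" workaround above could be wrapped in a small retry helper. This is only a sketch: `retry` is a hypothetical shell function, not part of the flux CLI, and the attempt count and delay are assumptions to be tuned to how long the HelmChart takes to become ready.

```shell
#!/usr/bin/env bash
# retry: hypothetical helper, not part of the flux CLI.
# Usage: retry <attempts> <delay-seconds> <command> [args...]
# Runs the command until it succeeds or the attempts are used up.
retry() {
  local attempts=$1 delay=$2
  shift 2
  local i
  for ((i = 1; i <= attempts; i++)); do
    # Stop as soon as the command succeeds.
    if "$@"; then
      return 0
    fi
    # Sleep only if another attempt is still to come.
    if ((i < attempts)); then
      sleep "$delay"
    fi
  done
  return 1
}

# Example, using the release name and namespace from the output above:
# retry 3 10 flux resume helmrelease qa-2 -n qa-2
```

Note that a retry only helps when the HelmChart is merely slow to become ready; if the chart reference is wrong, every attempt fails and the helper returns non-zero.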

iacou avatar Feb 17 '21 13:02 iacou

There are two ways this error can happen:

  1. HelmRepository is slow to reconcile, so the HelmChart is not yet ready when the first resume command reaches its second phase, reconciling the HelmRelease
  2. HelmRelease is actually pointed at an incorrect chart ref, which will never reconcile a new HelmChart resource

Most of the times I've seen this error it has been the first case: the HelmChart simply isn't ready yet and will be within a few seconds, so reconciling the HelmRelease a second time succeeds.

Does this issue still impact users? I used to get this report all the time, but I don't see it happening as often anymore. Maybe we've fixed it.

The second problem looks like the first to users, except that reconciling a second time does not succeed because the chartref is actually incorrect. That would be a separate issue. In any case I think this issue can be closed now. Is there any information I can add before closing it out? Thanks!

kingdonb avatar Nov 17 '21 13:11 kingdonb

Is there any update about this?

KishinNext avatar Jan 18 '22 02:01 KishinNext

@KishinNext What kind of update are you looking for exactly?

The way that I have handled this type of intermittent failure when I encounter it on my cluster is a two-pronged approach:

(1) Use the Alert.spec.exclusionList to ensure that messages representing a temporary failure are not escalated as slack notices (we are not interested in them unless they persist for longer than 10 minutes or so) - here's my example:

https://github.com/kingdonb/bootstrap-repo/blob/980dc7ac5fc6447ac37d2c7c6bdfdee74f321d6e/clusters/moo-cluster/flux-system-extras/on-call-webapp-alert.yaml#L11-L12

(2) Use a Prometheus AlertManager alert to notify when those temporary failure situations do not resolve themselves in a reasonable amount of time. (I followed the example alert at the end of the Monitoring with Prometheus guide, which presumes any Flux resources that do not have a Ready status are in a trouble condition and need attention when they stayed that way for too long.)
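
For approach (1), an Alert with an exclusion list might look roughly like the fragment below. It is a sketch rather than a copy of the linked file: the resource name, the provider name `slack`, and the regular expression are all assumptions, and the `v1beta1` API version should be checked against the notification-controller version installed on the cluster.

```yaml
apiVersion: notification.toolkit.fluxcd.io/v1beta1
kind: Alert
metadata:
  name: helmrelease-alerts      # hypothetical name
  namespace: flux-system
spec:
  providerRef:
    name: slack                 # assumes a Provider named "slack" exists
  eventSeverity: error
  eventSources:
    - kind: HelmRelease
      name: '*'
  # Regular expressions; events whose message matches are not forwarded.
  # Assumes the transient failure message contains "is not ready".
  exclusionList:
    - ".*is not ready.*"
```

Because this drops every matching message, including ones for failures that never recover, it is best paired with the time-based alert described in (2).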

I re-read the issue and I see that we're acknowledging there is an actual issue, that we don't want to report errors when there really isn't an error, but I wouldn't hold out hope that we're going to permanently solve it soon. There are good ways to resolve this now. Please find me in #flux on CNCF slack if you'd like to try the prometheus method but need help. (It wasn't straightforward setting up alertmanager, but I was able to figure it out...)

kingdonb avatar Jan 18 '22 13:01 kingdonb