flux2
Flux resume - HelmChart is not ready
Describe the bug
after updating HelmRelease spec.chart.spec.version, "flux resume helmrelease <name> -n <namespace>" fails with:
► resuming HelmRelease qa-2 in qa-2 namespace
✔ HelmRelease resumed
◎ waiting for HelmRelease reconciliation
✗ HelmChart 'flux-system/qa-2-qa-2' is not ready
I can see the HelmChart resource still being updated before the error message is returned.
A second run of flux resume helmrelease succeeds.
To Reproduce
Steps to reproduce the behaviour:
- deploy a flux HelmRelease manifest with suspend: true
- run "flux resume helmrelease <name> -n <namespace>"
- modify the helm chart version in the HelmRelease, commit to repo
- run "flux resume helmrelease <name> -n <namespace>" again
Expected behavior
flux resume should wait for the helm chart to become ready instead of returning an error
Additional context
Note: I am effectively using flux in 'manual' mode with suspend: true and using flux resume to deploy the releases.
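For context, a minimal sketch of the kind of suspended HelmRelease described above. The qa-2 names mirror the log output; the chart and repository names are placeholders I've invented, not taken from the issue:

```shell
# Sketch of a HelmRelease kept in "manual" mode via spec.suspend.
# qa-2 names come from the log above; my-app/my-repo are hypothetical.
manifest='apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: qa-2
  namespace: qa-2
spec:
  suspend: true              # no automatic reconciliation; deploy via "flux resume"
  interval: 5m
  chart:
    spec:
      chart: my-app          # hypothetical chart name
      version: "1.0.0"       # bump this and commit, then run "flux resume helmrelease"
      sourceRef:
        kind: HelmRepository
        name: my-repo        # hypothetical repository
        namespace: flux-system'
printf '%s\n' "$manifest"
```

Bumping spec.chart.spec.version in git and then resuming is what triggers the HelmChart update described in this issue.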
flux --version
flux version 0.5.4

flux check
► checking prerequisites
✔ kubectl 1.18.12 >=1.18.0
✔ Kubernetes 1.17.13 >=1.16.0
► checking controllers
✔ source-controller is healthy
► ghcr.io/fluxcd/source-controller:v0.7.4
✔ kustomize-controller is healthy
► ghcr.io/fluxcd/kustomize-controller:v0.7.4
✔ helm-controller is healthy
► ghcr.io/fluxcd/helm-controller:v0.6.1
✔ notification-controller is healthy
► ghcr.io/fluxcd/notification-controller:v0.7.1
✔ all checks passed
does "flux resume source chart qa-2-qa-2 -n flux-system" also need to be run before this command to ensure the HelmChart is resumed? I would have expected flux resume HelmRelease to do that...?
The suspend or resume command on the HelmRelease does not touch the underlying chart. You are also running an older version of flux, given we released 0.8.0 last Friday.
how should I get flux to update the helmchart resource? flux resume source chart just fetches the old version of the chart again...
The HelmChart is updated by the HelmRelease when the reconciliation for the resource is resumed. When this happens, the HelmChart temporarily becomes "not ready" until the new revision has been fetched, which is what you see happening.
I think your expectation of it not returning an error is valid, but will require some thinking about how to correctly detect and wait for the chart update.
ok thanks, currently working around this by running the resume HelmRelease command twice.
There are two ways this error can happen:
- HelmRepository is slow to reconcile, so the HelmChart is simply not ready by the time the first resume command gets around to the second phase of reconciling the HelmRelease
- HelmRelease is actually pointed at an incorrect chart ref, which will never reconcile a new HelmChart resource
Most of the time that I've seen this error it has been the first case: the HelmChart simply isn't ready yet, and it will be ready in a few seconds, so reconciling the helmrelease twice succeeds the second time.
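Since running the resume twice works around the first case, that workaround can be sketched as a small retry wrapper. The function name and retry count here are my own, not part of the flux CLI:

```shell
# retry: run a command up to N times, pausing between attempts.
# Usage sketch (requires a cluster): retry 3 flux resume helmrelease qa-2 -n qa-2
retry() {
  attempts=$1
  shift
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0            # command succeeded
    fi
    i=$((i + 1))
    sleep 1               # give the HelmChart a moment to become ready
  done
  return 1                # still failing: possibly a bad chart ref (case two)
}
```

If the command still fails after a few attempts, that points at the second case (an incorrect chart ref) rather than a slow HelmRepository.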
Does this issue still impact users? I used to get this report all the time, but I don't see it happening as often anymore. Maybe we've fixed it.
The second problem looks like the first to users, except that reconciling a second time does not succeed because the chartref is actually incorrect. That would be a separate issue. In any case I think this issue can be closed now. Is there any information I can add before closing it out? Thanks!
Is there any update about this?
@KishinNext What kind of update are you looking for exactly?
The way that I have handled this type of intermittent failure when I encounter it on my cluster is a two-pronged approach:
(1) Use the Alert.spec.exclusionList to ensure that messages representing a temporary failure are not escalated as Slack notices (we are not interested in them unless they persist for longer than 10 minutes or so). Here's my example:
https://github.com/kingdonb/bootstrap-repo/blob/980dc7ac5fc6447ac37d2c7c6bdfdee74f321d6e/clusters/moo-cluster/flux-system-extras/on-call-webapp-alert.yaml#L11-L12
(2) Use a Prometheus Alertmanager alert to notify when those temporary failure situations do not resolve themselves in a reasonable amount of time. (I followed the example alert at the end of the Monitoring with Prometheus guide, which presumes any Flux resources that do not have a Ready status are in a trouble condition and need attention when they have stayed that way for too long.)
I re-read the issue and I see that we're acknowledging there is an actual issue: we don't want to report errors when there really isn't one. But I wouldn't hold out hope that we're going to permanently solve it soon. There are good ways to handle this today. Please find me in #flux on CNCF Slack if you'd like to try the Prometheus method but need help. (It wasn't straightforward setting up Alertmanager, but I was able to figure it out.)
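For the Prometheus route mentioned above, the alert from the Monitoring with Prometheus guide is roughly shaped like the rule below. Metric and label names vary across Flux versions, so treat this as a sketch and check the guide for your release:

```shell
# Hypothetical sketch of the "not Ready for too long" alert rule discussed above.
# gotk_reconcile_condition is the metric exported by the GitOps Toolkit
# controllers; exact label names depend on the Flux version in use.
alert_rule='groups:
- name: flux
  rules:
  - alert: ReconciliationFailure
    expr: max(gotk_reconcile_condition{status="False",type="Ready"}) by (namespace, name) != 0
    for: 10m
    annotations:
      summary: "{{ $labels.name }} has not been Ready for more than 10 minutes"'
printf '%s\n' "$alert_rule"
```

The "for: 10m" clause is what implements the "longer than 10 minutes or so" threshold from point (1) above.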