Upgrade from Flux 2.1.x to 2.2.2 leaves most HelmReleases in a broken state
Describe the bug
It seems that after the upgrade some HelmReleases are migrated to the newer API object but some aren't. This won't go away unless the HelmRelease is removed or reconcile --force is used.
The most obvious symptom is that the message in flux get hr doesn't show the new information. The more breaking problem is that dependencies aren't considered ready even when the dependency is Ready and its message shows Helm upgrade succeeded (see ingress-nginx and cert-manager in the output below, for example).
This seems pretty similar to what this PR is trying to solve: https://github.com/fluxcd/helm-controller/pull/850
That PR should be in v2.2.2, where I am still having this issue.
flux get hr output right after the upgrade:
# flux get hr
NAME REVISION SUSPENDED READY MESSAGE
azure-workload-identity 1.1.0 False True Helm upgrade succeeded
cert-exporter 3.4.1 False True Helm upgrade succeeded for release guida-system/cert-exporter.v4 with chart [email protected]
cert-manager v1.13.1 False True Helm upgrade succeeded
flux 2.12.2 False True Helm upgrade succeeded for release flux-system/flux.v5 with chart [email protected]
helm-exporter 1.2.11+7a3ebb3 False True Helm upgrade succeeded
ingress-nginx 4.8.3 False False dependency 'flux-system/cert-manager' is not ready
kyverno 3.1.1 False True Helm upgrade succeeded
prometheus-operator 51.2.0 False True Helm upgrade succeeded for release guida-system/prometheus-operator.v3 with chart [email protected]
rbac-manager 1.17.6 False True Helm upgrade succeeded
sealed-secrets 2.13.0 False True Helm upgrade succeeded
velero 5.0.2 False True Helm upgrade succeeded
None of the releases showing Helm upgrade succeeded or dependency 'flux-system/xxx' is not ready switch to the new message without a --force reconcile or a deletion.
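For anyone triaging the same situation, here is a minimal Python sketch that sorts the rows of a flux get hr table like the one above into releases that already show the new message, releases that still show the old one, and releases that are not ready. The regular expression is an assumption derived only from the messages visible in this issue, not from the helm-controller source, and the demo row in SAMPLE is made up for illustration.

```python
import re

# Assumption: the new-style message names the release and the chart,
# e.g. "... for release <ns>/<name>.v<N> with chart <chart>@<version>".
NEW_STYLE = re.compile(r"for release \S+/\S+\.v\d+ with chart \S+")


def classify(table: str) -> dict[str, list[str]]:
    """Group HelmRelease names by the kind of status message they show."""
    groups = {"migrated": [], "stale": [], "blocked": []}
    for line in table.strip().splitlines():
        if line.startswith("#") or line.startswith("NAME"):
            continue  # skip the prompt line and the header row
        fields = line.split(None, 4)
        if len(fields) < 5:
            continue  # not a complete table row
        name, _revision, _suspended, ready, message = fields
        if NEW_STYLE.search(message):
            groups["migrated"].append(name)
        elif ready == "False":
            groups["blocked"].append(name)
        else:
            groups["stale"].append(name)
    return groups


# Two rows copied from the table above plus one hypothetical "demo" row.
SAMPLE = """
NAME REVISION SUSPENDED READY MESSAGE
cert-manager v1.13.1 False True Helm upgrade succeeded
ingress-nginx 4.8.3 False False dependency 'flux-system/cert-manager' is not ready
demo 1.0.0 False True Helm upgrade succeeded for release demo-ns/demo.v2 with chart demo@1.0.0
"""

if __name__ == "__main__":
    for group, names in classify(SAMPLE).items():
        print(group, names)
```

Feeding it the real table should make it easy to see which releases would still need attention after the upgrade.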
I tried:
- Adding an annotation and migrating to v2beta2 for all the HelmReleases as described here: https://github.com/fluxcd/helm-controller/pull/850#pullrequestreview-1782152094
apiVersion: helm.toolkit.fluxcd.io/v2beta2
kind: HelmRelease
metadata:
  annotations:
    fluxcd.io/upgradeTo: v2beta2
- Enabled driftDetection on the HelmReleases after the above as suggested here: https://github.com/fluxcd/flux2/issues/4511#issuecomment-1867618282
driftDetection:
  ignore:
  - paths:
    - /spec/replicas
    target:
      kind: Deployment
  mode: enabled
It would be extremely nice if the upgrade could be autonomous and did not require human intervention to run reconcile --force on all HelmReleases. The --force will also break things on some occasions (AWS with an NLB on a Service, for example).
Steps to reproduce
- Have Flux 2.1.1 running on the cluster with several HelmReleases
- Upgrade Flux to 2.2.2
- See the output of flux get hr with different messages and stuck dependencies
Expected behavior
All HelmReleases show the new message and are accepted as ready.
Screenshots and recordings
No response
OS / Distro
N/A
Flux version
v2.2.2
Flux check
► checking prerequisites
✔ Kubernetes 1.27.5+k3s1 >=1.26.0-0
► checking version in cluster
✔ distribution: flux-2.2.2
✔ bootstrapped: false
► checking controllers
✔ helm-controller: deployment ready
► ghcr.io/fluxcd/helm-controller:v0.37.2
✔ kustomize-controller: deployment ready
► ghcr.io/fluxcd/kustomize-controller:v1.2.1
✔ source-controller: deployment ready
► ghcr.io/fluxcd/source-controller:v1.2.3
► checking crds
✔ buckets.source.toolkit.fluxcd.io/v1beta2
✔ gitrepositories.source.toolkit.fluxcd.io/v1
✔ helmcharts.source.toolkit.fluxcd.io/v1beta2
✔ helmreleases.helm.toolkit.fluxcd.io/v2beta2
✔ helmrepositories.source.toolkit.fluxcd.io/v1beta2
✔ kustomizations.kustomize.toolkit.fluxcd.io/v1
✔ ocirepositories.source.toolkit.fluxcd.io/v1beta2
✔ all checks passed
Git provider
No response
Container Registry provider
No response
Additional context
Reconcile log of a 'stuck' HelmRelease:
{"level":"info","ts":"2024-01-03T15:34:41.074Z","msg":"HelmChart/flux-system/flux-system-azure-workload-identity with SourceRef 'HelmRepository/flux-system/guida-mirror' is in-sync","controller":"helmrelease","controllerGroup":"helm.toolkit.fluxcd.io","controllerKind":"HelmRelease","HelmRelease":{"name":"azure-workload-identity","namespace":"flux-system"},"namespace":"flux-system","name":"azure-workload-identity","reconcileID":"fc1561fa-9ef1-42b9-8578-21e9585b9ff4"}
{"level":"info","ts":"2024-01-03T15:34:41.299Z","msg":"release in-sync with desired state","controller":"helmrelease","controllerGroup":"helm.toolkit.fluxcd.io","controllerKind":"HelmRelease","HelmRelease":{"name":"azure-workload-identity","namespace":"flux-system"},"namespace":"flux-system","name":"azure-workload-identity","reconcileID":"fc1561fa-9ef1-42b9-8578-21e9585b9ff4"}
Reconcile log of an updated HelmRelease:
{"level":"info","ts":"2024-01-03T15:36:07.133Z","msg":"HelmChart/flux-system/flux-system-cert-exporter with SourceRef 'HelmRepository/flux-system/guida-mirror' is in-sync","controller":"helmrelease","controllerGroup":"helm.toolkit.fluxcd.io","controllerKind":"HelmRelease","HelmRelease":{"name":"cert-exporter","namespace":"flux-system"},"namespace":"flux-system","name":"cert-exporter","reconcileID":"5677d347-ebf3-48a5-8c98-3caa26b24dc9"}
{"level":"info","ts":"2024-01-03T15:36:07.348Z","msg":"release in-sync with desired state","controller":"helmrelease","controllerGroup":"helm.toolkit.fluxcd.io","controllerKind":"HelmRelease","HelmRelease":{"name":"cert-exporter","namespace":"flux-system"},"namespace":"flux-system","name":"cert-exporter","reconcileID":"5677d347-ebf3-48a5-8c98-3caa26b24dc9"}
All seems happy in the helm-controller to me :)
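To compare the two logs without reading the raw JSON, a small Python sketch like the one below can pull out the release name and message of each entry. It only relies on the msg and HelmRelease.name fields visible in the log lines above; the sample entries are trimmed copies of those lines with the other fields left out.

```python
import json


def summarize(log_text: str) -> list[tuple[str, str]]:
    """Return (HelmRelease name, message) pairs from helm-controller JSON logs."""
    pairs = []
    for line in log_text.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue  # ignore anything that is not a JSON log entry
        entry = json.loads(line)
        release = entry.get("HelmRelease", {}).get("name", "<unknown>")
        pairs.append((release, entry.get("msg", "")))
    return pairs


# Trimmed copies of two entries from the logs above.
LOGS = """
{"level":"info","msg":"release in-sync with desired state","HelmRelease":{"name":"azure-workload-identity","namespace":"flux-system"}}
{"level":"info","msg":"release in-sync with desired state","HelmRelease":{"name":"cert-exporter","namespace":"flux-system"}}
"""

if __name__ == "__main__":
    for release, message in summarize(LOGS):
        print(f"{release}: {message}")
```

Both the stuck and the updated release report the same message, so these log lines alone do not tell the two states apart.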
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
Let me know if I can provide some more information. I can easily recreate this behavior on my local cluster consistently.
I can confirm that this also happened to me even with the 2.2.2 release.
This might be related: https://github.com/fluxcd/flux2/issues/4529
This is probably related:
# flux reconcile helmrelease rabbitmq
✗ failed to get API group resources: unable to retrieve the complete list of server APIs: helm.toolkit.fluxcd.io/v2beta2: the server could not find the requested resource
The thing is, my HelmRelease has apiVersion v2beta1, not v2beta2, and my check command does not even show v2beta2:
# flux check
► checking prerequisites
✔ Kubernetes 1.26.6+k3s-e18037a7-dirty >=1.26.0-0
► checking version in cluster
✔ distribution: flux-v2.1.0
✔ bootstrapped: true
► checking controllers
✔ helm-controller: deployment ready
► ghcr.io/fluxcd/helm-controller:v0.36.0
✔ notification-controller: deployment ready
► ghcr.io/fluxcd/notification-controller:v1.1.0
✔ kustomize-controller: deployment ready
► ghcr.io/fluxcd/kustomize-controller:v1.1.0
✔ source-controller: deployment ready
► ghcr.io/fluxcd/source-controller:v1.1.0
► checking crds
✔ alerts.notification.toolkit.fluxcd.io/v1beta2
✔ buckets.source.toolkit.fluxcd.io/v1beta2
✔ gitrepositories.source.toolkit.fluxcd.io/v1
✔ helmcharts.source.toolkit.fluxcd.io/v1beta2
✔ helmreleases.helm.toolkit.fluxcd.io/v2beta1
✔ helmrepositories.source.toolkit.fluxcd.io/v1beta2
✔ kustomizations.kustomize.toolkit.fluxcd.io/v1
✔ ocirepositories.source.toolkit.fluxcd.io/v1beta2
✔ providers.notification.toolkit.fluxcd.io/v1beta2
✔ receivers.notification.toolkit.fluxcd.io/v1
✔ all checks passed
Not sure why it says distribution-2.1.0, because I have:
# flux --version
flux version 2.2.2
After reverting to flux 2.1.0, everything works again.
@razvanphp Your issue isn't related per se. The error you see occurs because the controllers in your cluster are still at 2.1.0 while your CLI has been updated to 2.2.2. The 2.2.x CLI isn't backwards compatible with the 2.1.x release:
Yes, the CLI ensures backwards compatibility only for GA APIs, for beta versions you need a CLI that matches the cluster version. https://github.com/fluxcd/flux2/issues/4490#issuecomment-1858614360
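A quick way to spot this kind of mismatch is to compare the version printed by flux --version with the distribution line from flux check. The Python sketch below does that for the two strings shown earlier in this thread. Treating "matching" as "same major.minor series" is an assumption made for illustration, and the helper names are hypothetical, not part of the Flux CLI.

```python
import re


def minor_version(text: str) -> tuple[int, int]:
    """Extract (major, minor) from strings like 'flux version 2.2.2' or 'flux-v2.1.0'."""
    match = re.search(r"(\d+)\.(\d+)\.\d+", text)
    if match is None:
        raise ValueError(f"no semantic version found in {text!r}")
    return int(match.group(1)), int(match.group(2))


def cli_matches_cluster(cli_output: str, distribution: str) -> bool:
    """Compare the CLI version with the in-cluster distribution version."""
    # Assumption: "matching" means the same major.minor series.
    return minor_version(cli_output) == minor_version(distribution)


if __name__ == "__main__":
    cli = "flux version 2.2.2"
    cluster = "distribution: flux-v2.1.0"
    if not cli_matches_cluster(cli, cluster):
        print("CLI and cluster are on different minor versions")
```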
Can you please post here the output of kubectl get hr -o yaml --show-managed-fields for cert-manager or any of the dependent HRs?
Indeed, thank you for your answer! Sorry for the noob question...
@wilmardo can you please provide some detailed instructions to reproduce this issue? Based on your issue description, I tried a few things but couldn't reproduce it. Some detailed steps with example configuration or even a test repository with just the necessary configurations to help reproduce it would be very helpful.
Yes! Will get back to this. I'm busy with other things at the moment and we postponed this update for now. Hopefully at the beginning of next week I'll have more time to gather info and reproduce the issue again.
@darkowlzz Will try to get something together but I don't know if it might be something very specific to our in-house stuff that is triggering this.
This might be related although the issue is a bit vague: https://github.com/fluxcd/helm-controller/issues/891
Hi, we got another report of a similar issue today on Slack, and it revealed some helpful hints about the issue. I put together a theory for what's causing this and some potential solutions for it. Refer to https://github.com/fluxcd/helm-controller/pull/884 and https://github.com/fluxcd/helm-controller/pull/885 for details.
I can briefly explain the observations here too. The "dependency
I've shared some more details about my attempts to reproduce this issue in https://github.com/fluxcd/helm-controller/pull/885#issuecomment-1924934731. Based on that, I think the changes in https://github.com/fluxcd/helm-controller/pull/885 should make the situation better and surface the actual issue. It would be great if people who are facing this issue can try the preview image of that PR using
ghcr.io/fluxcd/helm-controller:preview-ac9e62ad@sha256:b3d9cc5e440f0b8ed83c1d5832c6f49a7e648f70e8093e85595902cd4891b9b3
It's an official preview image built using the Flux release infrastructure; refer to https://github.com/fluxcd/helm-controller/actions/runs/7762775568/job/21173786393.
The preview image can help surface the actual underlying issue. Once the drift issue is resolved, the helm-controller can be reverted to the previous version, as that works fine; only the status reporting made it confusing.
Hi, Flux v2.2.3 has been released with https://github.com/fluxcd/helm-controller/pull/884 to help with the issue reported here. Instead of the test image I shared in the last comment, please upgrade to Flux v2.2.3 and see if it helps surface the potential drift detection and correction issue as described in detail above. The status won't mention drift explicitly yet, but it will show that the HelmRelease is being processed rather than in a failed state. Please check the events of the particular HelmRelease and the logs, as documented in https://fluxcd.io/flux/components/helm/helmreleases/#drift-detection, to see whether a conflict in drift correction is preventing the release from completing successfully. In a future release, we may add an explicit message about drift correction as described in https://github.com/fluxcd/helm-controller/pull/885 .