
Upgrade from Flux 2.1.x to 2.2.2 leaves most HelmReleases in a broken state

Open wilmardo opened this issue 1 year ago • 12 comments

Describe the bug

It seems that after the upgrade some HelmReleases are migrated to the newer API object but some aren't. This won't go away unless the HelmRelease is removed or reconcile --force is used. The most obvious symptom is that the message in flux get hr doesn't show the new information. The more breaking problem is that dependencies aren't considered ready even when the dependency is Ready and its message shows Helm upgrade succeeded (see ingress-nginx and cert-manager in the output below, for example).

This seems pretty similar to what this PR is trying to solve: https://github.com/fluxcd/helm-controller/pull/850

That PR should be included in v2.2.2, but I am still seeing the issue there.

flux get hr output right after the upgrade:

# flux get hr
NAME                    REVISION                SUSPENDED       READY   MESSAGE
azure-workload-identity 1.1.0                   False           True    Helm upgrade succeeded
cert-exporter           3.4.1                   False           True    Helm upgrade succeeded for release guida-system/cert-exporter.v4 with chart [email protected]
cert-manager            v1.13.1                 False           True    Helm upgrade succeeded
flux                    2.12.2                  False           True    Helm upgrade succeeded for release flux-system/flux.v5 with chart [email protected]
helm-exporter           1.2.11+7a3ebb3          False           True    Helm upgrade succeeded
ingress-nginx           4.8.3                   False           False   dependency 'flux-system/cert-manager' is not ready
kyverno                 3.1.1                   False           True    Helm upgrade succeeded
prometheus-operator     51.2.0                  False           True    Helm upgrade succeeded for release guida-system/prometheus-operator.v3 with chart [email protected]
rbac-manager            1.17.6                  False           True    Helm upgrade succeeded
sealed-secrets          2.13.0                  False           True    Helm upgrade succeeded
velero                  5.0.2                   False           True    Helm upgrade succeeded

None of the releases showing Helm upgrade succeeded or dependency 'flux-system/xxx' is not ready will switch to the new message format without a reconcile --force or a deletion.
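To make the split between migrated and non-migrated releases easier to see, here is a minimal Python sketch that classifies the rows of a pasted flux get hr table. It rests on assumptions drawn only from the output above: a migrated release reports the longer "succeeded for release ... with chart ..." message, every row has a non-empty REVISION column, and the labels migrated, stale and not-ready are made up for this illustration. It is not part of the Flux CLI.

```python
import re

# Assumption: a migrated release reports the longer message that names the
# release and the chart; the short form is the one left over from before
# the upgrade.
NEW_MESSAGE = re.compile(
    r"^Helm (install|upgrade) succeeded for release \S+ with chart \S+"
)


def classify(flux_get_hr_output: str) -> dict[str, str]:
    """Map each HelmRelease name to 'migrated', 'stale' or 'not-ready'."""
    result = {}
    for line in flux_get_hr_output.splitlines():
        # Columns: NAME REVISION SUSPENDED READY MESSAGE.
        # The message may contain spaces, so split at most four times.
        fields = line.split(None, 4)
        if len(fields) < 5 or fields[0] == "NAME":
            continue  # skip the header, the pasted prompt line and blanks
        name, _revision, _suspended, ready, message = fields
        if ready != "True":
            result[name] = "not-ready"
        elif NEW_MESSAGE.match(message):
            result[name] = "migrated"
        else:
            result[name] = "stale"
    return result
```

Feeding it the table above would put cert-manager under stale and ingress-nginx under not-ready, which is the combination that blocks the dependency.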

I tried:

  • Adding an annotation and migrating to v2beta2 for all the HelmReleases as described here: https://github.com/fluxcd/helm-controller/pull/850#pullrequestreview-1782152094
apiVersion: helm.toolkit.fluxcd.io/v2beta2
kind: HelmRelease
metadata:
  annotations:
    fluxcd.io/upgradeTo: v2beta2

  • Enabled driftDetection on the HelmReleases after the above as suggested here: https://github.com/fluxcd/flux2/issues/4511#issuecomment-1867618282
  driftDetection:
    ignore:
    - paths:
      - /spec/replicas
      target:
        kind: Deployment
    mode: enabled

It would be extremely nice if the upgrade could be autonomous and did not require human intervention to run reconcile --force on all HelmReleases. The --force will also break things on some occasions (AWS with an NLB on a Service, for example).
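Until that is possible, the manual workaround can at least be made reviewable. The sketch below only builds the reconcile --force command lines and does not run them, so releases where --force is risky can be left out first. The function name, the skip parameter and the flux-system default namespace are assumptions for illustration.

```python
import shlex


def force_reconcile_commands(names, namespace="flux-system", skip=()):
    """Return one `flux reconcile helmrelease ... --force` command per release.

    Nothing is executed here; the caller decides what to run. Releases
    listed in `skip` (for example ones that own a Service behind an NLB)
    are left out.
    """
    return [
        shlex.join(
            ["flux", "reconcile", "helmrelease", name,
             "--namespace", namespace, "--force"]
        )
        for name in names
        if name not in skip
    ]
```

Printing the result and pasting the lines one at a time keeps a human in the loop, which seems safer given that a forced upgrade can itself cause breakage.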

Steps to reproduce

  1. Have Flux 2.1.1 running on the cluster with several HelmReleases
  2. Upgrade Flux to 2.2.2
  3. See the output of flux get hr with different messages and stuck dependencies

Expected behavior

All HelmReleases show the new message and are accepted as ready.

Screenshots and recordings

No response

OS / Distro

N/A

Flux version

v2.2.2

Flux check

► checking prerequisites
✔ Kubernetes 1.27.5+k3s1 >=1.26.0-0
► checking version in cluster
✔ distribution: flux-2.2.2
✔ bootstrapped: false
► checking controllers
✔ helm-controller: deployment ready
► ghcr.io/fluxcd/helm-controller:v0.37.2
✔ kustomize-controller: deployment ready
► ghcr.io/fluxcd/kustomize-controller:v1.2.1
✔ source-controller: deployment ready
► ghcr.io/fluxcd/source-controller:v1.2.3
► checking crds
✔ buckets.source.toolkit.fluxcd.io/v1beta2
✔ gitrepositories.source.toolkit.fluxcd.io/v1
✔ helmcharts.source.toolkit.fluxcd.io/v1beta2
✔ helmreleases.helm.toolkit.fluxcd.io/v2beta2
✔ helmrepositories.source.toolkit.fluxcd.io/v1beta2
✔ kustomizations.kustomize.toolkit.fluxcd.io/v1
✔ ocirepositories.source.toolkit.fluxcd.io/v1beta2
✔ all checks passed

Git provider

No response

Container Registry provider

No response

Additional context

Reconcile log of a 'stuck' HelmRelease:

{"level":"info","ts":"2024-01-03T15:34:41.074Z","msg":"HelmChart/flux-system/flux-system-azure-workload-identity with SourceRef 'HelmRepository/flux-system/guida-mirror' is in-sync","controller":"helmrelease","controllerGroup":"helm.toolkit.fluxcd.io","controllerKind":"HelmRelease","HelmRelease":{"name":"azure-workload-identity","namespace":"flux-system"},"namespace":"flux-system","name":"azure-workload-identity","reconcileID":"fc1561fa-9ef1-42b9-8578-21e9585b9ff4"}
{"level":"info","ts":"2024-01-03T15:34:41.299Z","msg":"release in-sync with desired state","controller":"helmrelease","controllerGroup":"helm.toolkit.fluxcd.io","controllerKind":"HelmRelease","HelmRelease":{"name":"azure-workload-identity","namespace":"flux-system"},"namespace":"flux-system","name":"azure-workload-identity","reconcileID":"fc1561fa-9ef1-42b9-8578-21e9585b9ff4"}

Reconcile log of an updated HelmRelease:

{"level":"info","ts":"2024-01-03T15:36:07.133Z","msg":"HelmChart/flux-system/flux-system-cert-exporter with SourceRef 'HelmRepository/flux-system/guida-mirror' is in-sync","controller":"helmrelease","controllerGroup":"helm.toolkit.fluxcd.io","controllerKind":"HelmRelease","HelmRelease":{"name":"cert-exporter","namespace":"flux-system"},"namespace":"flux-system","name":"cert-exporter","reconcileID":"5677d347-ebf3-48a5-8c98-3caa26b24dc9"}
{"level":"info","ts":"2024-01-03T15:36:07.348Z","msg":"release in-sync with desired state","controller":"helmrelease","controllerGroup":"helm.toolkit.fluxcd.io","controllerKind":"HelmRelease","HelmRelease":{"name":"cert-exporter","namespace":"flux-system"},"namespace":"flux-system","name":"cert-exporter","reconcileID":"5677d347-ebf3-48a5-8c98-3caa26b24dc9"}

All seems happy in the helm-controller to me :)

Code of Conduct

  • [X] I agree to follow this project's Code of Conduct

wilmardo avatar Jan 03 '24 15:01 wilmardo

Let me know if I can provide some more information. I can easily and consistently recreate this behavior on my local cluster.

wilmardo avatar Jan 03 '24 15:01 wilmardo

I can confirm that this also happened to me even with the 2.2.2 release.

siegenthalerroger avatar Jan 04 '24 12:01 siegenthalerroger

This might be related: https://github.com/fluxcd/flux2/issues/4529

wilmardo avatar Jan 09 '24 12:01 wilmardo

This is probably related:

# flux reconcile helmrelease rabbitmq
✗ failed to get API group resources: unable to retrieve the complete list of server APIs: helm.toolkit.fluxcd.io/v2beta2: the server could not find the requested resource

The thing is, my HelmRelease has apiVersion v2beta1, not v2beta2, and my check command does not even show v2beta2:

# flux check
► checking prerequisites
✔ Kubernetes 1.26.6+k3s-e18037a7-dirty >=1.26.0-0
► checking version in cluster
✔ distribution: flux-v2.1.0
✔ bootstrapped: true
► checking controllers
✔ helm-controller: deployment ready
► ghcr.io/fluxcd/helm-controller:v0.36.0
✔ notification-controller: deployment ready
► ghcr.io/fluxcd/notification-controller:v1.1.0
✔ kustomize-controller: deployment ready
► ghcr.io/fluxcd/kustomize-controller:v1.1.0
✔ source-controller: deployment ready
► ghcr.io/fluxcd/source-controller:v1.1.0
► checking crds
✔ alerts.notification.toolkit.fluxcd.io/v1beta2
✔ buckets.source.toolkit.fluxcd.io/v1beta2
✔ gitrepositories.source.toolkit.fluxcd.io/v1
✔ helmcharts.source.toolkit.fluxcd.io/v1beta2
✔ helmreleases.helm.toolkit.fluxcd.io/v2beta1
✔ helmrepositories.source.toolkit.fluxcd.io/v1beta2
✔ kustomizations.kustomize.toolkit.fluxcd.io/v1
✔ ocirepositories.source.toolkit.fluxcd.io/v1beta2
✔ providers.notification.toolkit.fluxcd.io/v1beta2
✔ receivers.notification.toolkit.fluxcd.io/v1
✔ all checks passed

Not sure why it says distribution-2.1.0, because I have:

# flux --version
flux version 2.2.2

After reverting to flux 2.1.0, everything works again.

razvanphp avatar Jan 14 '24 20:01 razvanphp

@razvanphp Your issue isn't related per se. You see that error because the controllers in your cluster are still at 2.1.0 while your CLI has been updated to 2.2.2. The 2.2.x CLI isn't backwards compatible with the 2.1.x release:

Yes, the CLI ensures backwards compatibility only for GA APIs, for beta versions you need a CLI that matches the cluster version. https://github.com/fluxcd/flux2/issues/4490#issuecomment-1858614360
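A quick way to spot this kind of mismatch is to compare the version printed by flux --version with the distribution line of flux check. Here is a rough Python sketch; the function names are made up, and treating "same major and minor version" as "matching" is my own simplification of the compatibility rule quoted above, not an official policy.

```python
import re


def major_minor(text: str) -> tuple[int, int]:
    """Extract (major, minor) from the first x.y.z version found in text."""
    match = re.search(r"(\d+)\.(\d+)\.\d+", text)
    if match is None:
        raise ValueError(f"no version found in {text!r}")
    return int(match.group(1)), int(match.group(2))


def cli_matches_cluster(flux_version_output: str, flux_check_output: str) -> bool:
    """Compare the CLI version with the 'distribution:' line of `flux check`."""
    for line in flux_check_output.splitlines():
        # Example line: "✔ distribution: flux-v2.1.0"
        if "distribution:" in line:
            return major_minor(flux_version_output) == major_minor(line)
    raise ValueError("no distribution line found in flux check output")
```

With the outputs posted earlier in this thread (CLI 2.2.2, distribution flux-v2.1.0) this returns False, which matches the error about v2beta2 not being served.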

wilmardo avatar Jan 15 '24 11:01 wilmardo

Can you please post the output of kubectl get hr -o yaml --show-managed-fields for cert-manager or any of the dependent HRs?

stefanprodan avatar Jan 18 '24 10:01 stefanprodan

@razvanphp Your issue isn't related per se. You see that error because the controllers in your cluster are still at 2.1.0 while your CLI has been updated to 2.2.2. The 2.2.x CLI isn't backwards compatible with the 2.1.x release:

Yes, the CLI ensures backwards compatibility only for GA APIs, for beta versions you need a CLI that matches the cluster version. #4490 (comment)

Indeed, thank you for your answer! Sorry for the noob question...

razvanphp avatar Jan 18 '24 21:01 razvanphp

@wilmardo can you please provide some detailed instructions to reproduce this issue? Based on your issue description, I tried a few things but couldn't reproduce it. Some detailed steps with example configuration or even a test repository with just the necessary configurations to help reproduce it would be very helpful.

darkowlzz avatar Jan 25 '24 16:01 darkowlzz

Yes! I will get back to this. I'm busy with other things at the moment and we have postponed this update for now. Hopefully I will have more time at the beginning of next week to gather info and reproduce the issue again.

@darkowlzz Will try to get something together but I don't know if it might be something very specific to our in-house stuff that is triggering this.

wilmardo avatar Jan 25 '24 16:01 wilmardo

This might be related although the issue is a bit vague: https://github.com/fluxcd/helm-controller/issues/891

wilmardo avatar Jan 30 '24 08:01 wilmardo

Hi, we got another report of a similar issue today on Slack, and it revealed some helpful hints about the issue. I put together a theory for what's causing this and some potential solutions. See https://github.com/fluxcd/helm-controller/pull/884 and https://github.com/fluxcd/helm-controller/pull/885 for details.

I can briefly explain the observations here too. The "dependency is not ready" message may not be the actual issue. It's more likely that reconciliation failed once with this error, and on a subsequent reconciliation it got past the dependency check, but the old Ready status persisted on the object while reconciliation entered a drift detection and correction loop because some other controller or entity in the cluster reverted or modified the configuration applied by the HelmRelease. https://github.com/fluxcd/helm-controller/issues/855 is an example of this situation and of how it can be handled using drift detection ignore rules. See https://fluxcd.io/flux/components/helm/helmreleases/#drift-detection for detailed docs. Another way to verify the issue is to look at the events and logs associated with the HelmRelease; they should mention the drift. Debug-level logging must be enabled to see the details of the detected drift, as described in the docs.
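For anyone sifting through the helm-controller logs by hand, a small Python sketch like the following can narrow them down to a single HelmRelease. It assumes the JSON line format shown in the issue description (the msg, ts and HelmRelease fields); the function name is made up, and the keyword filter is a plain substring match, since the exact wording of drift messages may differ between versions.

```python
import json


def release_messages(log_lines, name, namespace="flux-system", keyword=""):
    """Yield (ts, msg) pairs for one HelmRelease from JSON log lines.

    `keyword` is an optional case-insensitive substring filter on the
    message, for example "drift".
    """
    for line in log_lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # ignore anything that is not a JSON log entry
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue
        ref = entry.get("HelmRelease")
        if not isinstance(ref, dict):
            continue
        if ref.get("name") != name or ref.get("namespace") != namespace:
            continue
        msg = entry.get("msg", "")
        if keyword.lower() in msg.lower():
            yield entry.get("ts", ""), msg
```

The input can be any iterable of lines, for example a saved log file or sys.stdin when piping the controller logs into a script.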

I've shared some more details about my attempts to reproduce this issue in https://github.com/fluxcd/helm-controller/pull/885#issuecomment-1924934731. Based on that, I think the changes in https://github.com/fluxcd/helm-controller/pull/885 should make the situation better and surface the actual issue. It would be great if people who are facing this issue can try the preview image of that PR using

ghcr.io/fluxcd/helm-controller:preview-ac9e62ad@sha256:b3d9cc5e440f0b8ed83c1d5832c6f49a7e648f70e8093e85595902cd4891b9b3

It's an official preview image built using the Flux release infrastructure; see https://github.com/fluxcd/helm-controller/actions/runs/7762775568/job/21173786393.

The preview image can help surface the actual underlying issue. Once the drift issue is resolved, the helm-controller can be reverted to the previous version, as that works fine; it was only the status reporting that made things confusing.

darkowlzz avatar Feb 03 '24 00:02 darkowlzz

Hi, Flux v2.2.3 has been released with https://github.com/fluxcd/helm-controller/pull/884 to help with the issue reported here. Instead of the test image I shared in the last comment, please upgrade to Flux v2.2.3 and see if it helps surface the potential drift detection and correction issue described in detail above. The status won't mention drift explicitly yet, but it will show that the HelmRelease is being processed rather than in a failed state. Please check the events of the particular HelmRelease and the logs, as documented in https://fluxcd.io/flux/components/helm/helmreleases/#drift-detection, to see if a conflict in drift correction is preventing the release from completing successfully. In a future release, we may add an explicit message about drift correction, as described in https://github.com/fluxcd/helm-controller/pull/885.

darkowlzz avatar Feb 05 '24 15:02 darkowlzz