HelmCharts resources stuck in READY state false after 0.28.0 upgrade

Open jhaumont opened this issue 3 years ago • 10 comments
Describe the bug

Hello

Context

After a long migration from Flux v1 to Flux v2, we are currently catching up on Flux versions, from 0.27.1 to 0.32.0 (to be up to date 🚀). Along the way, we have to handle the source-controller API version change introduced in 0.28.0.

A small disclaimer about our setup: for source-controller, we use a PVC to keep the artifact cache when the pod restarts. Why? Because in production, we have more than 300 gitrepositories + helmrepositories + 300 helmcharts. Without the cache, the pod takes a long time to fetch all the artifacts (more than 5 minutes), which triggers alerts because the HelmReleases are in a bad status.

What happened

We have an issue with HelmChart resources when we deploy this specific version: they switch from READY True to False and never return to a stable state. The only fix we have found is to delete the HelmChart and reconcile the parent HelmRelease. Other resources (gitrepo, helmrepo) are not affected.

During our investigation, we decided to destroy the PVC and start source-controller on a fresh one. And the magic happened: the READY state of the HelmCharts switched back to True!

So we suspect an issue with the data stored on the PVC for HelmCharts when the CRD version changed.

Steps to reproduce

  1. Install flux version 0.27.1 (on kubernetes 1.22.6 for example)
  2. Set up a PVC to cache the source-controller data

PVC YAML definition

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: flux-source-controller-cache
  namespace: flux-system
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  storageClassName: pd-ssd-retain

Kustomization patch to add the pvc to source controller

- op: replace
  path: /spec/template/spec/volumes/0
  value:
    name: data
    persistentVolumeClaim:
      claimName: flux-source-controller-cache

  3. Create a HelmRelease for the podinfo app
  4. Migrate Flux to version 0.28.0

Expected behavior

HelmChart objects on Kubernetes should have READY state True after the migration to 0.28.0.

Screenshots and recordings

No response

OS / Distro

N/A

Flux version

0.28.0

Flux check

► checking prerequisites
✗ flux 0.29.0 <0.32.0 (new version is available, please upgrade)
✔ Kubernetes 1.22.6 >=1.20.6-0
► checking controllers
✔ helm-controller: deployment ready
► ghcr.io/fluxcd/helm-controller:v0.18.0
✔ image-automation-controller: deployment ready
► ghcr.io/fluxcd/image-automation-controller:v0.21.0
✔ image-reflector-controller: deployment ready
► ghcr.io/fluxcd/image-reflector-controller:v0.17.0
✔ kustomize-controller: deployment ready
► ghcr.io/fluxcd/kustomize-controller:v0.22.0
✔ notification-controller: deployment ready
► ghcr.io/fluxcd/notification-controller:v0.23.0
✔ source-controller: deployment ready
► ghcr.io/fluxcd/source-controller:v0.22.1
✔ all checks passed

Git provider

No response

Container Registry provider

No response

Additional context

No response

Code of Conduct

  • [X] I agree to follow this project's Code of Conduct

jhaumont avatar Aug 24 '22 14:08 jhaumont

Can you please bump the source objects in your repo to v1beta2 and upgrade to the latest Flux?

stefanprodan avatar Aug 25 '22 18:08 stefanprodan

Hello @stefanprodan, thanks for your answer 😃

We tested your solution:

  • Upgrade to 0.28
  • Upgrade all CRDs to v1beta2
  • Upgrade to 0.32.0

Sadly, there is no improvement in the HelmChart status: they are still READY=False.

But now we have errors in the source-controller logs!

Example:

{"level":"info","ts":"2022-08-30T08:29:54.335Z","logger":"controller.helmchart","msg":"artifact up-to-date with remote revision: '2.1.7'","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"HelmChart","name":"ep-sealed-secrets","namespace":"ep","reconcileID":"70415523-9018-4fec-88c2-f99c76281348"}
{"level":"error","ts":"2022-08-30T08:29:54.336Z","logger":"controller.helmchart","msg":"Reconciler error","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"HelmChart","name":"ep-sealed-secrets","namespace":"ep","error":"pulled 'sealed-secrets' chart with version '2.1.7'"}

So, like the first time, to fix it we had to delete the PVC and PV (to erase the cache).
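
To pull entries like these out of a larger source-controller log, here is a minimal Python sketch. It assumes the log is one JSON object per line with the field names seen above (`level`, `reconciler kind`, `name`, `namespace`, `error`); the function name `helmchart_errors` is made up for this example. Note that the `error` field of the line above carries what reads like a success message.

```python
import json

# The two log lines quoted above, verbatim.
SAMPLE_LOG = r'''
{"level":"info","ts":"2022-08-30T08:29:54.335Z","logger":"controller.helmchart","msg":"artifact up-to-date with remote revision: '2.1.7'","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"HelmChart","name":"ep-sealed-secrets","namespace":"ep","reconcileID":"70415523-9018-4fec-88c2-f99c76281348"}
{"level":"error","ts":"2022-08-30T08:29:54.336Z","logger":"controller.helmchart","msg":"Reconciler error","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"HelmChart","name":"ep-sealed-secrets","namespace":"ep","error":"pulled 'sealed-secrets' chart with version '2.1.7'"}
'''


def helmchart_errors(log_lines):
    """Return (namespace, name, error) for each error-level HelmChart entry."""
    found = []
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON log line
        if not isinstance(entry, dict):
            continue
        if entry.get("level") != "error":
            continue
        if entry.get("reconciler kind") != "HelmChart":
            continue
        found.append((entry.get("namespace"), entry.get("name"), entry.get("error")))
    return found


for namespace, name, error in helmchart_errors(SAMPLE_LOG.splitlines()):
    print(f"{namespace}/{name}: {error}")
```

The same function can be fed the lines of a saved log file instead of the sample.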

jhaumont avatar Aug 30 '22 08:08 jhaumont

Hi, can you share the full status of the HelmChart object? The status should have more information about why it's not ready. You can redact any sensitive information in there before sharing. Ready status will also have reasons for why it's not ready and must be accompanied by some other status conditions that are the actual reasons for the failure, like Reconciling, FetchFailed, Stalled, etc. There are some examples of the status in https://fluxcd.io/docs/components/source/helmcharts/#describe-the-helmchart that'll help understand the situation better.

darkowlzz avatar Aug 30 '22 09:08 darkowlzz

I tried this myself.

Created a new cluster (K8s v1.24.0) and installed flux v0.28.0 with the PVC patch. SC deployment event that shows the version (v0.22.1):

[event: pod flux-system/source-controller-5d895589f7-bg8zw] Pulling image "fluxcd/source-controller:v0.22.1"

PVC:

$ kubectl get pvc -A
NAMESPACE     NAME                           STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
flux-system   flux-source-controller-cache   Bound    pvc-a0856a3a-0b50-490f-9ee2-71202ecde400   10Gi       RWO            standard       7m7s

Pod description with PVC attached:

Volumes:
  data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  flux-source-controller-cache
    ReadOnly:   false

Created HelmRepository and HelmRelease for podinfo.

HelmChart status before upgrade:

  conditions:
  - lastTransitionTime: "2022-08-30T22:48:06Z"
    message: pulled 'podinfo' chart with version '4.0.6'
    observedGeneration: 1
    reason: ChartPullSucceeded
    status: "True"
    type: Ready
  observedChartName: podinfo
  observedGeneration: 1
  observedSourceArtifactRevision: 1df61314875caa093830692c7b4f8c6d10cd5f9f5fafa46e926144e5d55ebac2
  url: http://source-controller.flux-system.svc.cluster.local./helmchart/default/default-podinfo/latest.tar.gz

Upgraded to flux 0.32.0 (SC 0.26.1) by applying the new install manifest from github release artifacts.

HelmChart status after upgrade:

  conditions:
  - lastTransitionTime: "2022-08-30T22:48:06Z"
    message: pulled 'podinfo' chart with version '4.0.6'
    observedGeneration: 1
    reason: ChartPullSucceeded
    status: "True"
    type: Ready
  - lastTransitionTime: "2022-08-30T22:50:01Z"
    message: pulled 'podinfo' chart with version '4.0.6'
    observedGeneration: 1
    reason: ChartPullSucceeded
    status: "True"
    type: ArtifactInStorage
  observedChartName: podinfo
  observedGeneration: 1
  observedSourceArtifactRevision: 1df61314875caa093830692c7b4f8c6d10cd5f9f5fafa46e926144e5d55ebac2
  url: http://source-controller.flux-system.svc.cluster.local./helmchart/default/default-podinfo/latest.tar.gz

Since we added new conditions after SC 0.22.1, there's a new condition in the status, but everything seems to be fine.

Flux 0.28.0 shipped the new source-controller, whose status contains helpful information about the reasons for a failure. If the HelmChart isn't ready, the status will have more information about why it's not ready.

Maybe something is missing in the steps to reproduce other than the k8s version.

darkowlzz avatar Aug 30 '22 23:08 darkowlzz

Hello

Good to hear, and thanks for testing this. So with new flux version & new conditions, maybe we don't face the issue again when we use a PVC to keep the cache. Is that correct?

We started the upgrade from version 0.27.1 (source-controller 0.21.2), not from 0.28.0, with the CRDs (gitrepo, helmrepo, helmchart) at v1beta1. Maybe the issue comes from this.

We have one last environment left to upgrade. I will send you the status of the HelmChart objects after the upgrade to 0.28.0.

jhaumont avatar Aug 31 '22 12:08 jhaumont

So with new flux version & new conditions, maybe we don't face the issue again when we use a PVC to keep the cache. Is that correct?

@jhaumont I don't think the status conditions have any relation to the PVC, and we don't do anything specific to PVCs. It should just work. Without knowing the full content of your Ready=False status, we can't draw any conclusion about it.

darkowlzz avatar Sep 02 '22 06:09 darkowlzz

In the latest version of Flux, as @darkowlzz is trying to convey, there is enhanced information in the Status field where it should provide a reason that tells why it is not ready.

You will need to kubectl describe the HelmChart resource to get the full detail about the events and conditions, but it should be possible even in the Status detail to see exactly what has gone wrong. Are you able to find that detail?

kingdonb avatar Sep 07 '22 13:09 kingdonb

Hello. Yes, we were just waiting for the migration of our prod cluster to give you the information.

The output of kubectl get helmchart ep-podinfo -n flux-system:

NAME         CHART        VERSION   SOURCE KIND     SOURCE NAME   AGE    READY   STATUS
ep-podinfo   ep/podinfo   *         GitRepository   helm-charts   138d   False   packaged 'podinfo' chart with version '3.1.2+9715ce0bac8b'

The output of kubectl describe helmchart ep-podinfo -n flux-system:

Name:         ep-podinfo
Namespace:    flux-system
Labels:       <none>
Annotations:  <none>
API Version:  source.toolkit.fluxcd.io/v1beta1
Kind:         HelmChart
Metadata:
  Creation Timestamp:  2022-04-22T13:41:59Z
  Finalizers:
    finalizers.fluxcd.io
  Generation:        1
  Resource Version:  1599023395
  UID:               6cc4cb1b-7c11-4f7b-9fb7-0a390b988b2b
Spec:
  Chart:               ep/podinfo
  Interval:            1m0s
  Reconcile Strategy:  Revision
  Source Ref:
    Kind:   GitRepository
    Name:   helm-charts
  Version:  *
Status:
  Artifact:
    Checksum:          a013b377e8df10653490d4e518db84dc6f0e1faa4a2473b77930559ef258cd53
    Last Update Time:  2022-09-07T15:03:33Z
    Path:              helmchart/flux-system/ep-podinfo/podinfo-3.1.2+9715ce0bac8b.tgz
    Revision:          3.1.2+9715ce0bac8b
    Size:              4774
    URL:               http://source-controller.flux-system.svc.cluster.local./helmchart/flux-system/ep-podinfo/podinfo-3.1.2+9715ce0bac8b.tgz
  Conditions:
    Last Transition Time:             2022-09-07T15:04:57Z
    Message:                          packaged 'podinfo' chart with version '3.1.2+9715ce0bac8b'
    Observed Generation:              1
    Reason:                           NewChart
    Status:                           False
    Type:                             Ready
    Last Transition Time:             2022-09-07T15:04:57Z
    Message:                          packaged 'podinfo' chart with version '3.1.2+9715ce0bac8b'
    Observed Generation:              1
    Reason:                           NewChart
    Status:                           True
    Type:                             ArtifactOutdated
  Observed Chart Name:                podinfo
  Observed Generation:                1
  Observed Source Artifact Revision:  master/9715ce0bac8bbc2392304706290b613dc0fde444
  URL:                                http://source-controller.flux-system.svc.cluster.local./helmchart/flux-system/ep-podinfo/latest.tar.gz
Events:
  Type    Reason                      Age                From               Message
  ----    ------                      ----               ----               -------
  Normal  GarbageCollectionSucceeded  16s                source-controller  garbage collected old artifacts

And the output of the same command after we removed the PVC and PV to force source-controller to download the charts again:

Name:         ep-podinfo
Namespace:    flux-system
Labels:       <none>
Annotations:  <none>
API Version:  source.toolkit.fluxcd.io/v1beta2
Kind:         HelmChart
Metadata:
  Creation Timestamp:  2022-07-07T14:03:35Z
  Finalizers:
    finalizers.fluxcd.io
  Generation:        1
  Resource Version:  2601899554
  UID:               d7fb188a-9eeb-4f58-9aa7-a143a8fb896b
Spec:
  Chart:               ep/podinfo
  Interval:            1m0s
  Reconcile Strategy:  Revision
  Source Ref:
    Kind:   GitRepository
    Name:   helm-charts
  Version:  *
Status:
  Artifact:
    Checksum:          06ccc64907d7d27f49736f58a43721c02ebcc342db586e5d70550fffe5b5cf10
    Last Update Time:  2022-09-07T15:09:30Z
    Path:              helmchart/flux-system/ep-podinfo/podinfo-3.1.2+9715ce0bac8b.tgz
    Revision:          3.1.2+9715ce0bac8b
    Size:              4774
    URL:               http://source-controller.flux-system.svc.cluster.local./helmchart/flux-system/ep-podinfo/podinfo-3.1.2+9715ce0bac8b.tgz
  Conditions:
    Last Transition Time:             2022-09-07T15:09:30Z
    Message:                          packaged 'podinfo' chart with version '3.1.2+9715ce0bac8b'
    Observed Generation:              1
    Reason:                           ChartPackageSucceeded
    Status:                           True
    Type:                             Ready
    Last Transition Time:             2022-09-07T15:22:34Z
    Message:                          packaged 'podinfo' chart with version '3.1.2+9715ce0bac8b'
    Observed Generation:              1
    Reason:                           ChartPackageSucceeded
    Status:                           True
    Type:                             ArtifactInStorage
  Observed Chart Name:                podinfo
  Observed Generation:                1
  Observed Source Artifact Revision:  master/9715ce0bac8bbc2392304706290b613dc0fde444
  URL:                                http://source-controller.flux-system.svc.cluster.local./helmchart/flux-system/ep-podinfo/latest.tar.gz
Events:
  Type    Reason                      Age                  From               Message
  ----    ------                      ----                 ----               -------
  Normal  GarbageCollectionSucceeded  21m                  source-controller  garbage collected old artifacts
  Normal  NoSourceArtifact            17m (x2 over 17m)    source-controller  no artifact available for GitRepository source 'helm-charts'
  Normal  ChartPackageSucceeded       16m                  source-controller  packaged 'podinfo' chart with version '3.1.2+9715ce0bac8b'
  Normal  GarbageCollectionSucceeded  15m                  source-controller  garbage collected old artifacts
  Normal  ArtifactUpToDate            40s (x4 over 3m40s)  source-controller  artifact up-to-date with remote revision: '3.1.2+9715ce0bac8b'
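
Reading the two outputs side by side, the difference is in the conditions: before the PVC was removed, Ready is False with reason NewChart and ArtifactOutdated is True; afterwards, Ready is True with reason ChartPackageSucceeded and ArtifactInStorage is True. A minimal Python sketch of that check follows. It assumes the conditions have been exported as JSON (for example from `kubectl get helmchart ... -o json`, under `status.conditions`); the helper name `explain_ready` is made up for this example.

```python
MESSAGE = "packaged 'podinfo' chart with version '3.1.2+9715ce0bac8b'"

# Conditions copied from the first describe output (PVC still attached).
BEFORE = [
    {"type": "Ready", "status": "False", "reason": "NewChart", "message": MESSAGE},
    {"type": "ArtifactOutdated", "status": "True", "reason": "NewChart", "message": MESSAGE},
]

# Conditions copied from the second describe output (PVC and PV removed).
AFTER = [
    {"type": "Ready", "status": "True", "reason": "ChartPackageSucceeded", "message": MESSAGE},
    {"type": "ArtifactInStorage", "status": "True", "reason": "ChartPackageSucceeded", "message": MESSAGE},
]


def explain_ready(conditions):
    """Summarize the Ready condition and list the other conditions that are True."""
    by_type = {c["type"]: c for c in conditions}
    ready = by_type.get("Ready")
    if ready is None:
        return {"ready": None, "reason": None, "message": None, "also_true": []}
    also_true = sorted(
        t for t, c in by_type.items() if t != "Ready" and c["status"] == "True"
    )
    return {
        "ready": ready["status"] == "True",
        "reason": ready.get("reason"),
        "message": ready.get("message"),
        "also_true": also_true,
    }


for label, conditions in (("before", BEFORE), ("after", AFTER)):
    print(label, explain_ready(conditions))
```

In the "before" case the Ready message describes a successful packaging while the status is False, which is the inconsistency this issue is about.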

jhaumont avatar Sep 07 '22 15:09 jhaumont

We have likely overlooked something in the reconciliation logic when we optimized things to use (much) less memory. Due to the presence of an existing artifact (from the PV), the reconciler now short-circuits prematurely, resulting in incomplete status information (which should not happen).
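
As a purely illustrative toy (this is not source-controller code, and every name in it is hypothetical), the kind of failure described above can be sketched in Python: a reconcile pass marks the object as in progress, then returns early because the artifact is already in storage, so the final Ready=True summary is never written.

```python
REVISION = "3.1.2+9715ce0bac8b"


def reconcile(status, storage, revision, short_circuit):
    """Toy reconcile pass over a status dict and a storage dict (both hypothetical)."""
    # Step 1: a build is observed, so the object is marked as in progress.
    status["conditions"] = {"Ready": "False", "ArtifactOutdated": "True"}

    if storage.get("revision") == revision:
        if short_circuit:
            # Early return: the artifact already exists, so the code below
            # never runs and the in-progress conditions are left in place.
            return status
    else:
        # Artifact is missing from storage: write it.
        storage["revision"] = revision

    # Step 2: final summary of the pass.
    status["conditions"] = {"Ready": "True", "ArtifactInStorage": "True"}
    return status


# Artifact already present (as with a retained PV): stuck at Ready=False.
print(reconcile({}, {"revision": REVISION}, REVISION, short_circuit=True))

# Empty storage (PV erased): the full path runs and Ready becomes True.
print(reconcile({}, {}, REVISION, short_circuit=True))

# Without the early return, an existing artifact is harmless.
print(reconcile({}, {"revision": REVISION}, REVISION, short_circuit=False))
```

Under these assumptions the toy also matches the observed workaround: with empty storage the early return is never taken, so the status is summarized normally.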

@darkowlzz will try to reproduce this, and see where it goes wrong. However, erasing the PV in production and making the controller fetch the artifacts again might be faster than his investigation and waiting for a patch. This is however up to you :-).

hiddeco avatar Sep 08 '22 18:09 hiddeco

@hiddeco totally agree, and this is what we do in production :) I opened the ticket to make sure this does not happen again in a future Flux upgrade 🤞

jhaumont avatar Sep 08 '22 18:09 jhaumont