HelmChart resources stuck in READY state False after 0.28.0 upgrade
Describe the bug
Hello
Context
After a long migration from Flux v1 to Flux v2, we are currently upgrading Flux from 0.27.1 to 0.32.0 (to be up to date 🚀).
As part of this, we have to handle the source-controller API version change introduced in 0.28.0.
A small disclaimer about our setup: for the source controller, we use a PVC to persist the artifact cache across pod restarts. Why? Because in production we have more than 300 GitRepositories and HelmRepositories, plus 300 HelmCharts. Without the cache, the pod takes a long time to fetch all the artifacts (more than 5 minutes), which triggers alerts because the HelmReleases are in a bad status in the meantime.
What happened
We have an issue with the HelmChart Kubernetes resources when we deploy this specific version: they switch from READY state True to False and never come back to a stable state. The only fix we have found is to delete the HelmChart and reconcile the parent HelmRelease. With the other resources (GitRepository, HelmRepository), we have no issues.
During our investigation, we decided to destroy the PVC so that source-controller starts on a fresh one. And the magic happened: the READY state of the HelmCharts switched back to True!
So we suspect an issue with the HelmChart data stored on the PVC when the CRD version changed.
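For reference, the cached data can be inspected in place before wiping anything. A rough sketch, assuming the default source-controller storage path of /data and a shell in the image (true for the Alpine-based images of this era):

# List the packaged chart artifacts source-controller has cached on the PVC
kubectl -n flux-system exec deploy/source-controller -- ls -R /data/helmchart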
Steps to reproduce
- Install Flux version 0.27.1 (on Kubernetes 1.22.6, for example)
- Set up a PVC to cache the source-controller artifact data. PVC YAML definition:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: flux-source-controller-cache
  namespace: flux-system
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  storageClassName: pd-ssd-retain
Kustomization patch to add the PVC to the source-controller Deployment:
- op: replace
  path: /spec/template/spec/volumes/0
  value:
    name: data
    persistentVolumeClaim:
      claimName: flux-source-controller-cache
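For completeness, a rough sketch of how this patch can be wired into the flux-system kustomization.yaml (file names assume a standard bootstrap layout):

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - gotk-components.yaml
  - gotk-sync.yaml
patches:
  # the JSON6902 patch above, applied to the source-controller Deployment
  - target:
      kind: Deployment
      name: source-controller
    patch: |
      - op: replace
        path: /spec/template/spec/volumes/0
        value:
          name: data
          persistentVolumeClaim:
            claimName: flux-source-controller-cache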
- Create a HelmRelease for the podinfo app (see the sketch after this list)
- Upgrade Flux to version 0.28.0
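As referenced in the steps above, a minimal podinfo setup could look like this (a sketch: the chart repository URL is the well-known upstream podinfo one; names and intervals are illustrative):

apiVersion: source.toolkit.fluxcd.io/v1beta1
kind: HelmRepository
metadata:
  name: podinfo
  namespace: flux-system
spec:
  interval: 1m
  url: https://stefanprodan.github.io/podinfo
---
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: podinfo
  namespace: flux-system
spec:
  interval: 1m
  chart:
    spec:
      chart: podinfo
      sourceRef:
        kind: HelmRepository
        name: podinfo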
Expected behavior
HelmChart objects on Kubernetes should have a READY state of True after the 0.28.0 migration.
Screenshots and recordings
No response
OS / Distro
N/A
Flux version
0.28.0
Flux check
► checking prerequisites
✗ flux 0.29.0 <0.32.0 (new version is available, please upgrade)
✔ Kubernetes 1.22.6 >=1.20.6-0
► checking controllers
✔ helm-controller: deployment ready
► ghcr.io/fluxcd/helm-controller:v0.18.0
✔ image-automation-controller: deployment ready
► ghcr.io/fluxcd/image-automation-controller:v0.21.0
✔ image-reflector-controller: deployment ready
► ghcr.io/fluxcd/image-reflector-controller:v0.17.0
✔ kustomize-controller: deployment ready
► ghcr.io/fluxcd/kustomize-controller:v0.22.0
✔ notification-controller: deployment ready
► ghcr.io/fluxcd/notification-controller:v0.23.0
✔ source-controller: deployment ready
► ghcr.io/fluxcd/source-controller:v0.22.1
✔ all checks passed
Git provider
No response
Container Registry provider
No response
Additional context
No response
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
Can you please bump the source objects in your repo to v1beta2 and upgrade to the latest Flux?
Hello @stefanprodan, thanks for your answer 😃
We tested your solution:
- Upgrade to 0.28
- Bump all source objects (GitRepository, HelmRepository, HelmChart) to v1beta2 (see the sketch after this list)
- Upgrade to 0.32.0
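For reference, the bump itself can be done mechanically across a repo; a rough sketch, assuming GNU sed and the manifests living under the current directory:

# Rewrite the source API version in every YAML manifest (make a backup or rely on git)
grep -rl 'source.toolkit.fluxcd.io/v1beta1' . --include='*.yaml' \
  | xargs sed -i 's|source.toolkit.fluxcd.io/v1beta1|source.toolkit.fluxcd.io/v1beta2|g'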
Sadly, no improvement in the HelmChart status: they are still in READY=False.
But now we have errors in the source-controller logs!
Example:
{"level":"info","ts":"2022-08-30T08:29:54.335Z","logger":"controller.helmchart","msg":"artifact up-to-date with remote revision: '2.1.7'","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"HelmChart","name":"ep-sealed-secrets","namespace":"ep","reconcileID":"70415523-9018-4fec-88c2-f99c76281348"}
{"level":"error","ts":"2022-08-30T08:29:54.336Z","logger":"controller.helmchart","msg":"Reconciler error","reconciler group":"source.toolkit.fluxcd.io","reconciler kind":"HelmChart","name":"ep-sealed-secrets","namespace":"ep","error":"pulled 'sealed-secrets' chart with version '2.1.7'"}
So, like the first time, to fix it we had to delete the PVC & PV (to erase the cache).
Hi, can you share the full status of the HelmChart object? The status should have more information about why it's not ready; you can redact any sensitive information before sharing. The Ready condition will be accompanied by other status conditions that carry the actual reasons for the failure, like Reconciling, FetchFailed, Stalled, etc. There are some examples of such statuses in https://fluxcd.io/docs/components/source/helmcharts/#describe-the-helmchart that'll help understand the situation better.
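For example, something like this would dump the full object (name and namespace taken from the log lines you shared above):

kubectl -n ep get helmchart ep-sealed-secrets -o yaml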
I tried this myself.
Created a new cluster (K8s v1.24.0) and installed Flux v0.28.0 with the PVC patch. SC deployment event showing the image version (v0.22.1):
[event: pod flux-system/source-controller-5d895589f7-bg8zw] Pulling image "fluxcd/source-controller:v0.22.1"
PVC:
$ kubectl get pvc -A
NAMESPACE NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
flux-system flux-source-controller-cache Bound pvc-a0856a3a-0b50-490f-9ee2-71202ecde400 10Gi RWO standard 7m7s
Pod description with PVC attached:
Volumes:
  data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  flux-source-controller-cache
    ReadOnly:   false
Created HelmRepository and HelmRelease for podinfo.
HelmChart status before upgrade:
conditions:
- lastTransitionTime: "2022-08-30T22:48:06Z"
  message: pulled 'podinfo' chart with version '4.0.6'
  observedGeneration: 1
  reason: ChartPullSucceeded
  status: "True"
  type: Ready
observedChartName: podinfo
observedGeneration: 1
observedSourceArtifactRevision: 1df61314875caa093830692c7b4f8c6d10cd5f9f5fafa46e926144e5d55ebac2
url: http://source-controller.flux-system.svc.cluster.local./helmchart/default/default-podinfo/latest.tar.gz
Upgraded to Flux 0.32.0 (SC 0.26.1) by applying the new install manifest from the GitHub release artifacts.
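Roughly, using the install manifest attached to the release (the URL follows the standard flux2 release layout):

kubectl apply -f https://github.com/fluxcd/flux2/releases/download/v0.32.0/install.yaml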
HelmChart status after upgrade:
conditions:
- lastTransitionTime: "2022-08-30T22:48:06Z"
  message: pulled 'podinfo' chart with version '4.0.6'
  observedGeneration: 1
  reason: ChartPullSucceeded
  status: "True"
  type: Ready
- lastTransitionTime: "2022-08-30T22:50:01Z"
  message: pulled 'podinfo' chart with version '4.0.6'
  observedGeneration: 1
  reason: ChartPullSucceeded
  status: "True"
  type: ArtifactInStorage
observedChartName: podinfo
observedGeneration: 1
observedSourceArtifactRevision: 1df61314875caa093830692c7b4f8c6d10cd5f9f5fafa46e926144e5d55ebac2
url: http://source-controller.flux-system.svc.cluster.local./helmchart/default/default-podinfo/latest.tar.gz
Since new conditions were added after SC v0.22.1, there's a new condition in the status, but everything seems to be fine.
Flux 0.28.0 shipped the new source-controller, which records more helpful information about the reasons for a failure. If the HelmChart isn't ready, the status will say why.
Maybe something is missing in the steps to reproduce other than the k8s version.
Hello
Good news! Thanks for your test. So with the new Flux version and the new conditions, maybe we won't face the issue again when we use a PVC to keep the cache. Is that correct?
We started the upgrade from version 0.27.1 (source-controller v0.21.2), not from 0.28.0, with the custom resources (GitRepository, HelmRepository, HelmChart) still on v1beta1. Maybe the issue comes from this.
We have one last environment left in our Flux upgrade. I will send you the status of the HelmChart objects after its upgrade to 0.28.0.
So with the new Flux version and the new conditions, maybe we won't face the issue again when we use a PVC to keep the cache. Is that correct?
@jhaumont I don't think the status conditions have any relation to the PVC, and we don't do anything specific to PVCs; it should just work. Without knowing the full value of your Ready=False status, we can't draw any conclusion about it.
In the latest version of Flux, as @darkowlzz is trying to convey, there is enhanced information in the Status field where it should provide a reason that tells why it is not ready.
You will need to kubectl describe the HelmChart resource to get the full detail about the events and conditions, but it should be possible even in the Status detail to see exactly what has gone wrong. Are you able to find that detail?
Hello. Yes, we were just waiting for the migration of our prod cluster to give you the information.
The output of kubectl get helmchart ep-podinfo -n flux-system:
NAME CHART VERSION SOURCE KIND SOURCE NAME AGE READY STATUS
ep-podinfo ep/podinfo * GitRepository helm-charts 138d False packaged 'podinfo' chart with version '3.1.2+9715ce0bac8b'
The output of kubectl describe helmchart ep-podinfo -n flux-system:
Name:         ep-podinfo
Namespace:    flux-system
Labels:       <none>
Annotations:  <none>
API Version:  source.toolkit.fluxcd.io/v1beta1
Kind:         HelmChart
Metadata:
  Creation Timestamp:  2022-04-22T13:41:59Z
  Finalizers:
    finalizers.fluxcd.io
  Generation:        1
  Resource Version:  1599023395
  UID:               6cc4cb1b-7c11-4f7b-9fb7-0a390b988b2b
Spec:
  Chart:               ep/podinfo
  Interval:            1m0s
  Reconcile Strategy:  Revision
  Source Ref:
    Kind:  GitRepository
    Name:  helm-charts
  Version:  *
Status:
  Artifact:
    Checksum:          a013b377e8df10653490d4e518db84dc6f0e1faa4a2473b77930559ef258cd53
    Last Update Time:  2022-09-07T15:03:33Z
    Path:              helmchart/flux-system/ep-podinfo/podinfo-3.1.2+9715ce0bac8b.tgz
    Revision:          3.1.2+9715ce0bac8b
    Size:              4774
    URL:               http://source-controller.flux-system.svc.cluster.local./helmchart/flux-system/ep-podinfo/podinfo-3.1.2+9715ce0bac8b.tgz
  Conditions:
    Last Transition Time:  2022-09-07T15:04:57Z
    Message:               packaged 'podinfo' chart with version '3.1.2+9715ce0bac8b'
    Observed Generation:   1
    Reason:                NewChart
    Status:                False
    Type:                  Ready
    Last Transition Time:  2022-09-07T15:04:57Z
    Message:               packaged 'podinfo' chart with version '3.1.2+9715ce0bac8b'
    Observed Generation:   1
    Reason:                NewChart
    Status:                True
    Type:                  ArtifactOutdated
  Observed Chart Name:                podinfo
  Observed Generation:                1
  Observed Source Artifact Revision:  master/9715ce0bac8bbc2392304706290b613dc0fde444
  URL:                                http://source-controller.flux-system.svc.cluster.local./helmchart/flux-system/ep-podinfo/latest.tar.gz
Events:
  Type    Reason                      Age   From               Message
  ----    ------                      ----  ----               -------
  Normal  GarbageCollectionSucceeded  16s   source-controller  garbage collected old artifacts
And the same command's output after we removed the PVC & PV to force source-controller to download the charts again:
Name:         ep-podinfo
Namespace:    flux-system
Labels:       <none>
Annotations:  <none>
API Version:  source.toolkit.fluxcd.io/v1beta2
Kind:         HelmChart
Metadata:
  Creation Timestamp:  2022-07-07T14:03:35Z
  Finalizers:
    finalizers.fluxcd.io
  Generation:        1
  Resource Version:  2601899554
  UID:               d7fb188a-9eeb-4f58-9aa7-a143a8fb896b
Spec:
  Chart:               ep/podinfo
  Interval:            1m0s
  Reconcile Strategy:  Revision
  Source Ref:
    Kind:  GitRepository
    Name:  helm-charts
  Version:  *
Status:
  Artifact:
    Checksum:          06ccc64907d7d27f49736f58a43721c02ebcc342db586e5d70550fffe5b5cf10
    Last Update Time:  2022-09-07T15:09:30Z
    Path:              helmchart/flux-system/ep-podinfo/podinfo-3.1.2+9715ce0bac8b.tgz
    Revision:          3.1.2+9715ce0bac8b
    Size:              4774
    URL:               http://source-controller.flux-system.svc.cluster.local./helmchart/flux-system/ep-podinfo/podinfo-3.1.2+9715ce0bac8b.tgz
  Conditions:
    Last Transition Time:  2022-09-07T15:09:30Z
    Message:               packaged 'podinfo' chart with version '3.1.2+9715ce0bac8b'
    Observed Generation:   1
    Reason:                ChartPackageSucceeded
    Status:                True
    Type:                  Ready
    Last Transition Time:  2022-09-07T15:22:34Z
    Message:               packaged 'podinfo' chart with version '3.1.2+9715ce0bac8b'
    Observed Generation:   1
    Reason:                ChartPackageSucceeded
    Status:                True
    Type:                  ArtifactInStorage
  Observed Chart Name:                podinfo
  Observed Generation:                1
  Observed Source Artifact Revision:  master/9715ce0bac8bbc2392304706290b613dc0fde444
  URL:                                http://source-controller.flux-system.svc.cluster.local./helmchart/flux-system/ep-podinfo/latest.tar.gz
Events:
  Type    Reason                      Age                  From               Message
  ----    ------                      ----                 ----               -------
  Normal  GarbageCollectionSucceeded  21m                  source-controller  garbage collected old artifacts
  Normal  NoSourceArtifact            17m (x2 over 17m)    source-controller  no artifact available for GitRepository source 'helm-charts'
  Normal  ChartPackageSucceeded       16m                  source-controller  packaged 'podinfo' chart with version '3.1.2+9715ce0bac8b'
  Normal  GarbageCollectionSucceeded  15m                  source-controller  garbage collected old artifacts
  Normal  ArtifactUpToDate            40s (x4 over 3m40s)  source-controller  artifact up-to-date with remote revision: '3.1.2+9715ce0bac8b'
We have likely overlooked something in the reconciliation logic when we optimized things to use (much) less memory: the presence of an existing artifact (from the PV) now causes the reconciler to short-circuit prematurely, resulting in incomplete status information (which should not happen).
@darkowlzz will try to reproduce this and see where it goes wrong. However, erasing the PV in production and making the controller fetch the artifacts again might be faster than his investigation and waiting for a patch. This is however up to you :-).
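For reference, the workaround discussed in this thread amounts to something like the following sketch (the PVC name comes from the reproduction steps above; the bound PV name differs per cluster):

# Stop source-controller so the volume can be released
kubectl -n flux-system scale deploy/source-controller --replicas=0
# Wipe the cache by deleting the claim (and the PV, if its reclaim policy is Retain)
kubectl -n flux-system delete pvc flux-source-controller-cache
kubectl delete pv <pv-name>
# Recreate the PVC from the manifest above, then bring the controller back
kubectl -n flux-system scale deploy/source-controller --replicas=1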
@hiddeco Totally agree, and this is what we do in production :) I opened the ticket to make sure this won't happen again in a future Flux upgrade 🤞