flux2
flux2 copied to clipboard
flux reconcile kustomization CLI command times out, interval reconcile runs 20x slower than it used to, but reconcile helmrelease & source cli commands work fine
Describe the bug
We recently migrated to the AWS VPC CNI from weave-net on our EKS cluster, and the only thing not working now is the CLI flux reconcile kustomization command. It gets so far as annotating the kustomization and then just hangs for as long as the timeout is set. I've tried up to a 60 minute timeout and get the same result. It previously would reconcile within 15 seconds.
Running flux reconcile helmrelease or flux reconcile source git both work within seconds. We're not seeing anything useful in the logs to indicate what the issue could be. Additionally, the flux interval based kustomization reconciliation runs, but a lot slower than it used to.
I've also confirmed that I can git clone the repo it's supposed to reconcile from from within an alpine based pod in the flux namespace.
Steps to reproduce
- Install flux on an EKS cluster running the AWS VPC CNI
- Create a flux kustomization resource
- Run flux reconcile kustomization
--timeout 60m - See error
> flux reconcile kustomization <name> --timeout 60m
► annotating Kustomization <name> in flux namespace
✔ Kustomization annotated
◎ waiting for Kustomization reconciliation
client rate limiter Wait returned an error: context deadline exceeded
Expected behavior
Reconciles within seconds
Screenshots and recordings
No response
OS / Distro
Amazon Linux 2 (EKS latest recommended AMI)
Flux version
flux: v0.41.2
Flux check
flux check -n flux ► checking prerequisites ✗ flux 0.41.2 <2.0.0-rc.5 (new version is available, please upgrade) ✔ Kubernetes 1.23.17-eks-0a21954 >=1.20.6-0 ► checking controllers ✔ helm-controller: deployment ready ► ghcr.io/fluxcd/helm-controller:v0.31.2 ✔ kustomize-controller: deployment ready ► ghcr.io/fluxcd/kustomize-controller:v0.35.1 ✔ notification-controller: deployment ready ► ghcr.io/fluxcd/notification-controller:v0.33.0 ✔ source-controller: deployment ready ► ghcr.io/fluxcd/source-controller:v0.36.1 ► checking crds ✔ alerts.notification.toolkit.fluxcd.io/v1beta1 ✔ buckets.source.toolkit.fluxcd.io/v1beta2 ✔ gitrepositories.source.toolkit.fluxcd.io/v1beta2 ✔ helmcharts.source.toolkit.fluxcd.io/v1beta2 ✔ helmreleases.helm.toolkit.fluxcd.io/v2beta1 ✔ helmrepositories.source.toolkit.fluxcd.io/v1beta2 ✔ kustomizations.kustomize.toolkit.fluxcd.io/v1beta2 ✔ ocirepositories.source.toolkit.fluxcd.io/v1beta2 ✔ providers.notification.toolkit.fluxcd.io/v1beta1 ✔ receivers.notification.toolkit.fluxcd.io/v1beta1 ✔ all checks passed
Git provider
Github
Container Registry provider
No response
Additional context
flux logs command returns a blank line and nothing else. (also tried flux logs -n flux & flux logs -A to no avail)
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
what do you see when you run flux get ks <name>
what do you see when you run
flux get ks <name>
➜ ~ flux -n flux get ks gitops-deploy
NAME REVISION SUSPENDED READY MESSAGE
gitops-deploy main@sha1:cd18a932 False Unknown Reconciliation in progress
➜ ~ k get -n flux ks gitops-deploy
NAME AGE READY STATUS
gitops-deploy 5h38m Unknown Reconciliation in progress
Can you check the events of the kustomization and logs of the kustomize-controller
flux events --for Kustomization/<name>
kubectl logs -f deploy/kustomize-controller
Can you check the events of the kustomization and logs of the kustomize-controller
flux events --for Kustomization/<name>
kubectl logs -f deploy/kustomize-controller
> flux events --for Kustomization/gitops-deploy -n flux
LAST SEEN TYPE REASON OBJECT MESSAGE
55m Normal ReconciliationSucceeded Kustomization/gitops-deploy Reconciliation finished in 13m58.850691243s, next run in 1m0s
43m (x37 over 3h44m) Normal GitOperationSucceeded GitRepository/gitops-deploy no changes since last reconcilation: observed revision 'main@sha1:aee27b8ea585271ff52ba8fd1578be6605b3033b'
41m Normal ReconciliationSucceeded Kustomization/gitops-deploy Reconciliation finished in 14m0.045854357s, next run in 1m0s
28m (x10 over 8h) Normal GarbageCollectionSucceeded GitRepository/gitops-deploy garbage collected 1 artifacts
27m Normal ReconciliationSucceeded Kustomization/gitops-deploy Reconciliation finished in 13m58.479628828s, next run in 1m0s
13m (x34 over 8h) Normal Progressing Kustomization/gitops-deploy ValidatingWebhookConfiguration/elastic-webhook.k8s.elastic.co configured
13m Normal ReconciliationSucceeded Kustomization/gitops-deploy Reconciliation finished in 13m59.029745298s, next run in 1m0s
3m47s (x6 over 28m) Normal GitOperationSucceeded GitRepository/gitops-deploy no changes since last reconcilation: observed revision 'main@sha1:9d778daf6997d967735ed7ee928ebdef09959e8a'
Here's the logs, they're pretty extensive, so I put them in a zip instead of copy pasting. flux-logs.json.zip
Digging a little bit more this morning using the source code as my guide...
Examining the source code of the flux cli, a reconcile request annotates the ks object, a mutation which I assume (still digging into the code here) triggers the kustomization-controller to run the reconcile. It then polls the ks object until the Ready status is not "Unknown", and the lastHandledReconcileAt field is not the same as it was when pulled prior to making the annotation.
In our case, both of these conditions are failing -- there is a 60s interval set up on the ks, so it is immediately launching into the next scheduled reconciliation after completing the prior one, so the CLI never sees any Ready status but "Unknown". Additionally, lastHandledReconcileAt is not being updated at all:
➜ ~ k get -o yaml -n flux ks gitops-deploy -o jsonpath='{.metadata.annotations}{.status.conditions}{.status.lastHandledReconcileAt}'
{"reconcile.fluxcd.io/requestedAt":"2023-06-22T10:15:37.969587-05:00"}[{"lastTransitionTime":"2023-06-22T15:26:19Z","message":"Detecting drift for revision main@sha1:dec61c161a1751d42e8575a10f6e47be8a1abd84 with a timeout of 30s","observedGeneration":1,"reason":"Progressing","status":"True","type":"Reconciling"},{"lastTransitionTime":"2023-06-22T15:26:16Z","message":"Reconciliation in progress","observedGeneration":1,"reason":"Progressing","status":"Unknown","type":"Ready"}]2023-06-21T17:57:14.840845-07:00%
➜ ~ flux events -n flux --for Kustomization/gitops-deploy --watch
LAST SEEN TYPE REASON OBJECT MESSAGE
...irrelevant...
10s Normal ReconciliationSucceeded Kustomization/gitops-deploy Reconciliation finished in 13m59.847435288s, next run in 1m0s
I can understand why it's immediately moving into the next reconciliation, and I'm going to back the interval off and see if that stabilizes things. It doesn't, however, explain why lastHandledReconcileAt is not being updated.
One more note from our end, we have updated the cluster to the latest pre-v2-rc version, v0.41.2, and our clients are also now up to that version.
Backing the interval off by directly editing the ks allowed for the lastHandleReconcileAt field to be set correctly. It seems that launching immediately into the next reconcile causes a conflict in the kustomize controller trying to update that field?
@e3b0c442 , Sorry for the long pause.
It doesn't, however, explain why lastHandledReconcileAt is not being updated.
lastHandledReconcileAt
is updated on every reconcile run unless the annotation isn't present.
From the command you posted earlier, the Kustomization has a status of progressing, which means the reconciliation is still in progress(so the lastHandledReconcile hasn't been recorded yet).