flux2 icon indicating copy to clipboard operation
flux2 copied to clipboard

flux reconcile kustomization CLI command times out, interval reconcile runs 20x slower than it used to, but reconcile helmrelease & source cli commands work fine

Open chase-replicated opened this issue 1 year ago • 8 comments

Describe the bug

We recently migrated to the AWS VPC CNI from weave-net on our EKS cluster, and the only thing not working now is the CLI flux reconcile kustomization command. It gets so far as annotating the kustomization and then just hangs for as long as the timeout is set. I've tried up to a 60 minute timeout and get the same result. It previously would reconcile within 15 seconds.

Running flux reconcile helmrelease or flux reconcile source git both work within seconds. We're not seeing anything useful in the logs to indicate what the issue could be. Additionally, the flux interval based kustomization reconciliation runs, but a lot slower than it used to.

I've also confirmed that I can git clone the repo it's supposed to reconcile from from within an alpine based pod in the flux namespace.

Steps to reproduce

  1. Install flux on an EKS cluster running the AWS VPC CNI
  2. Create a flux kustomization resource
  3. Run flux reconcile kustomization --timeout 60m
  4. See error
> flux reconcile kustomization <name> --timeout 60m
► annotating Kustomization <name> in flux namespace
✔ Kustomization annotated
◎ waiting for Kustomization reconciliation
client rate limiter Wait returned an error: context deadline exceeded

Expected behavior

Reconciles within seconds

Screenshots and recordings

No response

OS / Distro

Amazon Linux 2 (EKS latest recommended AMI)

Flux version

flux: v0.41.2

Flux check

flux check -n flux ► checking prerequisites ✗ flux 0.41.2 <2.0.0-rc.5 (new version is available, please upgrade) ✔ Kubernetes 1.23.17-eks-0a21954 >=1.20.6-0 ► checking controllers ✔ helm-controller: deployment ready ► ghcr.io/fluxcd/helm-controller:v0.31.2 ✔ kustomize-controller: deployment ready ► ghcr.io/fluxcd/kustomize-controller:v0.35.1 ✔ notification-controller: deployment ready ► ghcr.io/fluxcd/notification-controller:v0.33.0 ✔ source-controller: deployment ready ► ghcr.io/fluxcd/source-controller:v0.36.1 ► checking crds ✔ alerts.notification.toolkit.fluxcd.io/v1beta1 ✔ buckets.source.toolkit.fluxcd.io/v1beta2 ✔ gitrepositories.source.toolkit.fluxcd.io/v1beta2 ✔ helmcharts.source.toolkit.fluxcd.io/v1beta2 ✔ helmreleases.helm.toolkit.fluxcd.io/v2beta1 ✔ helmrepositories.source.toolkit.fluxcd.io/v1beta2 ✔ kustomizations.kustomize.toolkit.fluxcd.io/v1beta2 ✔ ocirepositories.source.toolkit.fluxcd.io/v1beta2 ✔ providers.notification.toolkit.fluxcd.io/v1beta1 ✔ receivers.notification.toolkit.fluxcd.io/v1beta1 ✔ all checks passed

Git provider

Github

Container Registry provider

No response

Additional context

flux logs command returns a blank line and nothing else. (also tried flux logs -n flux & flux logs -A to no avail)

Code of Conduct

  • [X] I agree to follow this project's Code of Conduct

chase-replicated avatar Jun 21 '23 19:06 chase-replicated

what do you see when you run flux get ks <name>

somtochiama avatar Jun 21 '23 19:06 somtochiama

what do you see when you run flux get ks <name>

➜  ~ flux -n flux get ks gitops-deploy
NAME         	REVISION          	SUSPENDED	READY  	MESSAGE                    
gitops-deploy	main@sha1:cd18a932	False    	Unknown	Reconciliation in progress	

➜  ~ k get -n flux ks gitops-deploy     
NAME            AGE     READY     STATUS
gitops-deploy   5h38m   Unknown   Reconciliation in progress

e3b0c442 avatar Jun 21 '23 19:06 e3b0c442

Can you check the events of the kustomization and logs of the kustomize-controller

flux events --for Kustomization/<name>
kubectl logs -f deploy/kustomize-controller

somtochiama avatar Jun 21 '23 20:06 somtochiama

Can you check the events of the kustomization and logs of the kustomize-controller

flux events --for Kustomization/<name>
kubectl logs -f deploy/kustomize-controller
> flux events --for Kustomization/gitops-deploy -n flux
LAST SEEN           	TYPE  	REASON                    	OBJECT                     	MESSAGE
55m                 	Normal	ReconciliationSucceeded   	Kustomization/gitops-deploy	Reconciliation finished in 13m58.850691243s, next run in 1m0s                                              	
43m (x37 over 3h44m)	Normal	GitOperationSucceeded     	GitRepository/gitops-deploy	no changes since last reconcilation: observed revision 'main@sha1:aee27b8ea585271ff52ba8fd1578be6605b3033b'	
41m                 	Normal	ReconciliationSucceeded   	Kustomization/gitops-deploy	Reconciliation finished in 14m0.045854357s, next run in 1m0s                                               	
28m (x10 over 8h)   	Normal	GarbageCollectionSucceeded	GitRepository/gitops-deploy	garbage collected 1 artifacts                                                                              	
27m                 	Normal	ReconciliationSucceeded   	Kustomization/gitops-deploy	Reconciliation finished in 13m58.479628828s, next run in 1m0s                                              	
13m (x34 over 8h)   	Normal	Progressing               	Kustomization/gitops-deploy	ValidatingWebhookConfiguration/elastic-webhook.k8s.elastic.co configured                                   	
13m                 	Normal	ReconciliationSucceeded   	Kustomization/gitops-deploy	Reconciliation finished in 13m59.029745298s, next run in 1m0s                                              	
3m47s (x6 over 28m) 	Normal	GitOperationSucceeded     	GitRepository/gitops-deploy	no changes since last reconcilation: observed revision 'main@sha1:9d778daf6997d967735ed7ee928ebdef09959e8a'	

Here's the logs, they're pretty extensive, so I put them in a zip instead of copy pasting. flux-logs.json.zip

chase-replicated avatar Jun 22 '23 01:06 chase-replicated

Digging a little bit more this morning using the source code as my guide...

Examining the source code of the flux cli, a reconcile request annotates the ks object, a mutation which I assume (still digging into the code here) triggers the kustomization-controller to run the reconcile. It then polls the ks object until the Ready status is not "Unknown", and the lastHandledReconcileAt field is not the same as it was when pulled prior to making the annotation.

In our case, both of these conditions are failing -- there is a 60s interval set up on the ks, so it is immediately launching into the next scheduled reconciliation after completing the prior one, so the CLI never sees any Ready status but "Unknown". Additionally, lastHandledReconcileAt is not being updated at all:

➜  ~ k get -o yaml -n flux ks gitops-deploy -o jsonpath='{.metadata.annotations}{.status.conditions}{.status.lastHandledReconcileAt}'
{"reconcile.fluxcd.io/requestedAt":"2023-06-22T10:15:37.969587-05:00"}[{"lastTransitionTime":"2023-06-22T15:26:19Z","message":"Detecting drift for revision main@sha1:dec61c161a1751d42e8575a10f6e47be8a1abd84 with a timeout of 30s","observedGeneration":1,"reason":"Progressing","status":"True","type":"Reconciling"},{"lastTransitionTime":"2023-06-22T15:26:16Z","message":"Reconciliation in progress","observedGeneration":1,"reason":"Progressing","status":"Unknown","type":"Ready"}]2023-06-21T17:57:14.840845-07:00% 
➜  ~ flux events -n flux --for Kustomization/gitops-deploy --watch
LAST SEEN         	TYPE  	REASON     	OBJECT                     	MESSAGE                                                                  
...irrelevant...
10s	Normal	ReconciliationSucceeded	Kustomization/gitops-deploy	Reconciliation finished in 13m59.847435288s, next run in 1m0s

I can understand why it's immediately moving into the next reconciliation, and I'm going to back the interval off and see if that stabilizes things. It doesn't, however, explain why lastHandledReconcileAt is not being updated.

e3b0c442 avatar Jun 22 '23 15:06 e3b0c442

One more note from our end, we have updated the cluster to the latest pre-v2-rc version, v0.41.2, and our clients are also now up to that version.

e3b0c442 avatar Jun 22 '23 15:06 e3b0c442

Backing the interval off by directly editing the ks allowed for the lastHandleReconcileAt field to be set correctly. It seems that launching immediately into the next reconcile causes a conflict in the kustomize controller trying to update that field?

e3b0c442 avatar Jun 22 '23 15:06 e3b0c442

@e3b0c442 , Sorry for the long pause.

It doesn't, however, explain why lastHandledReconcileAt is not being updated.

lastHandledReconcileAt is updated on every reconcile run unless the annotation isn't present. From the command you posted earlier, the Kustomization has a status of progressing, which means the reconciliation is still in progress(so the lastHandledReconcile hasn't been recorded yet).

somtochiama avatar Jul 10 '23 10:07 somtochiama