failed to wait for object to sync in-cache after patching context deadline exceeded
Really, what does it mean? And why are there no other logs that describe what's going on?
2024-07-01T20:04:53.761Z info HelmRelease/something.flux-system - release out-of-sync with desired state: release config values changed
2024-07-01T20:04:53.791Z info HelmRelease/something.flux-system - running 'upgrade' action with timeout of 5m0s
2024-07-01T20:04:54.720Z info HelmRelease/something.flux-system - release is in a failed state
2024-07-01T20:04:54.789Z info HelmRelease/something.flux-system - running 'rollback' action with timeout of 5m0s
2024-07-01T20:05:05.069Z error HelmRelease/something.flux-system - failed to wait for object to sync in-cache after patching context deadline exceeded
> failed to wait for object to sync in-cache after patching context deadline exceeded
This means the controller stopped receiving data from the Kubernetes API; I suspect your Kubernetes control plane is having issues.
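For background, the message comes from the controller patching an object and then waiting for its informer cache to reflect that change before it continues; "context deadline exceeded" means no watch event arrived in time. The sketch below shows the general pattern (simplified and illustrative, not the actual helm-controller code; the helper name, polling intervals and resourceVersion check are assumptions):

```go
package example

import (
	"context"
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// patchAndWaitForCache patches an object and then waits until the
// informer-backed cache has observed the write. If the controller stops
// receiving watch events from the API server, the wait times out with
// "context deadline exceeded".
func patchAndWaitForCache(ctx context.Context, c client.Client, obj client.Object, patch client.Patch) error {
	// Remember where the object was before the write.
	oldRV := obj.GetResourceVersion()

	if err := c.Patch(ctx, obj, patch); err != nil {
		return err
	}
	key := client.ObjectKeyFromObject(obj)

	// Poll the cached client until it has seen the new resourceVersion.
	err := wait.PollUntilContextTimeout(ctx, 500*time.Millisecond, 10*time.Second, true,
		func(ctx context.Context) (bool, error) {
			cached := obj.DeepCopyObject().(client.Object)
			if err := c.Get(ctx, key, cached); err != nil {
				return false, nil // tolerate transient errors and keep polling
			}
			return cached.GetResourceVersion() != oldRV, nil
		})
	if err != nil {
		return fmt.Errorf("failed to wait for object to sync in-cache after patching: %w", err)
	}
	return nil
}
```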
We are having the same problem, but in addition the helm-controller pod is in a CrashLoopBackOff because of repeated liveness probe failures.
Probably the Liveness probe should still work even if there are problems contacting the control plane
> Probably the Liveness probe should still work even if there are problems contacting the control plane
Not if you build your controller with Kubernetes controller-runtime. Having the controller running and DDoSing the API endpoint would do you no good; kubelet will restart the controller with an exponential backoff, which prevents the API server from being overloaded once it comes back up.
> Having the controller running and DDoSing the API endpoint would do you no good
We downgraded the control plane (GKE rapid channel) and now everything seems to be fine again. I still haven't really found the root cause, but my point was that if the controller is behaving properly, but the k8s API is overloaded or unresponsive for some reason other than the controller, the liveness probe on the controller should still pass the checks, right?
> the liveness probe on the controller should still pass the checks, right?
Not if the CNI is failing: kubelet can't reach the port. There is nothing special about the liveness probe; it's the standard controller-runtime ping handler: https://github.com/fluxcd/pkg/blob/ac1007b57e37838e73b8bc95365dab9a0e856e8e/runtime/probes/probes.go#L45
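For reference, the handler behind that link is essentially controller-runtime's stock `healthz.Ping` checker, which always returns nil and never touches the Kubernetes API. A rough sketch of the wiring (simplified; the exact code in fluxcd/pkg may differ):

```go
package example

import (
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/healthz"
)

// setupProbes registers the stock Ping checker for both liveness and
// readiness. If the HTTP request from kubelet reaches the manager's health
// endpoint, the probe passes; the check itself does not depend on the API
// server being reachable.
func setupProbes(mgr ctrl.Manager) error {
	if err := mgr.AddHealthzCheck("healthz", healthz.Ping); err != nil {
		return err
	}
	return mgr.AddReadyzCheck("readyz", healthz.Ping)
}
```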
> Not if the CNI is failing: kubelet can't reach the port.
That's not the case, as there are several other applications running in the same cluster (and on the same node as the Flux controllers), and none of them have any problems, neither communicating with the internet nor among each other.
Also, the liveness port of the Flux controllers is reachable, but it just doesn't respond.
What I think is happening is that the problematic control-plane version changed something related to rate limiting of API queries, and that only affects Flux because in our case it's the app that queries the k8s API the most.
I'm pretty sure we can reproduce the issue easily by switching the control plane back to the problematic version, if you are willing to debug this together.
@fcuello-fudo if Flux runs into rate limits there must be error logs; if you can post those, it would be helpful. We use the Kubernetes PriorityAndFairness flow control to make our controllers comply with Kubernetes API rate limits; if the flow API is buggy, this could lead to a disconnect: https://github.com/fluxcd/pkg/blob/ac1007b57e37838e73b8bc95365dab9a0e856e8e/runtime/client/client.go#L76
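The idea behind that flow-control detection is roughly the following (a hedged sketch only; the function name, fallback values and overall structure are illustrative, not the exact fluxcd/pkg API):

```go
package example

import (
	"k8s.io/client-go/discovery"
	"k8s.io/client-go/rest"
)

const flowControlGroup = "flowcontrol.apiserver.k8s.io"

// tuneRateLimits checks whether the API server advertises the Priority and
// Fairness API group. If it does, client-side throttling is disabled and the
// server is left to shape the traffic; otherwise conservative client-side
// limits are used (values here are illustrative).
func tuneRateLimits(config *rest.Config) (*rest.Config, error) {
	dc, err := discovery.NewDiscoveryClientForConfig(config)
	if err != nil {
		return nil, err
	}
	groups, err := dc.ServerGroups()
	if err != nil {
		return nil, err
	}
	for _, g := range groups.Groups {
		if g.Name == flowControlGroup {
			// Server-side flow control is available: a negative QPS turns off
			// client-go's token bucket in recent client-go versions.
			config.QPS = -1
			return config, nil
		}
	}
	// Fall back to client-side rate limiting.
	config.QPS = 50
	config.Burst = 300
	return config, nil
}
```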