image-automation-controller icon indicating copy to clipboard operation
image-automation-controller copied to clipboard

Flux 2.2, image-automation-controller stops reconciling; needed restart

Open 0xStarcat opened this issue 1 year ago • 13 comments

Making a new issue after seeing this comment in the last issue on this topic: https://github.com/fluxcd/image-automation-controller/issues/282#issuecomment-1151022724

Specs

On Flux v2.2 with the image-update-automation pod running the image image-automation-controller:v0.37.0 for image.toolkit.fluxcd.io/v1beta1 automation resources

Description

The issue we saw was the image-automation-controller stopped reconciling for all ImageUpdateAutomation resources in the cluster for ~ 5 hours, then when it resumed stopped reconciling only 1 ImageUpdateAutomation resource out of many in the cluster.

To resolve it, had to delete the image-automation-controller pod and upon restart everything worked fine.

The following telemetry screenshots show CPU usage dropping off entirely, log volume dropping off entirely, and no errors produced in logs. There were no pod restarts / crashloops or any k8s events for the pods.

Screenshot 2024-05-29 at 2 53 47 PM
Screenshot 2024-05-29 at 2 54 30 PM

I've attached a debug/pprof/goroutine?debug=2" profiler to this issue as well.

heap.txt

Please let me know if there's any further details I can provide. Thank you!

0xStarcat avatar May 29 '24 19:05 0xStarcat