agent-stack-k8s

Controller stops accepting jobs from the cluster queue

Open aressem opened this issue 10 months ago • 3 comments

We have agent-stack-k8s up and running, and it works fine for a while. However, it suddenly stops accepting new jobs, and the last thing it outputs (with debug logging turned on) is:

2024-04-08T11:38:23.100Z	DEBUG	limiter	scheduler/limiter.go:77	max-in-flight reached	{"in-flight": 25}

We currently only have a single pipeline, single cluster and single queue. When this happens there are no jobs or pods named buildkite-${UUID} in the k8s cluster. Executing kubectl -n buildkite rollout restart deployment agent-stack-k8s makes the controller happy again and it starts jobs from the queue.

I suspect that there is something that should decrement the in-flight number, but fails to do so. We are now running a test where this number is set to 0 to see if that works around the problem.

aressem, Apr 08 '24 12:04

Hi @aressem, did you discover anything with your tests where the number is set to 0?

DrJosh9000, Apr 23 '24 04:04

@DrJosh9000, the pipeline works as expected with max-in-flight set to 0. I don't know what the in-flight count might be now, but I suspect it is steadily increasing :)

aressem, Apr 23 '24 07:04

Same issue when testing with max-in-flight: 1 on v0.11.0. At some point the controller stops taking new jobs, even though there are no jobs or pods running in the namespace besides the controller itself.

2024-05-21T21:31:57.923Z	DEBUG	limiter	scheduler/limiter.go:79	max-in-flight reached	{"in-flight": 1}

artem-zinnatullin, May 21 '24 21:05