agent-stack-k8s
Controller stops accepting jobs from the cluster queue
We have agent-stack-k8s up and running, and it works fine for a while. However, it suddenly stops accepting new jobs, and the last thing it outputs (with debug logging turned on) is:
2024-04-08T11:38:23.100Z DEBUG limiter scheduler/limiter.go:77 max-in-flight reached {"in-flight": 25}
We currently have only a single pipeline, a single cluster, and a single queue. When this happens there are no jobs or pods named buildkite-${UUID} in the k8s cluster. Executing kubectl -n buildkite rollout restart deployment agent-stack-k8s makes the controller happy again, and it starts jobs from the queue.
I suspect that something should decrement the in-flight number but fails to do so. We are now running a test with max-in-flight set to 0 to see if that works around the problem.
Hi @aressem, did you discover anything with your tests where the number is set to 0?
@DrJosh9000, the pipeline works as expected with in-flight set to 0. I don't know what that number might be now, but I suspect it is steadily increasing :)
Same issue when testing with max-in-flight: 1 on v0.11.0. At some point the controller stops taking new jobs even though there are no jobs/pods running in the namespace besides the controller itself.
2024-05-21T21:31:57.923Z DEBUG limiter scheduler/limiter.go:79 max-in-flight reached {"in-flight": 1}