buildkite-agent-metrics

Regression in the 5.9.8 Release

Open rajatvig opened this issue 1 year ago • 6 comments

Issue Details

After upgrading from 5.9.4 to 5.9.8, we noticed that the metric for running builds is not updated after the builds complete. This changes our scaling behaviour, because the calculation we use sums running and scheduled builds for a queue to decide whether enough agents are running. The metric that stays stuck at the same value is buildkite_queues_running_jobs_count.

Setup

We are running unclustered agents and running the agent metrics binary to export metrics to Prometheus.

rajatvig avatar Sep 03 '24 10:09 rajatvig

Hi @rajatvig, thanks for raising the issue. There was one change to the Prometheus backend (#288) which may explain the issue you're seeing: all metrics now have a cluster label, where previously they may not have. This could break queries if the label doesn't match or isn't ignored appropriately. Unfortunately the change was necessary to fix a panic.

Can you share the exact PromQL query?

DrJosh9000 avatar Sep 05 '24 02:09 DrJosh9000

I did see that PR merged but wasn't able to tie it back to the issue we are seeing. We are not yet running clustered agents.

The full PromQL we use is:

100 * (sum(buildkite_queues_running_jobs_count{queue="queue"} + buildkite_queues_scheduled_jobs_count{queue="queue"}) or vector(0))

That gives us a count of running and scheduled jobs, which helps us determine how many agents we need to run. While the buildkite_queues_scheduled_jobs_count metric was fine, the metric buildkite_queues_running_jobs_count did not go to 0 when there were no builds running.
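To make the effect on scaling concrete, here is a minimal Python sketch of how the query above evaluates. It is a toy model of PromQL's one-to-one vector matching and the `or vector(0)` fallback, not Prometheus itself; the helper names (`select`, `add`, `sum_or_zero`, `scaling_signal`, `series`) and the sample values are made up for illustration.

```python
# Toy model of the PromQL expression above; not the Prometheus implementation.
# An instant vector is modelled as {frozenset of (label, value) pairs: sample}.

def series(**labels):
    """Build the label-set key for one series (hypothetical helper)."""
    return frozenset(labels.items())

def select(vector, **matchers):
    """Keep only series whose labels satisfy every equality matcher."""
    wanted = set(matchers.items())
    return {labels: v for labels, v in vector.items() if wanted <= labels}

def add(left, right):
    """One-to-one `+`: only label sets present on both sides survive."""
    return {labels: left[labels] + right[labels]
            for labels in left.keys() & right.keys()}

def sum_or_zero(vector):
    """`sum(...) or vector(0)`: an empty result falls back to 0."""
    return sum(vector.values()) if vector else 0

def scaling_signal(running, scheduled, queue):
    matched = add(select(running, queue=queue), select(scheduled, queue=queue))
    return 100 * sum_or_zero(matched)

scheduled = {series(queue="test"): 0}
print(scaling_signal({series(queue="test"): 0}, scheduled, "test"))  # 0
print(scaling_signal({series(queue="test"): 1}, scheduled, "test"))  # 100
print(scaling_signal({}, scheduled, "test"))                         # 0
```

In this model, a running-jobs series stuck at 1 keeps the signal at 100 even though nothing is scheduled, whereas a series that drops to 0 or disappears brings the signal back to 0.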

rajatvig avatar Sep 05 '24 12:09 rajatvig

I see, interesting. The metric being stuck could be related to #296, which removed a well-intended but heavy-handed gauge reset. Is the metric stuck for all queues, or a particular queue? Is it stuck for queues that were deleted?
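As a rough illustration of why dropping a reset can leave a gauge stuck, here is a toy Python model of an exporter's polling cycle. It is a sketch under assumptions, not the code from #296: the `QueueGauge` class, the `collect` function, and the premise that a poll can omit a queue entirely are all hypothetical.

```python
# Toy model of an exporter's gauge storage; hypothetical, not the project's code.

class QueueGauge:
    def __init__(self):
        self.values = {}  # queue name -> last exported value

    def reset(self):
        """Drop every exported series before repopulating."""
        self.values.clear()

    def set(self, queue, value):
        self.values[queue] = value

def collect(gauge, poll_result, reset_first):
    """Apply one polling cycle; poll_result lists only the queues that were reported."""
    if reset_first:
        gauge.reset()
    for queue, running in poll_result.items():
        gauge.set(queue, running)
    return dict(gauge.values)

# Assumption: the final poll no longer mentions the queue at all.
polls = [{"test": 2}, {"test": 1}, {}]
with_reset, without_reset = QueueGauge(), QueueGauge()
for poll in polls:
    exported_with_reset = collect(with_reset, poll, reset_first=True)
    exported_without_reset = collect(without_reset, poll, reset_first=False)

print(exported_with_reset)     # {}
print(exported_without_reset)  # {'test': 1}
```

With the reset, a queue that stops being reported vanishes from the exported metrics; without it, the last value written for that queue stays exported until something explicitly overwrites or deletes it.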

DrJosh9000 avatar Sep 09 '24 07:09 DrJosh9000

It was stuck for queues that were deleted, i.e. no builds were running.

rajatvig avatar Sep 11 '24 10:09 rajatvig

Sounds like #305 should fix it - I'll optimistically close this as fixed, please give v5.9.9 a try and feel free to re-open if you see the same issue.

DrJosh9000 avatar Sep 12 '24 00:09 DrJosh9000

I just gave 5.9.9 a try and am still seeing similar behaviour. I set up 2 jobs on the test queue, and the metric buildkite_queues_running_jobs_count{queue="test"} went to 2 and then to 1, but did not go to 0 or become absent as it did earlier.

rajatvig avatar Sep 16 '24 22:09 rajatvig