Scheduler Metrics

Open d80tb7 opened this issue 2 years ago • 0 comments

The new Pulsar-backed scheduler should expose a set of Prometheus metrics that shed light on its internal workings. An initial set of metrics would be:

Scheduler cycle time
Number of jobs considered (per queue?)
Number of jobs scheduled (per cluster etc.)
Number of jobs preempted
Number of clusters scheduled
Evaluated fair share of each queue
Delta between fair share and usage of each queue
Whether the cycle completed successfully (added 23/08)

Note that due to the way Prometheus works (i.e. it samples) we probably want to store some or all of these as histograms rather than gauges.
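To illustrate the sampling concern, here is a minimal stdlib-only sketch (not the Prometheus client library, and not Armada code). The `Gauge` and `Histogram` types, the `recordCycles` helper, the bucket bounds and the cycle durations are all made up for the example. It assumes several scheduler cycles run between two scrapes and compares what each metric type would still know at scrape time.

```go
package main

import (
	"fmt"
	"sort"
)

// Gauge keeps only the most recently set value (hypothetical stand-in type).
type Gauge struct{ value float64 }

func (g *Gauge) Set(v float64) { g.value = v }

// Histogram counts observations per bucket, loosely mirroring how a
// Prometheus histogram accumulates between scrapes (hypothetical type).
type Histogram struct {
	bounds []float64 // sorted bucket upper bounds, in seconds
	counts []uint64  // counts[i] holds observations that fall in bucket i; the last entry is the overflow bucket
	count  uint64    // total number of observations
}

func NewHistogram(bounds ...float64) *Histogram {
	b := append([]float64(nil), bounds...)
	sort.Float64s(b)
	return &Histogram{bounds: b, counts: make([]uint64, len(b)+1)}
}

// Observe records v in the first bucket whose upper bound is >= v.
func (h *Histogram) Observe(v float64) {
	i := sort.SearchFloat64s(h.bounds, v)
	h.counts[i]++
	h.count++
}

// CountAtOrBelow returns how many observations landed in buckets whose
// upper bound is <= bound. It is exact when bound is one of the bucket bounds.
func (h *Histogram) CountAtOrBelow(bound float64) uint64 {
	var total uint64
	for i, b := range h.bounds {
		if b > bound {
			break
		}
		total += h.counts[i]
	}
	return total
}

// recordCycles feeds the same cycle durations into both metric types.
// The bucket bounds (0.5s, 1s, 5s) are an assumption for this example.
func recordCycles(durations []float64) (*Gauge, *Histogram) {
	g := &Gauge{}
	h := NewHistogram(0.5, 1, 5)
	for _, d := range durations {
		g.Set(d)
		h.Observe(d)
	}
	return g, h
}

func main() {
	// Five cycles complete between two scrapes; the third one is slow.
	g, h := recordCycles([]float64{0.2, 0.3, 4.0, 0.25, 0.2})

	fmt.Printf("gauge at scrape time: %.2fs\n", g.value)
	fmt.Printf("histogram: %d cycles, %d slower than 1s\n",
		h.count, h.count-h.CountAtOrBelow(1))
}
```

With these made-up numbers the gauge only reports the last 0.2s cycle, so the 4s outlier is invisible at scrape time, while the histogram still shows that one of the five cycles took longer than 1s.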

There is already some prior art for exposing Prometheus metrics in Armada; see for example here and here (the latter of those being the new scheduler exposing which instance is leader). We use the official Prometheus library for this, but we've found it quite difficult because:

It's hard to write unit tests
It is quite fiddly to use (lots of strings and array sizes that need to match up across different places in the code, with panics if they don't)
Quite a lot of boilerplate to write
Everything is asynchronous.

It might therefore be worth evaluating which of two possibilities applies here:

we're idiots and we're using this library incorrectly
there is another library that we can use which is more suited to our use case.

Jul 13 '23 13:07 d80tb7