
Scheduler Metrics

Open d80tb7 opened this issue 2 years ago • 0 comments

The new "pulsar backed" scheduler should expose a set of Prometheus metrics that shed light on its internal working. An initial set of metrics would be:

  • Scheduler cycle time
  • Number of jobs considered (per queue?)
  • Number of jobs scheduled (per cluster etc.)
  • Number of jobs preempted
  • Number of clusters scheduled
  • Evaluated fair share of each queue
  • Delta between fair share and usage of each queue
  • Did the cycle complete successfully (added 23/08)

Note that because Prometheus samples metrics at scrape time (so any value set and then overwritten between two scrapes is never seen), we probably want to store some or all of these as histograms rather than gauges.
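To illustrate the sampling point, here is a minimal standard-library sketch (it deliberately does not use the Prometheus client). The `gauge` and `histogram` types, the bucket bounds and the cycle durations are all hypothetical and only model the behaviour: a gauge exposes whatever was set last, while a histogram accumulates every observation made between scrapes.

```go
package main

import "fmt"

// gauge keeps only the most recent value; a scrape sees whatever was set last.
type gauge struct{ value float64 }

func (g *gauge) Set(v float64) { g.value = v }

// histogram keeps cumulative bucket counts plus a running sum and count,
// so every observation between scrapes contributes to what is exposed.
type histogram struct {
	upperBounds []float64 // ascending bucket upper bounds, in seconds (assumed values)
	counts      []uint64  // counts[i] = number of observations <= upperBounds[i]
	sum         float64
	count       uint64
}

func newHistogram(upperBounds []float64) *histogram {
	return &histogram{
		upperBounds: upperBounds,
		counts:      make([]uint64, len(upperBounds)),
	}
}

// Observe increments every bucket whose upper bound covers v (cumulative buckets).
func (h *histogram) Observe(v float64) {
	for i, ub := range h.upperBounds {
		if v <= ub {
			h.counts[i]++
		}
	}
	h.sum += v
	h.count++
}

// recordCycles feeds the same scheduler cycle durations to both metric kinds,
// as would happen between two scrapes.
func recordCycles(durations []float64) (*gauge, *histogram) {
	g := &gauge{}
	h := newHistogram([]float64{0.5, 1, 5, 10})
	for _, d := range durations {
		g.Set(d)
		h.Observe(d)
	}
	return g, h
}

func main() {
	// Hypothetical cycle times in seconds; the 8s cycle is a slow outlier.
	g, h := recordCycles([]float64{0.25, 8, 0.5})
	fmt.Println("gauge at scrape time:", g.value)
	fmt.Println("histogram bucket counts:", h.counts, "count:", h.count, "sum:", h.sum)
}
```

With these inputs the gauge ends up holding only the last cycle time, so the slow cycle is invisible to a scrape, whereas the histogram still records one observation above the 5s bucket.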

There is already some prior art for exposing Prometheus metrics in Armada; see for example here and here (the latter being the new scheduler exposing which instance is leader). We use the official Prometheus library for this, but we've found it quite difficult because:

  • It's hard to write unit tests
  • It is quite fiddly to use (lots of strings and array sizes that need to match up across different places in the code, with panics if they don't)
  • Quite a lot of boilerplate to write
  • Everything is asynchronous.

It might therefore be worth evaluating one of two possibilities here:

  • we're idiots and we're using this library incorrectly
  • there is another library that we can use which is more suited to our use case.

d80tb7 · Jul 13 '23 13:07