
[batch] Use a simulated job queue to estimate the ready cores in the control loop

Open jigold opened this issue 1 year ago • 1 comments

I added a new Grafana panel, without alerts, that should help us catch cases where jobs aren't being scheduled in a timely manner. To add an alert, I think we'd want to measure the average wait time of a job in the queue, which would require more infrastructure (keeping track of the last state change). We can consider adding that now, but I'm not sure how much work it would be.

jigold avatar Sep 21 '22 16:09 jigold
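
For illustration, the average-wait-time measurement mentioned in the comment above could look roughly like the sketch below. It is not code from this PR: the `QueuedJob` record, its `last_state_change_ms` field, the `"Ready"` state name, and the alert threshold are all assumptions made for the example. It only shows that once a last-state-change timestamp is tracked per job, the metric is a simple mean over the jobs still waiting.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class QueuedJob:
    # Hypothetical record; field names are illustrative, not the Batch schema.
    job_id: int
    state: str
    last_state_change_ms: int  # when the job entered its current state


def average_ready_wait_ms(jobs: Iterable[QueuedJob], now_ms: int) -> Optional[float]:
    """Mean time (ms) that jobs currently in the assumed "Ready" state have waited.

    Returns None when no job is waiting, so callers can tell "empty queue"
    apart from "zero wait".
    """
    waits = [now_ms - job.last_state_change_ms for job in jobs if job.state == "Ready"]
    if not waits:
        return None
    return sum(waits) / len(waits)


def should_alert(avg_wait_ms: Optional[float], threshold_ms: float) -> bool:
    """True when there is a measured average wait and it exceeds the threshold."""
    return avg_wait_ms is not None and avg_wait_ms > threshold_ms
```

A value like this could be exported as a gauge and alerted on in Grafana; how the timestamp is stored and which threshold is reasonable would depend on the actual scheduler.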

I changed the control loop to use the old method of estimating ready cores for now. I'll change it to the new way in a second PR. This way we can make sure we're happy with the metrics in Grafana before I change the behavior.

jigold avatar Sep 22 '22 19:09 jigold

This is ready for a look. The tests are passing, except for one service backend test that timed out.

jigold avatar Sep 28 '22 12:09 jigold