Diagnostics page for Adaptive decisions

Open TomAugspurger opened this issue 5 years ago • 0 comments

It can be somewhat hard to determine when / why the scheduler decides to scale the cluster under adaptive mode. Ideally a dashboard page could shed some light here.

We currently have /json/counts.json, which provides desired_workers. I think that's the only adaptive-related information we expose right now.
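For reference, here is a minimal sketch of reading that value from outside the scheduler. The base URL (http://localhost:8787) and the helper names are assumptions for illustration, not part of any existing API. The only things taken from above are the /json/counts.json route and the desired_workers key.

```python
import json
from urllib.request import urlopen


def desired_workers_from_counts(payload: str):
    """Return the desired_workers value from a counts.json payload, or None if absent."""
    counts = json.loads(payload)
    return counts.get("desired_workers")


def fetch_desired_workers(base_url: str = "http://localhost:8787"):
    """Fetch counts.json from a running scheduler dashboard and extract desired_workers.

    base_url is an assumption: adjust it to wherever the dashboard is served.
    """
    with urlopen(f"{base_url}/json/counts.json") as response:
        return desired_workers_from_counts(response.read().decode("utf-8"))
```

This only gives the current target, not how the scheduler arrived at it, which is the gap this issue is about.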

I think there are two main pieces of information to convey:

1. Stock: The current state of things, including current CPU load, current CPU capacity, and the current desired CPU capacity. Likewise for memory.
2. Flow: The history of decisions on when to scale the cluster up / down, ideally with information on why those decisions were made (the state at that time).

Here's a rough sketch for number 1.

[Image: Adaptive sketch]
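To make the stock / flow split concrete, here is a rough sketch of the data a dashboard page might need. Everything in it (the Stock, Decision, and DecisionLog names, the fields, the bounded history) is hypothetical and only illustrates the shape of the information. None of it exists in the scheduler today.

```python
from collections import deque
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Stock:
    """Snapshot of the current state (hypothetical fields)."""
    cpu_load: float
    cpu_capacity: float
    desired_cpu_capacity: float
    memory_load: int
    memory_capacity: int
    desired_memory_capacity: int


@dataclass(frozen=True)
class Decision:
    """One scaling decision, plus the state that motivated it."""
    timestamp: float
    action: str  # "up" or "down"
    reason: str
    stock: Stock


class DecisionLog:
    """Bounded history of decisions: the 'flow' part."""

    def __init__(self, maxlen: int = 1000):
        # A deque with maxlen silently drops the oldest entries once full.
        self._decisions = deque(maxlen=maxlen)

    def record(self, decision: Decision) -> None:
        self._decisions.append(decision)

    def latest(self) -> Optional[Decision]:
        return self._decisions[-1] if self._decisions else None

    def history(self) -> List[Decision]:
        return list(self._decisions)
```

Storing the Stock snapshot on each Decision is what would let the page answer "why" after the fact, since the live state will have moved on by the time someone looks.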

cc @rsignell-usgs, @jsignell for adaptive things, and @jacobtomlinson for dashboard design things.

Mar 27 '20 13:03 TomAugspurger