Total CPU % on /workers tab makes little sense

Open crusaderky opened this issue 1 year ago • 2 comments

From https://dask.discourse.group/t/dask-worker-using-600-of-the-cpu/2489/2

The CPU % on each individual worker scales from 0 to nthreads*100; e.g. on a worker with 8 threads it can go from 0% to 800%. This is consistent with several other CPU monitors in the wild, so it makes sense.

The CPU % on the Total row, however, is calculated as in https://github.com/dask/distributed/blob/4425516f86d7de9aed06053fc3e21a17fe20efc4/distributed/dashboard/components/scheduler.py#L4303-L4308

So for example on a cluster with 2 workers, 8 threads per worker, if one worker is flat out busy while the other is idle, the Total will be 400%, which makes very little sense.

name        nthreads  cpu
Total (2)   16        400%
tcp://...   8         800%
tcp://...   8         0%
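
To make the arithmetic behind that table concrete, here is a minimal sketch. It assumes the Total row is the unweighted mean of the per-worker CPU values, which is what the 400% in the example implies; it does not reproduce the linked dashboard code, and `total_cpu_mean` is a hypothetical name.

```python
def total_cpu_mean(worker_cpu):
    """Unweighted mean of per-worker CPU % values.

    Assumption: this mirrors how the Total row behaves in the example
    above; it is not copied from the dashboard source.
    Each value ranges from 0 to nthreads * 100 for its worker.
    """
    if not worker_cpu:
        return 0.0
    return sum(worker_cpu) / len(worker_cpu)


# Two workers with 8 threads each: one flat out busy, one idle.
print(total_cpu_mean([800.0, 0.0]))
```

The result sits on neither a 0 to 100 scale nor a 0 to 1600 scale, which is why it is hard to interpret.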

I think we should change the CPU % on each worker to go from 0% to 100%, and the Total row to do the same (total CPU usage across the cluster / total number of threads). In the above example, that would become:

name        nthreads  cpu
Total (2)   16        50%
tcp://...   8         100%
tcp://...   8         0%
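
A minimal sketch of the proposed normalization, assuming each worker reports a raw CPU % in the range 0 to nthreads*100 together with its thread count. The function names and the `(cpu, nthreads)` tuple layout are hypothetical, chosen only for illustration.

```python
def worker_cpu_normalized(cpu, nthreads):
    """Scale a worker's raw CPU % (0 to nthreads * 100) down to 0 to 100."""
    return cpu / nthreads


def total_cpu_normalized(workers):
    """Total CPU usage across the cluster divided by total number of threads.

    `workers` is a list of (raw_cpu_percent, nthreads) tuples.
    """
    total_threads = sum(nthreads for _, nthreads in workers)
    if total_threads == 0:
        return 0.0
    return sum(cpu for cpu, _ in workers) / total_threads


workers = [(800.0, 8), (0.0, 8)]  # one busy worker, one idle worker
for cpu, nthreads in workers:
    print(worker_cpu_normalized(cpu, nthreads))
print(total_cpu_normalized(workers))
```

Dividing by the total thread count, rather than averaging the per-worker percentages, also keeps the Total meaningful when workers have different numbers of threads.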

crusaderky • Feb 03 '24 18:02

I'm a bit torn here:

On the one hand, summing the CPU usage up feels useful for a single worker. For example, being stuck at 100% might indicate that we're not able to effectively use our multi-core CPUs. Being stuck at 12.5% (on an 8-core machine) feels less useful, particularly since we never tell you the number of cores on your machine.

On the other hand, summing the CPU usage up makes the total meaningless very quickly, particularly on an adaptive cluster. (Cool, CPU usage is >9000%... what does that even mean?)

There are a few alternatives that come to mind:

  • Sum everything up, but provide an upper bound, e.g. 400% (800%) (presentation is TBD).
  • Collect the CPU statistics per CPU (psutil.cpu_percent(percpu=True)) and calculate some truly meaningful statistics, e.g., min, max, mean, median, 20th/80th percentile, etc.
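
Both alternatives can be sketched roughly as follows. This is only an illustration under stated assumptions: `sum_with_upper_bound` and `summarize_per_cpu` are hypothetical helpers, the `(cpu, nthreads)` tuple layout is made up for the example, and the choice of inclusive quintiles for the 20th/80th percentiles is mine. The only real library call is `psutil.cpu_percent(interval=..., percpu=True)`, which returns one system-wide utilization value per logical CPU.

```python
import statistics


def sum_with_upper_bound(workers):
    """First alternative: summed CPU % plus its theoretical maximum.

    `workers` is a list of (raw_cpu_percent, nthreads) tuples.
    Returns (used, bound), e.g. to display as "800% (1600%)".
    """
    used = sum(cpu for cpu, _ in workers)
    bound = 100 * sum(nthreads for _, nthreads in workers)
    return used, bound


def summarize_per_cpu(per_cpu):
    """Second alternative: summary statistics over per-CPU percentages."""
    if not per_cpu:
        raise ValueError("need at least one per-CPU value")
    if len(per_cpu) == 1:
        # statistics.quantiles needs at least two data points
        p20 = p80 = per_cpu[0]
    else:
        quintiles = statistics.quantiles(per_cpu, n=5, method="inclusive")
        p20, p80 = quintiles[0], quintiles[-1]
    return {
        "min": min(per_cpu),
        "max": max(per_cpu),
        "mean": statistics.fmean(per_cpu),
        "median": statistics.median(per_cpu),
        "p20": p20,
        "p80": p80,
    }


if __name__ == "__main__":
    print(sum_with_upper_bound([(800.0, 8), (0.0, 8)]))
    try:
        import psutil  # optional third-party dependency
    except ImportError:
        psutil = None
    if psutil is not None:
        # Samples every logical CPU over a 0.1 s window.
        print(summarize_per_cpu(psutil.cpu_percent(interval=0.1, percpu=True)))
```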

hendrikmakait • Feb 05 '24 08:02

  • xref https://github.com/dask/distributed/pull/3897

crusaderky • Feb 05 '24 10:02