Total CPU % on /workers tab makes little sense
From https://dask.discourse.group/t/dask-worker-using-600-of-the-cpu/2489/2
The CPU % on each individual worker scales from 0 to nthreads*100; e.g. on a worker with 8 threads it can go from 0% to 800%. This is consistent with several other CPU monitors in the wild, so it makes sense.
The CPU % on the Total row, however, is calculated as shown here: https://github.com/dask/distributed/blob/4425516f86d7de9aed06053fc3e21a17fe20efc4/distributed/dashboard/components/scheduler.py#L4303-L4308
So for example on a cluster with 2 workers, 8 threads per worker, if one worker is flat out busy while the other is idle, the Total will be 400%, which makes very little sense.
| name | nthreads | cpu |
|---|---|---|
| Total (2) | 16 | 400% |
| tcp://... | 8 | 800% |
| tcp://... | 8 | 0% |
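
To make the mismatch concrete, here is a minimal sketch of the aggregation described above (a paraphrase inferred from the example, not a verbatim excerpt of the linked `scheduler.py` code; the function name is made up):

```python
# Paraphrase of the current Total-row behaviour (illustrative only):
# average the per-worker CPU percentages, each of which lives on a
# 0..nthreads*100 scale.
def total_cpu_current(worker_cpu):
    return sum(worker_cpu) / len(worker_cpu)

# Two 8-thread workers: one flat out busy (800%), one idle (0%).
print(total_cpu_current([800.0, 0.0]))  # 400.0 -> the "400%" in the table above
```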
I think we should change the CPU % on each worker to go from 0 to 100%, and the Total row to do the same (total CPU usage across the cluster divided by the total number of threads). In the above example, that would become:
| name | nthreads | cpu |
|---|---|---|
| Total (2) | 16 | 50% |
| tcp://... | 8 | 100% |
| tcp://... | 8 | 0% |
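
A minimal sketch of the proposed normalization (function and variable names are illustrative): divide the summed CPU usage by the total thread count, so that both the per-worker rows and the Total row live on a 0–100% scale.

```python
# Proposed: normalize by thread count so every row is on a 0..100 scale
# (names are illustrative, not actual dashboard code).
def total_cpu_proposed(worker_cpu, worker_nthreads):
    return sum(worker_cpu) / sum(worker_nthreads)

# Per-worker rows would be normalized the same way: 800 / 8 -> 100%.
# Same cluster as above: 800% + 0% over 16 threads -> 50%.
print(total_cpu_proposed([800.0, 0.0], [8, 8]))  # 50.0
```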
I'm a bit torn here:
On the one hand, summing the CPU usage up feels useful for a single worker. For example, being stuck at 100% might indicate that we're not able to effectively use our multi-core CPUs. Being stuck at 12.5% (on an 8-core machine) feels less useful, particularly since we never tell you the number of cores on your machine.
On the other hand, summing the CPU usage up makes the total meaningless very quickly, in particular on an adaptive cluster. (Cool, CPU usage is >9000%... what does that even mean?)
There are a few alternatives that come to mind:
- Sum everything up, but provide an upper bound: 400% (800%) (presentation is TBD).
- Collect the CPU statistics per CPU (`psutil.cpu_percent(percpu=True)`) and calculate some truly meaningful statistics, e.g. min, max, mean, median, 20/80 pct, etc.; see the sketch after this list.
- xref https://github.com/dask/distributed/pull/3897
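
For the per-CPU alternative, a hedged sketch of what those summary statistics could look like (assumes Python 3.8+ for `statistics.quantiles` and a multi-core machine; in practice the sampling would happen on each worker and be shipped to the scheduler, but the summary logic is the same):

```python
import statistics

import psutil

# Sample each core separately: one 0-100 value per logical CPU.
per_cpu = psutil.cpu_percent(interval=0.5, percpu=True)

# Summarize the distribution instead of reporting a single aggregate;
# statistics.quantiles(..., n=5) yields the 20/40/60/80th percentiles.
q = statistics.quantiles(per_cpu, n=5)
print({
    "min": min(per_cpu),
    "max": max(per_cpu),
    "mean": statistics.mean(per_cpu),
    "median": statistics.median(per_cpu),
    "20pct": q[0],
    "80pct": q[3],
})
```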