`dask worker --nworkers -1` does not use all CPUs available
Describe the issue:
The documentation says that if the argument passed to --nworkers is negative, then (CPU_COUNT + 1 + nworkers) is used for the number of processes. I have 2 machines in the cluster, both with the same specs (nproc = 256); however, I do not get 256 workers.
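For reference, this is the arithmetic I expect from the documented formula (a quick sketch; the 256 is just the nproc value from these machines):
# Expected worker count per the documented formula, assuming CPU_COUNT resolves to 256
CPU_COUNT = 256
nworkers = -1
expected = CPU_COUNT + 1 + nworkers  # 256 + 1 - 1 = 256
print(expected)  # I expect 256 worker processes per host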
Full CPU specs:
> lscpu | egrep 'Model name|Socket|Core|Thread|NUMA|CPU\(s\)'
CPU(s): 256
On-line CPU(s) list: 0-255
Model name: AMD EPYC 7713 64-Core Processor
Thread(s) per core: 2
Core(s) per socket: 64
Socket(s): 2
NUMA node(s): 2
NUMA node0 CPU(s): 0-63,128-191
NUMA node1 CPU(s): 64-127,192-255
The number I actually get seems to fluctuate each time I run the dask worker
command. One server normally starts with around 210 workers, and the other with around 70 workers, but this changes. The UI therefore reports about 280 workers in total with 1 thread each.
Firstly, why the variability? And secondly, how can I maximise this count? The workloads I need to run are simple, single-process, medium-length tasks.
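In case it helps, this is the check I can run on each host to compare the CPU counts that Python and Dask report (assuming dask.system.CPU_COUNT is the value the worker CLI consults; I have not verified that against the source):
import os
from dask.system import CPU_COUNT  # assumption: the value the dask worker CLI uses

print("os.cpu_count():        ", os.cpu_count())
print("sched_getaffinity(0):  ", len(os.sched_getaffinity(0)))  # CPUs this process is allowed to run on
print("dask.system.CPU_COUNT: ", CPU_COUNT)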
Minimal Complete Verifiable Example:
# Scheduler
dask scheduler --host X.X.X.7
# Workers
dask worker tcp://X.X.X.7:8786 --nworkers -1
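To double-check the count reported by the UI, I also count the connected workers from a client (a sketch; scheduler_info() is the call I believe exposes the worker map):
# Count connected workers from a client session
from distributed import Client

client = Client("tcp://X.X.X.7:8786")
workers = client.scheduler_info()["workers"]
print(len(workers), "workers connected")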
Anything else we need to know?: Things I've ruled out:
- setting --nworkers explicitly to 256 behaves the same as -1
- also specifying --nthreads makes no change
- halving the nworkers to 128 works - I get 128 workers with 2 threads each. Is Dask hitting an upper limit here? If so, why is it variable between 70-210?
- nothing else is consuming significant resources on the servers
Environment:
- Dask version: 2024.6.2
- Python version: 3.10.12
- Operating System: Ubuntu 22.04.4 LTS
- Install method (conda, pip, source): pip