dask-cuda
[FEA] Consider using different default values for cluster configurations
In our gpu-bdb benchmarking, we found we needed to configure quite a few bash environment variables when setting up a dask-cuda cluster for optimal performance.
For example, if you're not using UCX, GPU-worker-to-GPU-worker communication over TCP is at best higher latency, if not outright slower. To avoid TCP connection failures we needed to reconfigure Dask's comm settings:
```bash
export DASK_DISTRIBUTED__COMM__TIMEOUTS__CONNECT="100s"
export DASK_DISTRIBUTED__COMM__TIMEOUTS__TCP="600s"
export DASK_DISTRIBUTED__COMM__RETRY__DELAY__MIN="1s"
export DASK_DISTRIBUTED__COMM__RETRY__DELAY__MAX="60s"
```
I suspect a great many dask-cuda users will be using TCP rather than UCX, so they would benefit from dask-cuda automatically reconfiguring Dask's TCP defaults.
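If dask-cuda were to do this automatically, one possible shape is a sketch like the following, which sets the same values through `dask.config` instead of shell exports (the numbers simply mirror the exports above and are not tuned recommendations):

```python
import dask

# Mirror the shell exports above via dask.config; these values come from our
# gpu-bdb runs and are not vetted defaults.
dask.config.set(
    {
        "distributed.comm.timeouts.connect": "100s",
        "distributed.comm.timeouts.tcp": "600s",
        "distributed.comm.retry.delay.min": "1s",
        "distributed.comm.retry.delay.max": "60s",
    }
)
```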
We also need to set `--memory-limit`, `--device-memory-limit`, and `--rmm-pool-size` values:
```bash
MAX_SYSTEM_MEMORY=$(free -m | awk '/^Mem:/{print $2}')M
DEVICE_MEMORY_LIMIT="25GB"
POOL_SIZE="30GB"
dask-cuda-worker --device-memory-limit $DEVICE_MEMORY_LIMIT --rmm-pool-size=$POOL_SIZE --memory-limit=$MAX_SYSTEM_MEMORY ...
```
It would be nice if dask-cuda included logic for detecting GPU memory per card and setting default values for the device memory limit and RMM pool size. As a starting point, on a 32 GB card a 25 GB device memory limit with a 30 GB pool size works well for the 30 TPCx-BB queries. I believe that's a decent representation of typical workflows, and those values could be scaled proportionally for other card memory sizes as well.
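A rough sketch of what that detection could look like with pynvml; the 25/32 and 30/32 fractions below simply generalize the 32 GB example and are not vetted defaults:

```python
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
total_bytes = pynvml.nvmlDeviceGetMemoryInfo(handle).total  # device memory on GPU 0, in bytes

# Scale the 25 GB / 30 GB values that worked on a 32 GB card proportionally
# to whatever the detected card size is.
device_memory_limit = int(total_bytes * 25 / 32)
rmm_pool_size = int(total_bytes * 30 / 32)

print(f"--device-memory-limit={device_memory_limit} --rmm-pool-size={rmm_pool_size}")
```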
I think having some of these defaults makes sense. In Dask core, `--memory-limit` is set to `auto` by default.
```python
In [1]: import pynvml

In [2]: pynvml.nvmlInit()

In [3]: handle = pynvml.nvmlDeviceGetHandleByIndex(0)

In [4]: info = pynvml.nvmlDeviceGetMemoryInfo(handle)

In [5]: info.total
Out[5]: 34089730048
```
Maybe `rmm-pool-size = total * 90%`? @pentschev @jakirkham, do you have thoughts here?
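For concreteness, a sketch of that heuristic; the 90% figure is just the suggestion above, and I'm assuming `LocalCUDACluster`'s `rmm_pool_size` accepts a byte count:

```python
from dask_cuda import LocalCUDACluster
import pynvml

pynvml.nvmlInit()
total = pynvml.nvmlDeviceGetMemoryInfo(pynvml.nvmlDeviceGetHandleByIndex(0)).total

# 90% of total device memory, ~30.7 GB for the 34089730048-byte card above.
cluster = LocalCUDACluster(rmm_pool_size=int(total * 0.9))
```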
@mt-jones may remember why we had to set `--memory-limit` specifically. There might have been a bug where it wasn't being set properly, or the heuristic Dask uses is inappropriate for, say, a DGX-1.
While I understand how changing such defaults makes sense for TPCx-BB, I'm not so sure it makes sense for everyone. I don't think it will be difficult to find people trying to run dask-cuda together with other applications that also require memory (either device or host), and that could be a bit annoying for those users. Think, for example, of someone running this on a workstation where one of the GPUs also drives the display. That's not to say I'm totally against the idea, but I'm a bit concerned about the broader usability of dask-cuda.
The `--memory-limit` issue was really a bug that got fixed in https://github.com/rapidsai/dask-cuda/pull/269 .
As for the communication variables, I think they were not for the workers (or not only, perhaps) but for the scheduler, am I mistaken? Changing the defaults for the scheduler would have to go to distributed upstream. As for workers, if we change those values we should probably check what the really appropriate numbers would be; I have a feeling those were just set arbitrarily high to make things work, was that not the case? And when we scale, would we have to scale the numbers too?
I was just thinking after writing the comment above, maybe we should have some sort of "default recipes" for different use cases? E.g., the TPCx-BB case could use a "performance recipe", while today's defaults could be something like a "conservative recipe" and so on. I think this would alleviate complexity for multiple uses while still providing defaults that should work mostly anywhere and have an easy switch.
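To make that concrete, a recipe could be little more than a named bundle of the settings discussed above. A sketch, where the recipe names, keys, and values are purely illustrative and not an existing dask-cuda interface:

```python
# Illustrative only -- neither these names nor these values exist in dask-cuda today.
RECIPES = {
    # Close to today's behavior: no RMM pool, modest device memory limit.
    "conservative": {
        "rmm_pool_size_fraction": None,
        "device_memory_limit_fraction": 0.8,
        "comm_overrides": {},
    },
    # Roughly the gpu-bdb/TPCx-BB settings discussed above.
    "performance": {
        "rmm_pool_size_fraction": 30 / 32,
        "device_memory_limit_fraction": 25 / 32,
        "comm_overrides": {
            "distributed.comm.timeouts.connect": "100s",
            "distributed.comm.timeouts.tcp": "600s",
            "distributed.comm.retry.delay.min": "1s",
            "distributed.comm.retry.delay.max": "60s",
        },
    },
}


def resolve_recipe(name: str) -> dict:
    """Return the settings bundle for a named recipe; a real implementation
    would translate these into dask.config values and worker arguments."""
    return RECIPES[name]
```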
I forgot to mention another related (potentially duplicate) issue: https://github.com/rapidsai/dask-cuda/issues/334 .
This issue has been marked stale due to no recent activity in the past 30d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be marked rotten if there is no activity in the next 60d.
This issue has been labeled `inactive-90d` due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.