Peter Andreas Entschev
> When reading through the documentation of spilling, https://docs.rapids.ai/api/dask-cuda/stable/spilling.html, it is unclear how the CPU-GPU communication is done if UCX is actively chosen as the communication protocol. Will workers write...
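For context, spilling and the communication protocol are configured independently of each other; a minimal sketch (the parameter values are illustrative, not from the thread):

```python
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# Spilling is driven by device_memory_limit and behaves the same regardless of
# the protocol used for worker-to-worker communication.
cluster = LocalCUDACluster(
    protocol="ucx",              # inter-worker transfers over UCX
    enable_nvlink=True,          # allow UCX to use NVLink where available
    device_memory_limit="16GB",  # spill device->host beyond this, per worker
)
client = Client(cluster)
```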
In the case above you're only using `cudaMemcpy`, which will work either way, even if it has to go through the host over PCIe. The original problem was with CUDA IPC...
Yes, that looks correct. Were you able to confirm the same on Dask-CUDA with multiple workers?
It looks fine, considering the known instability of results between iterations. Just to have one last metric, could you also run both by specifying `UCX_TLS=^cuda_ipc`?
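For reference, disabling the CUDA IPC transport only requires setting the environment variable before UCX is initialized in the process; a minimal sketch:

```python
import os

# The leading "^" means "all default transports except the ones listed",
# so this keeps the remaining transports but disables CUDA IPC.
os.environ["UCX_TLS"] = "^cuda_ipc"
```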
Yup, agreed this looks right. It seems we indeed don't need `CUDA_VISIBLE_DEVICES` to list all GPUs; listing only the first one should suffice. Therefore, the idea I have in mind is...
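To illustrate the proposal (values are hypothetical, assuming a 4-GPU machine and the worker assigned GPU 2):

```python
# Per-worker environment today vs. the proposal above.
current = {"CUDA_VISIBLE_DEVICES": "2,3,0,1"}  # all GPUs, own GPU listed first
proposed = {"CUDA_VISIBLE_DEVICES": "2"}       # only the worker's own GPU
```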
> I think this is easy because:
>
> ```python
> import numba.cuda
>
> uuid2devnum = dict((dev.uuid, dev.id) for dev in numba.cuda.list_devices())
> ```

Sorry, I missed this, but...
> Once we decide on a replacement, it would be good to solicit feedback from other users to make sure it still satisfies their use cases Do you have a...
> Hi, if this feature is planned for future release, would be great if there's a method abstracted by dask for downstream projects to obtain the GPU ordinal. Currently, XGBoost...
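As an illustration of what downstream projects can do today in the absence of such an abstraction (this helper is hypothetical, not an existing Dask API):

```python
import os

def first_visible_device():
    """Hypothetical helper: inside a Dask-CUDA worker, the GPU assigned to the
    process is the first entry of CUDA_VISIBLE_DEVICES, which may be an integer
    ordinal or a GPU UUID."""
    visible = os.environ.get("CUDA_VISIBLE_DEVICES")
    return visible.split(",")[0] if visible else None
```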
I'm not sure if you're trying to do something much more specific, but one of the features Dask-CUDA provides is setting CPU affinity (NUMA-ness) for each GPU, see https://github.com/rapidsai/dask-cuda/blob/09196cb5c92effca6da660231e58a5bf4ac72c76/dask_cuda/cuda_worker.py#L215 and https://github.com/rapidsai/dask-cuda/blob/09196cb5c92effca6da660231e58a5bf4ac72c76/dask_cuda/local_cuda_cluster.py#L311...
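As a rough sketch of what that affinity-setting amounts to (this is not the linked Dask-CUDA code, just NVML queried directly through pynvml):

```python
import os
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0

# Bitmask of CPUs local to this GPU's NUMA node, returned as 64-bit words
# (4 words covers up to 256 CPUs).
words = pynvml.nvmlDeviceGetCpuAffinity(handle, 4)
cpus = [
    64 * i + bit
    for i, word in enumerate(words)
    for bit in range(64)
    if (word >> bit) & 1
]

# Pin the current process to the CPUs closest to that GPU.
os.sched_setaffinity(0, cpus)
```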
Thanks for the details @lmeyerov. As you noticed, `--local-directory` is something you could set to have spilling on a high-bandwidth filesystem. However, the rest of your question...
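For illustration, the equivalent when using `LocalCUDACluster` (the path and limits below are hypothetical):

```python
from dask_cuda import LocalCUDACluster

cluster = LocalCUDACluster(
    local_directory="/mnt/nvme/dask-scratch",  # scratch space for host->disk spilling
    device_memory_limit="16GB",                # spill GPU->host beyond this
    memory_limit="64GB",                       # host memory budget before spilling to disk
)
```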