
[FEA] cuML comms to use single UCX endpoint on each worker process

Open cjnolet opened this issue 6 years ago • 4 comments

There is currently a bug in UCX that is caused by creating and destroying multiple endpoints on a single process. This bug causes each endpoint to sometimes reopen CUDA IPC handles from its local cache, even when another endpoint on the same process may have already opened them. Since a CUDA IPC handle cannot be opened more than once by the same process, an exception is raised. UCX provides an environment variable, `UCX_CUDA_IPC_CACHE`, that allows this caching to be disabled; however, as @pentschev mentions, this comes at a performance cost and risks degrading the transport speed to that of TCP.

An additional challenge lies in setting this environment variable: the Dask cluster may already have initialized UCP by the time cuML algorithms are invoked, in which case it is too late for cuML to set the variable itself and the user must set it explicitly.
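To illustrate the ordering constraint (a minimal sketch; the helper name is hypothetical, and this does not reflect cuML's actual API):

```python
import os

def disable_ucx_ipc_cache():
    """Hypothetical helper: disable the UCX CUDA IPC cache.

    UCX reads UCX_* environment variables once, when UCP is initialized,
    so this must run *before* Dask / ucx-py initialize UCP. Setting the
    variable afterwards has no effect, which is why cuML cannot disable
    the cache on the user's behalf once the cluster is up.
    """
    os.environ["UCX_CUDA_IPC_CACHE"] = "n"

disable_ucx_ipc_cache()
```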

The long-term solution for this problem is to fix the CUDA IPC transport in UCT; @pentschev and I will engage @Akshay-Venkatesh when his schedule settles a bit (May timeframe, maybe?).

A shorter-term solution is to have cuML guarantee that only a single endpoint is ever created per worker, which also means using Dask's endpoints when the cluster has been started with `protocol=ucx`. This should be possible since cuML uses its own callback functions and tags. I have verified this with @madsbk as well.

In the meantime, a quick PR can be filed immediately to

  1. Provide a warning to users who did not start their Dask cluster with the `UCX_CUDA_IPC_CACHE` environment variable. Note that users may not experience this exception if they only run a small cuML nearest-neighbors job once on their cluster. The reason this is such a high priority, however, is that most users are expected to be doing data exploration and potentially submitting several jobs to the cluster to get the desired results.

  2. Explicitly initialize UCP (with `UCX_CUDA_IPC_CACHE=n`) when the Dask protocol != ucx. If we explicitly initialize UCP, users would still need to manually set `UCX_CUDA_IPC_CACHE=n` when `protocol=ucx`, but this is just a stopgap until the shorter-term solution of sharing endpoints on the workers.
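The warning in item 1 could look something like this (a hedged sketch only; the function name and the set of accepted values are assumptions, not cuML's actual implementation):

```python
import os
import warnings

def warn_if_ipc_cache_enabled():
    """Hypothetical check for item 1: warn users whose Dask cluster was
    started without disabling the UCX CUDA IPC cache."""
    value = os.environ.get("UCX_CUDA_IPC_CACHE", "y").lower()
    if value not in ("n", "no", "0"):
        warnings.warn(
            "UCX_CUDA_IPC_CACHE is enabled; repeated cuML comms jobs may "
            "fail when a second endpoint on the same worker reopens a "
            "cached CUDA IPC handle. Start the Dask cluster with "
            "UCX_CUDA_IPC_CACHE=n to avoid this."
        )
```

Such a check would run once when cuML comms are initialized on each worker, before any endpoints are created.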

cjnolet avatar Apr 07 '20 14:04 cjnolet

https://github.com/rapidsai/ucx-py/issues/454

cjnolet avatar Apr 07 '20 14:04 cjnolet

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

github-actions[bot] avatar Mar 14 '21 19:03 github-actions[bot]

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions[bot] avatar Mar 14 '21 19:03 github-actions[bot]

It looks like this was resolved. @cjnolet is this still relevant?

beckernick avatar Sep 12 '22 15:09 beckernick