UCT/CUDA_COPY: add multi-device support in cuda_copy

Open Akshay-Venkatesh opened this issue 1 year ago • 4 comments

What/Why?

Allow a single UCP context to handle multiple CUDA devices for the cuda_copy transport. This enables use cases under Legion/Realm, OpenACC, and MPI workloads that prefer a 1:N process-to-GPU mapping over the current default 1:1 mapping.

How?

CUDA stream and event resources, which were previously tied to the iface, are now tied to each newly detected CUDA device context. When resources are needed, the context ID is looked up in a hash table and the resources belonging to that context are picked.
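To illustrate the lookup described above, here is a minimal sketch in plain C, not the code from this PR. All names (`ctx_table_t`, `ctx_table_get`, `CTX_TABLE_SIZE`) are hypothetical, the integer fields stand in for real `CUstream`/`CUevent` handles, and a small fixed-size open-addressing table is assumed in place of whatever hash table the implementation actually uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CTX_TABLE_SIZE 16 /* hypothetical capacity, must be a power of two */

/* Per-context resources; the ints are placeholders for CUDA handles. */
typedef struct {
    uint64_t ctx_id;         /* key: CUDA context ID */
    int      in_use;         /* slot occupied? */
    int      stream_handle;  /* placeholder for a CUstream */
    int      events_created; /* placeholder for an event pool */
} ctx_resources_t;

typedef struct {
    ctx_resources_t slots[CTX_TABLE_SIZE];
    size_t          count;
} ctx_table_t;

static void ctx_table_init(ctx_table_t *table)
{
    memset(table, 0, sizeof(*table));
}

/* Return the resources for ctx_id, creating them the first time this
 * context is seen. Returns NULL if the table is full. Entries are never
 * removed, so plain linear probing is sufficient. */
static ctx_resources_t *ctx_table_get(ctx_table_t *table, uint64_t ctx_id)
{
    size_t mask  = CTX_TABLE_SIZE - 1;
    size_t start = (size_t)ctx_id & mask;
    size_t i;

    assert(table != NULL);

    for (i = 0; i < CTX_TABLE_SIZE; i++) {
        ctx_resources_t *slot = &table->slots[(start + i) & mask];

        if (slot->in_use && (slot->ctx_id == ctx_id)) {
            return slot; /* context already known */
        }

        if (!slot->in_use) {
            /* Newly detected context: set up its resources here. */
            slot->ctx_id         = ctx_id;
            slot->in_use         = 1;
            slot->stream_handle  = (int)table->count; /* placeholder value */
            slot->events_created = 0;
            table->count++;
            return slot;
        }
    }

    return NULL; /* table full */
}
```

A transport operation would call `ctx_table_get()` with the ID of the current context and use the returned slot's stream and events, so each device context only ever touches resources created under it.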

TODO

  1. ~~Need a way to detect if the CUDA context is destroyed before destroying the stream/event resources associated with that context~~ (resources will not be cleaned up explicitly; this is left to the OS)
  2. ~~Need to check if stream bitmap is needed for flush operations and flush each individually using streamsync~~ (removed)

Akshay-Venkatesh avatar Jan 30 '24 18:01 Akshay-Venkatesh

@brminich I see one of the commits had an extra colon, and two commit-style tests are failing because of that. Would it be OK to rebase? I can wait to do this until all the reviewers have had a chance to look at my comments and code changes.

cc @rakhmets @SeyedMir

Akshay-Venkatesh avatar Feb 23 '24 17:02 Akshay-Venkatesh

@Akshay-Venkatesh Rebase is fine with me.

SeyedMir avatar Feb 23 '24 17:02 SeyedMir

@Akshay-Venkatesh, no problem from my side

brminich avatar Feb 23 '24 17:02 brminich

@brminich @rakhmets @SeyedMir

FYI, in https://github.com/openucx/ucx/pull/9645/commits/dd8b66d905c3363cec94554c8f16d70a2966adb9 I had to remove all code that calls EventDestroy or StreamDestroy, because CUDA doesn't have a way to query whether a given CUcontext has been destroyed, and calling Stream/EventDestroy on streams/events whose context has been destroyed is potentially unsafe. For this reason cleanup has to be left until the process itself is cleaned up. This should be safe from UCX's viewpoint, as all UCT resources are tied to some UCP context and there is no concern about reusing streams/events that haven't been cleaned up (they are not global).

Also, it looks like cuCtxGetId is only supported for CUDA >= 12.0. Without a context ID, we don't have a way to query which context we're trying to use and pick the associated stream/event resources for transport operations. We cannot use the CUcontext handle itself in place of the context ID, because we cannot assume that, say, cuCtxGetCurrent always returns the same handle for a given context rather than a different handle with the same properties. So it seems that multi-device support will need CUDA >= 12.0. We should discuss this further.
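One way the CUDA >= 12.0 requirement could be expressed is a simple version gate. The sketch below is an assumption about how such a check might look, not code from this PR: `multi_device_supported` and `cuda_version_decode` are hypothetical helpers, and the version integer is assumed to use CUDA's `1000 * major + 10 * minor` encoding, which is what the `CUDA_VERSION` macro and `cuDriverGetVersion` report.

```c
#include <stdbool.h>

/* CUDA 12.0 in the 1000 * major + 10 * minor encoding. */
#define MIN_CTX_ID_VERSION 12000

typedef struct {
    int major;
    int minor;
} cuda_version_t;

/* Split an encoded CUDA version (e.g. 12040) into major and minor. */
static cuda_version_t cuda_version_decode(int version)
{
    cuda_version_t v;

    v.major = version / 1000;
    v.minor = (version % 1000) / 10;
    return v;
}

/* Hypothetical gate: multi-device cuda_copy needs context IDs, which
 * per the discussion above are available from CUDA 12.0 onward. */
static bool multi_device_supported(int version)
{
    return version >= MIN_CTX_ID_VERSION;
}
```

In a real build the argument would come from `CUDA_VERSION` at compile time or from `cuDriverGetVersion` at run time; with an older CUDA the transport could fall back to the existing 1:1 behavior.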

Akshay-Venkatesh avatar Feb 28 '24 23:02 Akshay-Venkatesh