ucx
UCT/CUDA_COPY: add multi-device support in cuda_copy
What/Why?
Allow a single UCP context to handle multiple CUDA devices for the cuda_copy transport. This enables use cases under Legion/Realm, OpenACC, and MPI workloads that prefer a 1:N process-to-GPU mapping rather than the current default 1:1 mapping.
How?
CUDA stream and event resources, which were previously tied to the iface, are now tied to each newly detected CUDA device context. When resources are needed, the current context ID is looked up in a hash table and the resources associated with that context are used.
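The lookup described above could be sketched roughly as follows. This is a minimal illustration and not the code in this PR: `ctx_rsc_t`, `iface_t`, and `iface_get_ctx_rsc` are hypothetical names, the integer `stream` field stands in for real CUDA stream/event handles, and a small chained hash table is assumed in place of whatever hash implementation UCX actually uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical per-context resources. A real implementation would hold
 * CUDA stream and event handles here; plain integers are placeholders. */
typedef struct ctx_rsc {
    unsigned long long ctx_id; /* key: ID of the CUDA context */
    int                stream; /* placeholder for a stream handle */
    struct ctx_rsc    *next;   /* chaining within one bucket */
} ctx_rsc_t;

#define CTX_RSC_BUCKETS 16

/* Hypothetical iface: owns the table instead of owning streams directly. */
typedef struct {
    ctx_rsc_t *buckets[CTX_RSC_BUCKETS];
    int        next_stream; /* placeholder "stream creation" counter */
} iface_t;

static size_t ctx_rsc_hash(unsigned long long ctx_id)
{
    return (size_t)(ctx_id % CTX_RSC_BUCKETS);
}

/* Return the resources for ctx_id, creating them the first time this
 * context is seen. Returns NULL if allocation fails. */
static ctx_rsc_t *iface_get_ctx_rsc(iface_t *iface, unsigned long long ctx_id)
{
    size_t     idx = ctx_rsc_hash(ctx_id);
    ctx_rsc_t *rsc;

    assert(iface != NULL);

    /* Fast path: this context has already been detected. */
    for (rsc = iface->buckets[idx]; rsc != NULL; rsc = rsc->next) {
        if (rsc->ctx_id == ctx_id) {
            return rsc;
        }
    }

    /* Slow path: newly detected context, allocate its resources. */
    rsc = calloc(1, sizeof(*rsc));
    if (rsc == NULL) {
        return NULL;
    }

    rsc->ctx_id         = ctx_id;
    rsc->stream         = iface->next_stream++;
    rsc->next           = iface->buckets[idx];
    iface->buckets[idx] = rsc;
    return rsc;
}
```

With a zero-initialized `iface_t`, calling `iface_get_ctx_rsc(&iface, id)` twice with the same ID returns the same entry, while a previously unseen ID gets its own entry. In line with the cleanup decision discussed in this PR, the sketch never frees entries.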
TODO
- ~~Need a way to detect if the CUDA context is destroyed before destroying stream/event resources associated with that context~~ (resources will not be cleaned up explicitly; this is left to the OS)
- ~~Need to check if stream bitmap is needed for flush operations and flush each individually using streamsync~~ (removed)
@brminich I see one of the commits had an extra colon and 2 commit style tests are failing because of that. Would it be ok to rebase? I can wait to do this until all the reviewers have had a chance to look at my comments and code changes.
cc @rakhmets @SeyedMir
@Akshay-Venkatesh Rebase is fine with me.
@Akshay-Venkatesh, no problem from my side
@brminich @rakhmets @SeyedMir
FYI, in https://github.com/openucx/ucx/pull/9645/commits/dd8b66d905c3363cec94554c8f16d70a2966adb9 I had to remove all code that calls EventDestroy or StreamDestroy, as CUDA doesn't provide a way to query whether a given CUcontext has been destroyed, and calling Stream/EventDestroy on streams/events whose context has been destroyed is potentially unsafe. For this reason, we will have to defer cleanup to the point when the process is torn down. This should be safe from UCX's viewpoint, as all UCT resources are tied to some UCP context and there is no concern about reusing streams/events that haven't been cleaned up (they are not global).
Also, it looks like `cuCtxGetId` is supported only for CUDA >= 12.0. Without a context ID, we have no way to identify which context we're trying to use and pick the associated stream/event resources for transport operations. We cannot use the CUcontext handle itself instead of the context ID, because we cannot assume that a call such as `cuCtxGetCurrent` will always return the same handle, as opposed to a different handle that has the same properties. So it seems that multi-device support will need CUDA >= 12.0. We should discuss this further.
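One way to express the version requirement is a simple gate on the CUDA version number, sketched below. This is only an illustration under stated assumptions: `multi_device_supported` and `MIN_CTX_ID_VERSION` are hypothetical names that are not part of UCX, and the input is assumed to use the usual CUDA encoding of major * 1000 + minor * 10 (so 12.0 is 12000).

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed minimum version that provides context IDs, in CUDA's
 * major * 1000 + minor * 10 encoding (12.0 -> 12000). */
#define MIN_CTX_ID_VERSION 12000

/* Hypothetical helper: decide whether multi-device mode can be enabled
 * for the given CUDA version. Older versions would keep the existing
 * single-device behavior. */
static bool multi_device_supported(int cuda_version)
{
    assert(cuda_version >= 0);
    return cuda_version >= MIN_CTX_ID_VERSION;
}
```

Whether this check should happen at build time, at run time, or both is part of the open discussion above.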