Akshay-Venkatesh
The Reduce_local implementation is missing, which causes failures in IMB. This implementation piggybacks on the existing CUDA reduce implementation to stage and unstage the send/receive buffers. bot:notacherrypick
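The staging approach described above can be sketched roughly as follows. This is only an illustration, not the code from the PR: `stage_copy`, `host_sum_int`, and `reduce_local_staged` are hypothetical names, `memcpy` stands in for the device/host copies, and a plain integer sum stands in for the existing host reduction.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a device<->host copy; a real implementation
 * would use a CUDA copy routine here instead of memcpy. */
static void stage_copy(void *dst, const void *src, size_t bytes)
{
    memcpy(dst, src, bytes);
}

/* Hypothetical stand-in for the existing host-side reduction (sum of ints). */
static void host_sum_int(const int *in, int *inout, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        inout[i] += in[i];
    }
}

/* Reduce_local semantics: inout[i] = in[i] op inout[i].
 * Returns 0 on success, -1 if a staging buffer cannot be allocated. */
int reduce_local_staged(const int *dev_in, int *dev_inout, size_t count)
{
    size_t bytes = count * sizeof(int);
    int *host_in, *host_inout;

    if (count == 0) {
        return 0; /* nothing to reduce */
    }

    host_in    = malloc(bytes);
    host_inout = malloc(bytes);
    if (host_in == NULL || host_inout == NULL) {
        free(host_in);
        free(host_inout);
        return -1;
    }

    /* Stage: bring both operands into host memory. */
    stage_copy(host_in, dev_in, bytes);
    stage_copy(host_inout, dev_inout, bytes);

    /* Reduce on the host using the existing operator. */
    host_sum_int(host_in, host_inout, count);

    /* Unstage: write the result back to the receive buffer. */
    stage_copy(dev_inout, host_inout, bytes);

    free(host_in);
    free(host_inout);
    return 0;
}
```

Staging through host memory trades two extra copies per call for the ability to reuse an operator that already exists, which is usually acceptable for a correctness fix.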
## What Follow-up to https://github.com/openucx/ucx/pull/9982. This PR caches the import of a remotely exported handle for a custom CUDA memory pool, because the mapping operation via `cuMemPoolImportFromShareableHandle` is expensive.
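A minimal sketch of the caching idea, assuming a small direct-mapped table keyed by an identifier of the exported handle. All names here (`import_cache_t`, `import_cache_get`, `expensive_import`) are hypothetical and are not taken from UCX; `expensive_import` merely stands in for the costly `cuMemPoolImportFromShareableHandle` call so the example runs without CUDA.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CACHE_SLOTS 8

/* One cached import: the key identifies the remote exported handle,
 * and 'imported' holds the result of the expensive import. */
typedef struct {
    uint64_t export_key;
    void    *imported;
    int      valid;
} import_cache_entry_t;

typedef struct {
    import_cache_entry_t slots[CACHE_SLOTS];
    unsigned             import_calls; /* number of expensive imports done */
} import_cache_t;

/* Hypothetical stand-in for the expensive import operation. */
static void *expensive_import(import_cache_t *cache, uint64_t export_key)
{
    static char pool_storage[CACHE_SLOTS];

    cache->import_calls++;
    return &pool_storage[export_key % CACHE_SLOTS];
}

void import_cache_init(import_cache_t *cache)
{
    memset(cache, 0, sizeof(*cache));
}

/* Return the imported object for export_key, importing only on a miss. */
void *import_cache_get(import_cache_t *cache, uint64_t export_key)
{
    import_cache_entry_t *entry;

    assert(cache != NULL);
    entry = &cache->slots[export_key % CACHE_SLOTS];

    if (entry->valid && (entry->export_key == export_key)) {
        return entry->imported; /* hit: skip the expensive import */
    }

    /* Miss: import and overwrite the slot. Real code would also have to
     * release whatever mapping the evicted entry held. */
    entry->imported   = expensive_import(cache, export_key);
    entry->export_key = export_key;
    entry->valid      = 1;
    return entry->imported;
}
```

Repeated lookups with the same key reuse the first import, so the expensive call is paid once per distinct handle as long as the entry is not evicted.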
## What When one of the devices passed to `ucs_topo_get_distance` is a GPU device, let NVML provide the estimation of latency and bandwidth between the GPU device and 1. another...
## Why? Allow memory allocated with `cuMemCreate` to be registered with IB.
Port of https://github.com/open-mpi/ompi/pull/12835