UCS/TOPO: NVML topology module
What
When one of the devices passed to ucs_topo_get_distance is a GPU device, let NVML provide the latency and bandwidth estimates between that GPU device and:
1. another GPU device (including itself)
2. a CPU device (when sys_device = DEVICE_UNKNOWN)
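The selection rule described above can be sketched as follows. This is only an illustration of the rule as stated in this description, not the code in this PR: the `dev_kind_t` and `estimator_t` enums and the `select_estimator` function are hypothetical names, and the real `ucs_topo_get_distance` works on UCX system device identifiers rather than on these types.

```c
/* Hypothetical device kinds, for illustration only. UNKNOWN stands for the
 * "sys_device = DEVICE_UNKNOWN" case, which is treated as the CPU side. */
typedef enum {
    DEV_KIND_UNKNOWN,
    DEV_KIND_GPU,
    DEV_KIND_OTHER
} dev_kind_t;

/* Hypothetical result: which estimator should produce the distance. */
typedef enum {
    EST_DEFAULT,      /* no GPU involved in an NVML-covered way */
    EST_NVML_GPU_GPU, /* GPU to GPU, including a GPU to itself */
    EST_NVML_GPU_CPU  /* GPU to CPU (other device is unknown) */
} estimator_t;

estimator_t select_estimator(dev_kind_t a, dev_kind_t b)
{
    if ((a == DEV_KIND_GPU) && (b == DEV_KIND_GPU)) {
        return EST_NVML_GPU_GPU;
    }

    /* The rule is symmetric: either argument may be the GPU. */
    if (((a == DEV_KIND_GPU) && (b == DEV_KIND_UNKNOWN)) ||
        ((a == DEV_KIND_UNKNOWN) && (b == DEV_KIND_GPU))) {
        return EST_NVML_GPU_CPU;
    }

    /* Everything else falls back to the existing (non-NVML) estimate. */
    return EST_DEFAULT;
}
```

In this sketch a GPU paired with any other known device kind falls back to the default estimate, which is an assumption; the description only lists the two NVML-handled cases.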
Why?
More accurate estimates of CPU<->GPU bandwidth and latency are needed for upcoming platforms.
Limitations
- Sys_devices provided to get_distance are local, so there is currently no way to use this with an actual remote device and get a correct estimate of the bandwidth between 2 GPUs connected by NVLINK/NVSwitch.
- The CPU<->GPU path isn't actually checked; peak PCIe bandwidth is assumed.
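To make "peak PCIe bandwidth" concrete, here is a minimal sketch of how a theoretical peak can be derived from a link's generation and width. It is not taken from this PR, and the PR may compute its value differently. The function name `pcie_peak_bw` is hypothetical; the inputs could come from NVML queries such as `nvmlDeviceGetMaxPcieLinkGeneration` and `nvmlDeviceGetMaxPcieLinkWidth`, but here they are plain arguments so the example stands alone. The result covers one direction and accounts only for line encoding, not for packet or protocol overhead.

```c
#include <stddef.h>

/* Per-lane raw transfer rate in GT/s for PCIe generations 1 through 5. */
static const double pcie_gts_per_lane[] = {2.5, 5.0, 8.0, 16.0, 32.0};

/*
 * Theoretical peak bandwidth of a PCIe link in bytes per second, one
 * direction. Returns 0.0 for generations outside the table above.
 */
double pcie_peak_bw(unsigned gen, unsigned lanes)
{
    size_t num_gens = sizeof(pcie_gts_per_lane) / sizeof(pcie_gts_per_lane[0]);
    double encoding;

    if ((gen < 1) || (gen > num_gens)) {
        return 0.0;
    }

    /* Gen1 and Gen2 use 8b/10b encoding; Gen3 to Gen5 use 128b/130b. */
    encoding = (gen <= 2) ? (8.0 / 10.0) : (128.0 / 130.0);

    /* GT/s -> bits/s, apply encoding, scale by lane count, bits -> bytes. */
    return pcie_gts_per_lane[gen - 1] * 1e9 * encoding * lanes / 8.0;
}
```

For a Gen3 x16 link this gives roughly 15.75 GB/s, and for Gen4 x16 roughly 31.5 GB/s. Measured CPU<->GPU throughput is lower than these figures, which is why assuming the peak is listed as a limitation.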