
UCS/TOPO: NVML topology module

Open Akshay-Venkatesh opened this issue 1 year ago • 0 comments

What

When one of the devices passed to ucs_topo_get_distance is a GPU device, let NVML provide the latency and bandwidth estimate between that GPU device and either (1) another GPU device (including itself) or (2) a CPU device (when sys_device = DEVICE_UNKNOWN).
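The selection rule described above can be sketched roughly as follows. This is only an illustration of the dispatch logic: `dev_kind_t`, `dist_provider_t`, and `select_provider` are hypothetical names invented for this sketch, not UCX or NVML definitions, and treating an unknown device as the CPU is taken from the description in this issue.

```c
/* Hypothetical stand-ins for illustration; these are NOT the UCX types. */
typedef enum {
    DEV_KIND_UNKNOWN, /* models sys_device = DEVICE_UNKNOWN, treated as CPU */
    DEV_KIND_GPU,     /* a GPU device */
    DEV_KIND_OTHER    /* any other known device, e.g. a NIC */
} dev_kind_t;

typedef enum {
    DIST_PROVIDER_DEFAULT, /* whatever estimate would be used without NVML */
    DIST_PROVIDER_NVML     /* estimate supplied by the NVML topology module */
} dist_provider_t;

/* Returns 1 if NVML is expected to cover the peer of a GPU device:
 * another GPU (or the same GPU), or the CPU (modelled as UNKNOWN). */
static int nvml_covers_peer(dev_kind_t peer)
{
    return (peer == DEV_KIND_GPU) || (peer == DEV_KIND_UNKNOWN);
}

/* Picks the provider for a pair of devices. The rule is symmetric:
 * it does not matter which of the two arguments is the GPU. */
static dist_provider_t select_provider(dev_kind_t a, dev_kind_t b)
{
    if ((a == DEV_KIND_GPU) && nvml_covers_peer(b)) {
        return DIST_PROVIDER_NVML;
    }
    if ((b == DEV_KIND_GPU) && nvml_covers_peer(a)) {
        return DIST_PROVIDER_NVML;
    }
    return DIST_PROVIDER_DEFAULT;
}
```

Under this reading, a GPU paired with some other known device (for example a NIC) falls back to the default estimate, since the issue only lists GPU and CPU peers for the NVML path.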

Why?

More accurate estimates of bandwidth and latency between CPU<->GPU are needed for upcoming platforms.

Limitations

  • The sys_devices passed to get_distance are local, so there is currently no way to use this with an actual remote device and get a correct bandwidth estimate between two GPUs connected by NVLink/NVSwitch.
  • The CPU<->GPU path isn't actually checked; peak PCIe bandwidth is assumed.
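To make "peak PCIe bandwidth" concrete, here is a minimal sketch of how a theoretical per-direction peak could be computed from a link generation and lane count. The function name `pcie_peak_bandwidth_bytes` is hypothetical, and the issue does not say which generation or width the module assumes; the per-lane rates and encodings are the standard PCIe figures for Gen1 through Gen5. The result ignores packet and protocol overhead, so achievable throughput is lower.

```c
/* Hypothetical helper, not part of UCX or NVML.
 * Returns the theoretical peak bandwidth in bytes per second for one
 * direction of a PCIe link, or -1.0 for unsupported arguments. */
static double pcie_peak_bandwidth_bytes(int gen, int lanes)
{
    double gigatransfers; /* per-lane raw rate in GT/s */
    double efficiency;    /* payload fraction left after line encoding */

    if (lanes <= 0) {
        return -1.0;
    }

    switch (gen) {
    case 1: gigatransfers = 2.5;  efficiency = 8.0 / 10.0;    break; /* 8b/10b */
    case 2: gigatransfers = 5.0;  efficiency = 8.0 / 10.0;    break; /* 8b/10b */
    case 3: gigatransfers = 8.0;  efficiency = 128.0 / 130.0; break; /* 128b/130b */
    case 4: gigatransfers = 16.0; efficiency = 128.0 / 130.0; break; /* 128b/130b */
    case 5: gigatransfers = 32.0; efficiency = 128.0 / 130.0; break; /* 128b/130b */
    default:
        return -1.0;
    }

    /* Each transfer carries one bit per lane; divide by 8 for bytes. */
    return gigatransfers * 1e9 * efficiency * lanes / 8.0;
}
```

For a Gen3 x16 link this gives roughly 15.75 GB/s per direction. An estimate of this kind reflects the link's nominal capability only, which is why the limitation above matters: the actual CPU<->GPU path is not inspected.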

Akshay-Venkatesh, Aug 25 '23 20:08