opal/mca/ofi: select NIC closest to accelerator if requested

Open wenduwan opened this issue 1 year ago • 3 comments

This patch introduces a new capability: selecting the NIC closest to the user-requested accelerator (PCI) device. The implementation should suit all accelerator types, i.e. cuda and rocm. This change addresses https://github.com/open-mpi/ompi/issues/11696

In this patch, we introduce override logic that applies when an accelerator has been initialized: instead of selecting a NIC on the same package, we select the NIC closest to the accelerator (which might be on a different package).

The implementation depends on the following APIs:

  • accelerator.get_device_pci_attr: Retrieve the PCI info of the accelerator.
  • hwloc_get_pcidev_by_busid: Get the hwloc objects of the accelerator and the provider (NIC).
  • hwloc_get_common_ancestor_obj: Get the closest common ancestor hwloc object between the accelerator and provider

The NIC selection logic can be summarized as follows:

  • Among the available NICs, find those closest to the accelerator device. We deliberately do not use pmix_device_distance_t or hwloc_distances_s, for practical reasons: they are not computable on every platform, e.g. on AWS EC2 we cannot reliably get such values between GPU and NIC. Instead, device proximity is measured as the depth of the common ancestor (a deeper common ancestor means a closer NIC); see https://www.open-mpi.org/projects/hwloc/doc/v2.9.1/a00359.php
  • When there is a tie, break it with a modulo: (local rank on the same accelerator) % (number of nearest providers). Note that we do not have a good way to calculate the local rank on the same accelerator, so we reuse the local rank on the same package as a proxy.
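
The two steps above can be sketched in plain C. This is only an illustration under stated assumptions, not the code in this patch: `struct toy_obj`, `common_ancestor_depth` and `select_nic` are hypothetical names, and the toy tree stands in for the hwloc topology, where the real objects would come from hwloc_get_pcidev_by_busid and the ancestor from hwloc_get_common_ancestor_obj. The `local_rank` argument is the package-local rank used as a proxy, as described above.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical stand-in for an hwloc object: a parent link and a tree depth. */
struct toy_obj {
    const struct toy_obj *parent; /* NULL for the root */
    int depth;                    /* root is 0, a child is parent depth + 1 */
};

/* Depth of the deepest ancestor shared by a and b; larger means closer.
 * Returns -1 if the two objects do not belong to the same tree. */
int common_ancestor_depth(const struct toy_obj *a, const struct toy_obj *b)
{
    assert(a != NULL && b != NULL);
    while (a != b && a != NULL && b != NULL) {
        if (a->depth > b->depth) {
            a = a->parent;          /* a is deeper: climb on a's side */
        } else if (b->depth > a->depth) {
            b = b->parent;          /* b is deeper: climb on b's side */
        } else {
            a = a->parent;          /* same depth, different objects */
            b = b->parent;
        }
    }
    return (a != NULL && a == b) ? a->depth : -1;
}

/* Pick the index of the NIC to use, or -1 if there are no NICs.
 * Step 1: find the largest common-ancestor depth with the accelerator.
 * Step 2: among the NICs at that depth, pick number
 *         local_rank % (number of nearest NICs). */
int select_nic(const struct toy_obj *accel,
               const struct toy_obj *const nics[], int num_nics,
               int local_rank)
{
    int best = INT_MIN;
    int num_nearest = 0;

    assert(local_rank >= 0);

    for (int i = 0; i < num_nics; i++) {
        int d = common_ancestor_depth(accel, nics[i]);
        if (d > best) {
            best = d;               /* new closest distance found */
            num_nearest = 1;
        } else if (d == best) {
            num_nearest++;          /* another NIC tied for closest */
        }
    }
    if (num_nearest == 0) {
        return -1;
    }

    int pick = local_rank % num_nearest;
    for (int i = 0; i < num_nics; i++) {
        if (common_ancestor_depth(accel, nics[i]) == best && pick-- == 0) {
            return i;
        }
    }
    return -1; /* not reached when num_nearest > 0 */
}
```

With two NICs tied as nearest, consecutive local ranks alternate between them, which spreads processes sharing an accelerator across the equally close providers.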

wenduwan • May 24 '23 00:05