gloo
Take CUDA peer access into account for on-device reduction
The NVLink cube mesh architecture has partial peer access between devices: two groups of four GPUs each have full peer access within the group, and every GPU in one group has peer access to exactly one corresponding GPU in the other group. When we reduce across all 8 GPUs, tree reduction using peer access must therefore be done separately within each group, followed by a final reduction across any one of the pairs connecting the two groups. This change refactors CudaDeviceReduce to work with this topology.
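A minimal CPU-only sketch of the schedule described above (this is an illustration of the idea, not Gloo's actual implementation; the `canAccessPeer` and `reductionSchedule` helpers are hypothetical names). It models the cube-mesh peer-access pattern for 8 GPUs, with devices 0-3 in one group and 4-7 in the other, and derives a reduction schedule that only ever copies between peer-accessible pairs: a tree reduction inside each group of four, then one step across the connecting pair.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Cube-mesh assumption: full peer access within each group of four,
// and across groups only between GPUs with the same position (i % 4).
bool canAccessPeer(int a, int b) {
  if (a / 4 == b / 4) return a != b;  // same group: full peer access
  return (a % 4) == (b % 4);          // across groups: matching pair only
}

// Returns the ordered (src, dst) reduction steps for a given root GPU:
// a binary-tree reduction within each group, then one cross-group step.
std::vector<std::pair<int, int>> reductionSchedule(int root) {
  std::vector<std::pair<int, int>> steps;
  int base = (root / 4) * 4;       // start of the root's group
  int other = base == 0 ? 4 : 0;   // start of the opposite group
  int peer = other + root % 4;     // root's cross-group partner

  for (int g : {base, other}) {
    int dst = (g == base) ? root : peer;
    // Order the group's devices so the destination sits at position 0.
    std::vector<int> devs{dst};
    for (int i = g; i < g + 4; i++) {
      if (i != dst) devs.push_back(i);
    }
    // Tree reduction: combine pairs at stride 1, then stride 2.
    for (int stride = 1; stride < 4; stride *= 2) {
      for (int pos = 0; pos + stride < 4; pos += 2 * stride) {
        steps.emplace_back(devs[pos + stride], devs[pos]);
      }
    }
  }
  // Final reduction across the pair connecting the two groups.
  steps.emplace_back(peer, root);
  return steps;
}
```

Every step the schedule emits satisfies `canAccessPeer`, which is the constraint the refactored CudaDeviceReduce has to respect on this topology.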
Additionally, it shuffles device pointers to randomize which GPUs run the reduction and which communication links are used. The goal of randomization here is to prevent excessive load on any single GPU or link.
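The randomization can be sketched as a plain shuffle of the device ordering before each reduction (again an illustration under assumed names, not the actual code): whichever device lands first in the shuffled order acts as the reduction root, so over many invocations the load spreads across GPUs and links.

```cpp
#include <algorithm>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical helper: produce a random device ordering so that no
// single GPU (or its links) always serves as the reduction root.
std::vector<int> shuffledDevices(int n, unsigned seed) {
  std::vector<int> order(n);
  std::iota(order.begin(), order.end(), 0);  // 0, 1, ..., n-1
  std::mt19937 rng(seed);
  std::shuffle(order.begin(), order.end(), rng);
  return order;  // order[0] would be used as the reduction root
}
```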
Fixing a few bugs...
Was this problem solved?
@Hiroki11x Making a proper fix for this is still on the back burner. In the meantime, you can compile with NCCL support and everything will work out of the box. Are you specifically looking to use the non-NCCL approach/code?
@zpao has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Thank you for your pull request. We require contributors to sign our Contributor License Agreement, and yours has expired.
Before we can review or merge your code, we need you to email [email protected] with your details so we can update your status.
Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Facebook open source project. Thanks!