gloo
Take CUDA peer access into account for on-device reduction
The NVLink cube mesh architecture has partial peer access between devices: two groups of four GPUs each have full peer access within the group, and every GPU in one group has peer access to exactly one corresponding GPU in the other group. When we reduce across all 8 GPUs, tree reduction using peer access must therefore be done separately within each group, followed by a final reduction across any one of the pairs connecting the two groups. This change refactors CudaDeviceReduce to work with this topology.
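A minimal CPU-only sketch of the schedule described above (this is an illustration of the idea, not Gloo's actual implementation; the `canAccessPeer` and `reductionSchedule` helpers are hypothetical names). It models the cube-mesh peer-access pattern for 8 GPUs, with devices 0-3 in one group and 4-7 in the other, and derives a reduction schedule that only ever copies between peer-accessible pairs: a tree reduction inside each group of four, then one step across the connecting pair.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Cube-mesh assumption: full peer access within each group of four,
// and across groups only between GPUs with the same position (i % 4).
bool canAccessPeer(int a, int b) {
  if (a / 4 == b / 4) return a != b;  // same group: full peer access
  return (a % 4) == (b % 4);          // across groups: matching pair only
}

// Returns the ordered (src, dst) reduction steps for a given root GPU:
// a binary-tree reduction within each group, then one cross-group step.
std::vector<std::pair<int, int>> reductionSchedule(int root) {
  std::vector<std::pair<int, int>> steps;
  int base = (root / 4) * 4;       // start of the root's group
  int other = base == 0 ? 4 : 0;   // start of the opposite group
  int peer = other + root % 4;     // root's cross-group partner

  for (int g : {base, other}) {
    int dst = (g == base) ? root : peer;
    // Order the group's devices so the destination sits at position 0.
    std::vector<int> devs{dst};
    for (int i = g; i < g + 4; i++) {
      if (i != dst) devs.push_back(i);
    }
    // Tree reduction: combine pairs at stride 1, then stride 2.
    for (int stride = 1; stride < 4; stride *= 2) {
      for (int pos = 0; pos + stride < 4; pos += 2 * stride) {
        steps.emplace_back(devs[pos + stride], devs[pos]);
      }
    }
  }
  // Final reduction across the pair connecting the two groups.
  steps.emplace_back(peer, root);
  return steps;
}
```

Every step the schedule emits satisfies `canAccessPeer`, which is the constraint the refactored CudaDeviceReduce has to respect on this topology.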
Additionally, it shuffles device pointers to randomize which GPUs run the reduction and which communication links are used. The goal of randomization here is to prevent excessive load on any single GPU or link.
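The randomization can be sketched as a plain shuffle of the device ordering before each reduction (again an illustration under assumed names, not the actual code): whichever device lands first in the shuffled order acts as the reduction root, so over many invocations the load spreads across GPUs and links.

```cpp
#include <algorithm>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical helper: produce a random device ordering so that no
// single GPU (or its links) always serves as the reduction root.
std::vector<int> shuffledDevices(int n, unsigned seed) {
  std::vector<int> order(n);
  std::iota(order.begin(), order.end(), 0);  // 0, 1, ..., n-1
  std::mt19937 rng(seed);
  std::shuffle(order.begin(), order.end(), rng);
  return order;  // order[0] would be used as the reduction root
}
```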
Fixing a few bugs...
Was this problem solved?
@Hiroki11x Making a proper fix for this is still on the back burner. In the meantime, you can compile with NCCL support and everything will work out of the box. Are you specifically looking to use the non-NCCL approach/code?
@zpao has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Thank you for your pull request. We require contributors to sign our Contributor License Agreement, and yours has expired.
Before we can review or merge your code, we need you to email [email protected] with your details so we can update your status.
Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Facebook open source project. Thanks!