
Take CUDA peer access into account for on-device reduction

Open pietern opened this issue 8 years ago • 6 comments

The NVLink cube mesh architecture has only partial peer access between devices. There are two groups of 4 GPUs with full peer access within each group, and every GPU in one group has peer access to exactly one corresponding GPU in the other group. When we reduce across all 8 GPUs, the tree reduction over peer access must therefore be done separately within those groups, followed by a reduction across any one of the pairs that connect the groups. This change refactors CudaDeviceReduce to work with this topology.
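For reference, a minimal sketch (not gloo's actual code) of how the peer-access groups could be discovered at runtime with the CUDA runtime API. The helper name is illustrative only; the two-phase reduction itself (within-group trees, then one cross-group pair) would run over the groups this returns:

```cpp
// Illustrative sketch: bucket devices into groups with full mutual peer
// access using cudaDeviceCanAccessPeer. On the cube mesh this yields two
// groups of 4. Helper name groupByPeerAccess is hypothetical.
#include <cuda_runtime.h>
#include <vector>

std::vector<std::vector<int>> groupByPeerAccess(const std::vector<int>& devices) {
  std::vector<std::vector<int>> groups;
  for (int d : devices) {
    bool placed = false;
    for (auto& g : groups) {
      // A device joins a group only if it has bidirectional peer access
      // to every device already in that group.
      bool full = true;
      for (int other : g) {
        int ab = 0, ba = 0;
        cudaDeviceCanAccessPeer(&ab, d, other);
        cudaDeviceCanAccessPeer(&ba, other, d);
        if (!ab || !ba) {
          full = false;
          break;
        }
      }
      if (full) {
        g.push_back(d);
        placed = true;
        break;
      }
    }
    if (!placed) {
      groups.push_back({d});
    }
  }
  return groups;
}
```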

Additionally, it shuffles the device pointers to randomize which GPUs run the reduction and which communication links are used. The goal of the randomization is to avoid concentrating load on any single GPU or link.
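A rough illustration of the shuffle, assuming the per-device pointers are held in a simple struct (the names below are hypothetical, not gloo's types):

```cpp
// Illustrative sketch: randomizing the order of per-device pointers so the
// device picked as the reduction root, and the links it uses, vary across
// algorithm instances. DevicePtr and shuffleDevicePointers are hypothetical.
#include <algorithm>
#include <random>
#include <vector>

struct DevicePtr {
  int device;   // CUDA device ordinal
  float* ptr;   // buffer on that device
};

void shuffleDevicePointers(std::vector<DevicePtr>& ptrs) {
  static std::random_device rd;
  static std::mt19937 gen(rd());
  std::shuffle(ptrs.begin(), ptrs.end(), gen);
}
```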

pietern avatar Aug 04 '17 17:08 pietern

Fixing a few bugs...

pietern avatar Aug 04 '17 18:08 pietern

Was this problem solved?

Hiroki11x avatar Sep 22 '17 08:09 Hiroki11x

@Hiroki11x Making a proper fix for this is still on the back burner. In the meantime you can compile with NCCL support and everything will work out of the box. Are you looking specifically to use the non-NCCL code path?

pietern avatar Sep 25 '17 16:09 pietern

@zpao has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

facebook-github-bot avatar Sep 27 '17 21:09 facebook-github-bot

Thank you for your pull request. We require contributors to sign our Contributor License Agreement, and yours has expired.

Before we can review or merge your code, we need you to email [email protected] with your details so we can update your status.

facebook-github-bot avatar Jul 25 '18 21:07 facebook-github-bot

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Facebook open source project. Thanks!

facebook-github-bot avatar May 08 '19 13:05 facebook-github-bot