
Replace third_party/nccl with PyTorch's NCCL backend

Open nv-dlasalle opened this issue 3 years ago • 0 comments

🚀 Feature

Starting in PyTorch 1.11, send and recv are supported by torch.distributed, which largely means we no longer need direct access to NCCL.

To do this, we should change python/dgl/cuda/nccl.py to use torch.distributed, and remove src/runtime/cuda/nccl_api.h and src/runtime/cuda/nccl_api.cu.
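As a rough illustration of what a torch.distributed-based push could look like, here is a minimal sketch. It is not DGL's implementation: the function name `sparse_push_sketch` and the `owner` tensor (the destination rank of each row) are hypothetical stand-ins for the partition logic in python/dgl/cuda/nccl.py, and it uses `torch.distributed.all_to_all_single` rather than explicit send and recv calls. It assumes a process group has already been initialized and, for the NCCL backend, that all tensors live on the GPU.

```python
import torch
import torch.distributed as dist


def sparse_push_sketch(idx, value, owner, group=None):
    """Send each (idx, value) row to the rank listed in `owner`.

    Hypothetical sketch: `owner` is an int64 tensor of destination ranks,
    standing in for whatever the real partition object would compute.
    """
    world_size = dist.get_world_size(group)

    # Reorder rows so that everything bound for one rank is contiguous.
    perm = torch.argsort(owner)
    send_idx = idx[perm].contiguous()
    send_val = value[perm].contiguous()

    # Number of rows this rank sends to every other rank.
    send_counts = torch.bincount(owner, minlength=world_size)

    # Exchange the counts so each rank knows how much it will receive.
    recv_counts = torch.empty_like(send_counts)
    dist.all_to_all_single(recv_counts, send_counts, group=group)

    send_splits = send_counts.tolist()
    recv_splits = recv_counts.tolist()
    total = sum(recv_splits)

    # Exchange the indices and values with variable split sizes.
    recv_idx = idx.new_empty((total,))
    recv_val = value.new_empty((total,) + tuple(value.shape[1:]))
    dist.all_to_all_single(recv_idx, send_idx, recv_splits, send_splits, group=group)
    dist.all_to_all_single(recv_val, send_val, recv_splits, send_splits, group=group)
    return recv_idx, recv_val
```

Because this goes through the default process group, it would reuse the communicator torch.distributed already created instead of a second one owned by DGL.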

Before we remove it, however, we should verify that the performance of sparse_all_to_all_push and sparse_all_to_all_pull does not suffer.
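One way to run that check is a small timing harness that measures the old and new code paths under the same conditions. The sketch below uses only the standard library; the names `time_op` and `regressed`, the warmup and iteration counts, and the 5% tolerance are all assumptions for illustration, not values from DGL.

```python
import statistics
import time


def time_op(fn, sync=None, warmup=3, iters=10):
    """Return the median wall-clock seconds per call of `fn`.

    `sync` is an optional callable run after each call so that
    asynchronous work (e.g. GPU kernels) is included in the timing.
    """
    for _ in range(warmup):
        fn()
    if sync is not None:
        sync()

    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def regressed(baseline_s, candidate_s, tolerance=0.05):
    """True if the candidate is slower than the baseline by more than `tolerance`."""
    return candidate_s > baseline_s * (1.0 + tolerance)
```

For GPU code, passing `sync=torch.cuda.synchronize` would make each sample wait for outstanding kernels; both calls would be timed on every rank with identical inputs, once against the current NCCL path and once against the torch.distributed path.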

Motivation

Removing our own copy of NCCL will simplify the code, since we can use the existing communicator created by torch.distributed. It will also reduce build times and decrease the likelihood of conflicts between two versions of the library (and possibly multiple communicators).

nv-dlasalle · Sep 19 '22 17:09