Replace third_party/nccl with PyTorch's NCCL backend
🚀 Feature
Starting in PyTorch 1.11, send and recv are supported by torch.distributed, which largely means we no longer need direct access to NCCL.
To do this, we should change python/dgl/cuda/nccl.py to use torch.distributed, and remove src/runtime/cuda/nccl_api.h and src/runtime/cuda/nccl_api.cu.
Before we remove it, however, we should verify that the performance of sparse_all_to_all_push and sparse_all_to_all_pull does not degrade.
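As a rough illustration of the direction, here is a minimal sketch of how a sparse push could be rebuilt on top of torch.distributed collectives instead of the in-tree NCCL wrapper. The function name echoes DGL's sparse_all_to_all_push, but the signature, the part_ids argument, and the use of all_to_all_single are assumptions for this sketch, not the actual DGL API:

```python
import torch
import torch.distributed as dist


def sparse_all_to_all_push(idx, value, part_ids, world_size):
    """Push (idx, value) rows to the ranks named in part_ids.

    Sketch only: part_ids[i] is the destination rank of row i.
    This is a hypothetical signature, not DGL's current one.
    """
    # Group rows by destination rank (stable sort keeps relative order).
    order = torch.argsort(part_ids, stable=True)
    idx, value, part_ids = idx[order], value[order], part_ids[order]
    send_counts = torch.bincount(part_ids, minlength=world_size)

    if world_size == 1:
        # Single rank: everything stays local, no communication needed.
        return idx, value

    # Exchange per-rank row counts so each rank can size its buffers.
    recv_counts = torch.empty_like(send_counts)
    dist.all_to_all_single(recv_counts, send_counts)

    # Exchange the actual indices and values, split along dim 0.
    total = int(recv_counts.sum())
    recv_idx = torch.empty(total, dtype=idx.dtype)
    recv_value = torch.empty((total,) + value.shape[1:], dtype=value.dtype)
    dist.all_to_all_single(recv_idx, idx,
                           recv_counts.tolist(), send_counts.tolist())
    dist.all_to_all_single(recv_value, value,
                           recv_counts.tolist(), send_counts.tolist())
    return recv_idx, recv_value
```

Because this reuses the process group that torch.distributed already initialized, no second NCCL communicator is created; on GPU tensors with the NCCL backend, all_to_all_single dispatches to NCCL under the hood.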
Motivation
Removing our own version of NCCL will simplify the code, since we can reuse the existing communicator created by torch.distributed, reduce build times, and decrease the likelihood of conflicts between two versions of the library (and possibly multiple communicators).