
NCCL error when trying to run GraphBolt jobs with >1 trainer per worker


🐛 Bug

I've observed an error when trying to use GraphBolt with `--num-trainers` > 1. In this case I'm using DistGB through GraphStorm, so I'm not sure whether the root cause is in GSF or in GB. It's hard to make sense of the interleaved stack trace, but I'm listing it here:

    Traceback (most recent call last):
      File "/graphstorm/python/graphstorm/run/gsgnn_np/gsgnn_np.py", line 190, in <module>
        work = group.allreduce([tensor], opts)
    torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, invalid usage (run with NCCL_DEBUG=WARN for details), NCCL version 2.18.1
    ncclInvalidUsage: This usually reflects invalid usage of NCCL library.
    Last error:
    Duplicate GPU detected : rank 10 and rank 8 both on CUDA device 160
        super(GSgnnNodeTrainData, self).__init__(graph_name, part_config,
      File "/graphstorm/python/graphstorm/dataloading/dataset.py", line 800, in __init__
    Client[9] in group[0] is exiting...
    Client[8] in group[0] is exiting...
        work = group.allreduce([tensor], opts)
        if dist_sum(len(val_idx)) > 0:
     
      File "/graphstorm/python/graphstorm/dataloading/utils.py", line 80, in dist_sum
    torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, invalid usage (run with NCCL_DEBUG=WARN for details), NCCL version 2.18.1
    ncclInvalidUsage: This usually reflects invalid usage of NCCL library.
    Last error:
    Duplicate GPU detected : rank 23 and rank 16 both on CUDA device 160
    Client[15] in group[0] is exiting...
        dist.all_reduce(size, dist.ReduceOp.SUM)
      File "/opt/gs-venv/lib/python3.9/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
        super(GSgnnNodeData, self).__init__(graph_name, part_config,
      File "/graphstorm/python/graphstorm/dataloading/dataset.py", line 219, in __init__
        return func(*args, **kwargs)
      File "/opt/gs-venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 2050, in all_reduce
        self.prepare_data(self._g)
      File "/graphstorm/python/graphstorm/dataloading/dataset.py", line 959, in prepare_data
    Client[7] in group[0] is exiting...
        main(gs_args)
      File "/graphstorm/python/graphstorm/run/gsgnn_np/gsgnn_np.py", line 70, in main
        main(gs_args)
      File "/graphstorm/python/graphstorm/run/gsgnn_np/gsgnn_np.py", line 70, in main
        train_data = GSgnnNodeTrainData(config.graph_name,
      File "/graphstorm/python/graphstorm/dataloading/dataset.py", line 914, in __init__
        train_data = GSgnnNodeTrainData(config.graph_name,
      File "/graphstorm/python/graphstorm/dataloading/dataset.py", line 914, in __init__
        super(GSgnnNodeTrainData, self).__init__(graph_name, part_config,
      File "/graphstorm/python/graphstorm/dataloading/dataset.py", line 800, in __init__
        super(GSgnnNodeTrainData, self).__init__(graph_name, part_config,
      File "/graphstorm/python/graphstorm/dataloading/dataset.py", line 800, in __init__
        main(gs_args)
        super(GSgnnNodeData, self).__init__(graph_name, part_config,
      File "/graphstorm/python/graphstorm/dataloading/dataset.py", line 219, in __init__
     
      File "/graphstorm/python/graphstorm/run/gsgnn_np/gsgnn_np.py", line 70, in main
        super(GSgnnNodeData, self).__init__(graph_name, part_config,
      File "/graphstorm/python/graphstorm/dataloading/dataset.py", line 219, in __init__
    Client[30] in group[0] is exiting...
        if dist_sum(len(val_idx)) > 0:
      File "/graphstorm/python/graphstorm/dataloading/utils.py", line 80, in dist_sum
        self.prepare_data(self._g)
      File "/graphstorm/python/graphstorm/dataloading/dataset.py", line 959, in prepare_data
        train_data = GSgnnNodeTrainData(config.graph_name,
      File "/graphstorm/python/graphstorm/dataloading/dataset.py", line 914, in __init__
        self.prepare_data(self._g)
      File "/graphstorm/python/graphstorm/dataloading/dataset.py", line 959, in prepare_data
        dist.all_reduce(size, dist.ReduceOp.SUM)
      File "/opt/gs-venv/lib/python3.9/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper                                                                                                                                                                                                                                                              work = group.allreduce([tensor], opts)
    torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, invalid usage (run with NCCL_DEBUG=WARN for details), NCCL version 2.18.1
    ncclInvalidUsage: This usually reflects invalid usage of NCCL library.
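
For context, NCCL reports `Duplicate GPU detected` when two ranks in the same process group build communicators on the same CUDA device, which is what the trace shows (e.g., ranks 10 and 8 both landing on device 160). The sketch below is purely illustrative of the per-rank device pinning that normally prevents this, assuming a `torchrun`-style launcher that exports `LOCAL_RANK`; it is not GraphStorm's or DGL's actual initialization code.

    import os
    import torch
    import torch.distributed as dist

    def init_nccl_per_rank():
        """Pin this process to its own GPU before creating the NCCL group.

        If every trainer on a machine ends up on the same device instead,
        NCCL fails with 'Duplicate GPU detected', as in the trace above.
        """
        # Assumes the launcher exports LOCAL_RANK (torchrun does);
        # other launchers may use a different variable.
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)
        dist.init_process_group(backend="nccl")
        return torch.device("cuda", local_rank)

    if __name__ == "__main__":
        device = init_nccl_per_rank()
        # Same kind of collective as dist_sum() in the trace above.
        t = torch.ones(1, device=device)
        dist.all_reduce(t, op=dist.ReduceOp.SUM)
        print(f"rank {dist.get_rank()}: sum={t.item()}")

Running this with `torchrun --nproc_per_node=<num_gpus> <script.py>` exercises several ranks on one machine without any DGL involvement.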

To Reproduce

Steps to reproduce the behavior:

Expected behavior

Environment

  • DGL Version (e.g., 1.0): 2.2.1
  • Backend Library & Version (e.g., PyTorch 0.4.1, MXNet/Gluon 1.3): PyTorch 2.1
  • OS (e.g., Linux): Linux
  • How you installed DGL (conda, pip, source): pip
  • Build command you used (if compiling from source):
  • Python version: 3.9
  • CUDA/cuDNN version (if applicable): 12.1
  • GPU models and configuration (e.g. V100):
  • Any other relevant information:

Additional context

thvasilo avatar May 23 '24 21:05 thvasilo

@thvasilo Does NCCL + num_trainers > 1 work well for DistDGL (without GraphBolt)? I think this is not related to DistGB; it seems to be incomplete NCCL support.
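
As a rough way to check that (a hedged sketch only: the graph name, file paths, and the `LOCAL_RANK` variable are placeholders, and the usual DistDGL server/launcher setup is assumed to be running), one could start a plain DistDGL client with the NCCL backend and more than one trainer per machine, with GraphBolt left out entirely, and see whether the same `Duplicate GPU detected` error shows up:

    import os
    import torch
    import torch.distributed as dist
    import dgl

    # Placeholder paths/names; replace with your partitioned graph.
    IP_CONFIG = "ip_config.txt"
    PART_CONFIG = "data/mygraph.json"
    GRAPH_NAME = "mygraph"

    def main():
        # Plain DistDGL initialization, no dgl.graphbolt involved.
        dgl.distributed.initialize(IP_CONFIG)

        # Pin each trainer to its own GPU before creating the NCCL group;
        # the variable your launcher sets for this may differ.
        local_rank = int(os.environ.get("LOCAL_RANK", 0))
        torch.cuda.set_device(local_rank)
        dist.init_process_group(backend="nccl")

        g = dgl.distributed.DistGraph(GRAPH_NAME, part_config=PART_CONFIG)

        # The same collective that fails in the trace above.
        t = torch.ones(1, device=f"cuda:{local_rank}")
        dist.all_reduce(t, op=dist.ReduceOp.SUM)
        print(f"rank {dist.get_rank()}: num_nodes={g.num_nodes()}, sum={t.item()}")

    if __name__ == "__main__":
        main()

If this already fails with more than one trainer per machine, the problem is in the base NCCL/DistDGL path rather than in GraphBolt.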

Rhett-Ying avatar May 27 '24 01:05 Rhett-Ying

This issue has been automatically marked as stale due to lack of activity. It will be closed if no further activity occurs. Thank you

github-actions[bot] avatar Jun 26 '24 01:06 github-actions[bot]

This is a confirmed issue. The workaround is to always set num_trainers to 1.

jermainewang avatar Jun 27 '24 02:06 jermainewang