NCCL error when trying to run GraphBolt jobs with >1 trainer per worker
## 🐛 Bug
I've observed an error when trying to use GraphBolt with `--num-trainers` > 1. In this case I'm using DistGB through GraphStorm, so I'm not sure whether the root cause is in GraphStorm (GSF) or in GraphBolt. The stack traces from the different trainer processes are interleaved and hard to read, but I'm listing them here:
```
Traceback (most recent call last):
File "/graphstorm/python/graphstorm/run/gsgnn_np/gsgnn_np.py", line 190, in <module>
work = group.allreduce([tensor], opts)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, invalid usage (run with NCCL_DEBUG=WARN for details), NCCL version 2.18.1
ncclInvalidUsage: This usually reflects invalid usage of NCCL library.
Last error:
Duplicate GPU detected : rank 10 and rank 8 both on CUDA device 160
super(GSgnnNodeTrainData, self).__init__(graph_name, part_config,
File "/graphstorm/python/graphstorm/dataloading/dataset.py", line 800, in __init__
Client[9] in group[0] is exiting...
Client[8] in group[0] is exiting...
work = group.allreduce([tensor], opts)
if dist_sum(len(val_idx)) > 0:
File "/graphstorm/python/graphstorm/dataloading/utils.py", line 80, in dist_sum
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, invalid usage (run with NCCL_DEBUG=WARN for details), NCCL version 2.18.1
ncclInvalidUsage: This usually reflects invalid usage of NCCL library.
Last error:
Duplicate GPU detected : rank 23 and rank 16 both on CUDA device 160
Client[15] in group[0] is exiting...
dist.all_reduce(size, dist.ReduceOp.SUM)
File "/opt/gs-venv/lib/python3.9/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
super(GSgnnNodeData, self).__init__(graph_name, part_config,
File "/graphstorm/python/graphstorm/dataloading/dataset.py", line 219, in __init__
return func(*args, **kwargs)
File "/opt/gs-venv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 2050, in all_reduce
self.prepare_data(self._g)
File "/graphstorm/python/graphstorm/dataloading/dataset.py", line 959, in prepare_data
Client[7] in group[0] is exiting...
main(gs_args)
File "/graphstorm/python/graphstorm/run/gsgnn_np/gsgnn_np.py", line 70, in main
main(gs_args)
File "/graphstorm/python/graphstorm/run/gsgnn_np/gsgnn_np.py", line 70, in main
train_data = GSgnnNodeTrainData(config.graph_name,
File "/graphstorm/python/graphstorm/dataloading/dataset.py", line 914, in __init__
train_data = GSgnnNodeTrainData(config.graph_name,
File "/graphstorm/python/graphstorm/dataloading/dataset.py", line 914, in __init__
super(GSgnnNodeTrainData, self).__init__(graph_name, part_config,
File "/graphstorm/python/graphstorm/dataloading/dataset.py", line 800, in __init__
super(GSgnnNodeTrainData, self).__init__(graph_name, part_config,
File "/graphstorm/python/graphstorm/dataloading/dataset.py", line 800, in __init__
main(gs_args)
super(GSgnnNodeData, self).__init__(graph_name, part_config,
File "/graphstorm/python/graphstorm/dataloading/dataset.py", line 219, in __init__
File "/graphstorm/python/graphstorm/run/gsgnn_np/gsgnn_np.py", line 70, in main
super(GSgnnNodeData, self).__init__(graph_name, part_config,
File "/graphstorm/python/graphstorm/dataloading/dataset.py", line 219, in __init__
Client[30] in group[0] is exiting...
if dist_sum(len(val_idx)) > 0:
File "/graphstorm/python/graphstorm/dataloading/utils.py", line 80, in dist_sum
self.prepare_data(self._g)
File "/graphstorm/python/graphstorm/dataloading/dataset.py", line 959, in prepare_data
train_data = GSgnnNodeTrainData(config.graph_name,
File "/graphstorm/python/graphstorm/dataloading/dataset.py", line 914, in __init__
self.prepare_data(self._g)
File "/graphstorm/python/graphstorm/dataloading/dataset.py", line 959, in prepare_data
dist.all_reduce(size, dist.ReduceOp.SUM)
File "/opt/gs-venv/lib/python3.9/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
work = group.allreduce([tensor], opts)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1333, invalid usage (run with NCCL_DEBUG=WARN for details), NCCL version 2.18.1
ncclInvalidUsage: This usually reflects invalid usage of NCCL library.
```
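For what it's worth, the core NCCL complaint ("Duplicate GPU detected") is what you get whenever two ranks of the same communicator end up on the same CUDA device. Below is a minimal, standalone sketch that produces the same class of error outside of GraphStorm/DGL; it is purely illustrative and not my actual launch:

```python
# Minimal illustration of the "Duplicate GPU detected" failure mode
# (hypothetical repro, not the actual GraphStorm launch): two processes on the
# same host join one NCCL process group but both run a collective on tensors
# that live on the same CUDA device, which NCCL rejects.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def run(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)

    # Both ranks intentionally use cuda:0; with the NCCL backend this fails
    # with "Duplicate GPU detected : rank 0 and rank 1 both on CUDA device ..."
    t = torch.ones(1, device="cuda:0")
    dist.all_reduce(t, op=dist.ReduceOp.SUM)

    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(run, args=(2,), nprocs=2)
```

In the GraphStorm run above the device assignment is handled by the launcher, so I only include this to show what the error itself means.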
## To Reproduce
Steps to reproduce the behavior:
## Expected behavior
## Environment
- DGL Version (e.g., 1.0): 2.2.1
- Backend Library & Version (e.g., PyTorch 0.4.1, MXNet/Gluon 1.3): PyTorch 2.1
- OS (e.g., Linux): Linux
- How you installed DGL (conda, pip, source): pip
- Build command you used (if compiling from source):
- Python version: 3.9
- CUDA/cuDNN version (if applicable): 12.1
- GPU models and configuration (e.g. V100):
- Any other relevant information:
## Additional context
@thvasilo Does NCCL + num_trainers>1 work well for DistDGL (without GraphBolt)? I don't think this is related to DistGB; it looks like incomplete NCCL support.
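For context, a multi-trainer NCCL setup normally pins each node-local trainer to its own GPU before any collective runs; when several trainers on one machine end up on the same device, NCCL fails with exactly the "Duplicate GPU detected" error shown above. A rough sketch of that pinning step (the LOCAL_RANK variable and helper name here are illustrative assumptions, not GraphStorm or DistDGL internals):

```python
# Sketch of per-trainer device pinning that NCCL expects (illustrative only;
# LOCAL_RANK and the helper name are assumptions, not GraphStorm/DistDGL code).
import os
import torch
import torch.distributed as dist

def init_trainer_process_group():
    # Each trainer process on a machine must own a distinct GPU before any
    # NCCL collective runs, otherwise ranks collide on the same device.
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    torch.cuda.set_device(local_rank % torch.cuda.device_count())
    # Assumes RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT are set by the launcher.
    dist.init_process_group(backend="nccl")
    return local_rank
```

The modulo is only there so the sketch doesn't crash when there are more trainers than GPUs; the intended configuration is one trainer per GPU.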
This is a confirmed issue. The workaround is to always set num_trainers to 1.