DeepSpeed
Tests should fail indicating actual number of GPUs is below desired world_size
I faced this cryptic error while running tests on a device with a single GPU.
DeepSpeed: master
PyTorch: 1.12.1
NCCL: 2.10.3
Current Behavior
Steps to reproduce:
pytest tests/unit/checkpoint/test_moe_checkpoint.py -k 'test_checkpoint_moe_and_zero'
Observe:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/azzhipa/workspace/m5/DeepSpeed/tests/unit/common.py", line 165, in _dist_init
    dist.barrier()
  File "/home/azzhipa/workspace/m5/DeepSpeed/deepspeed/comm/comm.py", line 127, in log_wrapper
    return func(*args, **kwargs)
  File "/home/azzhipa/workspace/m5/DeepSpeed/deepspeed/comm/comm.py", line 469, in barrier
    return cdb.barrier(group=group, async_op=async_op, device_ids=device_ids)
  File "/home/azzhipa/workspace/m5/DeepSpeed/deepspeed/comm/torch.py", line 160, in barrier
    return torch.distributed.barrier(group=group,
  File "/opt/conda/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 2786, in barrier
    work = group.barrier(opts=opts)
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, invalid usage, NCCL version 2.10.3
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).
(The same traceback is printed by each spawned worker process.)
It's hard to figure out from this output why the tests fail; I was only able to fix it after finding https://github.com/microsoft/DeepSpeed/issues/2482.
In this particular case there is no indication of what is wrong, even though at runtime both the desired world_size and the actual number of devices are known.
Expected Behavior
The test should fail with a message saying something like num_gpus < world_size.
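For illustration, here is a minimal sketch of the kind of guard the test harness could run before spawning worker processes (for example near the top of _dist_init in tests/unit/common.py). The helper name require_world_size and its placement are assumptions for this example, not DeepSpeed's actual API:

    # Hypothetical guard: compare available GPUs against the requested world_size
    # and fail with a readable message instead of a cryptic NCCL error.
    import pytest
    import torch

    def require_world_size(world_size: int) -> None:
        num_gpus = torch.cuda.device_count()
        if num_gpus < world_size:
            # pytest.skip(...) would also be an option if under-provisioned
            # machines should skip rather than fail these tests.
            pytest.fail(
                f"Requested world_size={world_size}, but only {num_gpus} GPU(s) "
                f"are visible (num_gpus < world_size)."
            )

Calling such a check before dist.barrier() would surface the mismatch directly instead of the ncclInvalidUsage error above.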