
Tests should fail indicating actual number of GPUs is below desired world_size


I faced this cryptic error while running tests on a device with a single GPU.

DeepSpeed: master
PyTorch: 1.12.1
NCCL: 2.10.3

Current Behavior

Steps to reproduce:

  1. pytest tests/unit/checkpoint/test_moe_checkpoint.py -k 'test_checkpoint_moe_and_zero'
  2. Observe:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/azzhipa/workspace/m5/DeepSpeed/tests/unit/common.py", line 165, in _dist_init
    dist.barrier()
  File "/home/azzhipa/workspace/m5/DeepSpeed/tests/unit/common.py", line 165, in _dist_init
    dist.barrier()
  File "/home/azzhipa/workspace/m5/DeepSpeed/deepspeed/comm/comm.py", line 127, in log_wrapper
    return func(*args, **kwargs)
  File "/home/azzhipa/workspace/m5/DeepSpeed/deepspeed/comm/comm.py", line 127, in log_wrapper
    return func(*args, **kwargs)
  File "/home/azzhipa/workspace/m5/DeepSpeed/deepspeed/comm/comm.py", line 469, in barrier
    return cdb.barrier(group=group, async_op=async_op, device_ids=device_ids)
  File "/home/azzhipa/workspace/m5/DeepSpeed/deepspeed/comm/comm.py", line 469, in barrier
    return cdb.barrier(group=group, async_op=async_op, device_ids=device_ids)
  File "/home/azzhipa/workspace/m5/DeepSpeed/deepspeed/comm/torch.py", line 160, in barrier
    return torch.distributed.barrier(group=group,
  File "/home/azzhipa/workspace/m5/DeepSpeed/deepspeed/comm/torch.py", line 160, in barrier
    return torch.distributed.barrier(group=group,
  File "/opt/conda/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 2786, in barrier
    work = group.barrier(opts=opts)
  File "/opt/conda/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 2786, in barrier
    work = group.barrier(opts=opts)
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, invalid usage, NCCL version 2.10.3
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).

It's hard to figure out why the tests failed; I was only able to fix it after I saw https://github.com/microsoft/DeepSpeed/issues/2482

In this particular case there is no indication of what's wrong, even though at runtime we know both the desired world_size and the actual number of devices.

Expected Behavior

The test should fail with a clear message along the lines of num_gpus < world_size.
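For illustration, a minimal sketch of the kind of guard that could produce such a message; the helper name check_world_size and the use of pytest.fail are assumptions for this sketch, not DeepSpeed's actual test API:

# Hypothetical guard, not DeepSpeed's actual implementation: compare the
# requested world_size against the visible GPU count before any collective runs.
import pytest
import torch

def check_world_size(world_size: int) -> None:
    num_gpus = torch.cuda.device_count()
    if num_gpus < world_size:
        pytest.fail(f"num_gpus ({num_gpus}) < world_size ({world_size}); "
                    f"this test requires {world_size} GPUs")

Calling something like this at the start of the test's distributed setup (for example before dist.barrier() in _dist_init) would turn the NCCL "invalid usage" crash into an immediate, self-explanatory failure.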

clumsy · Jan 20, 2023