DeepSpeed
Fix broadcast error on multi-node training with ZeroStage3 and TensorParallel=2
Hi, while enabling TensorParallel=2 and ZeRO Stage 3 for multi-node Megatron-DeepSpeed training, I encountered an error on this bcast:

RuntimeError: Global rank 0 is not part of group <torch.distributed.ProcessGroupCCL object at 0x14e99ef52c30>
raise RuntimeError(f"Global rank {global_rank} is not part of group {group}")

With TensorParallel=2, the ds_process_group instances are [0, 2, 4, ...] and [1, 3, 5, ...], and the second group does not contain global rank 0. So I believe this bcast should use local rank 0 inside the current self.ds_process_group.
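As a minimal sketch of the pattern proposed here (illustrative names, not the exact DeepSpeed diff): resolve the group-local rank 0 of the ZeRO process group to its global rank before broadcasting, so the broadcast source is always a member of the group.

```python
import torch.distributed as dist

def broadcast_from_group_rank0(param, group):
    # With TP=2 the ZeRO data-parallel group can be e.g. global ranks [1, 3, 5, ...],
    # so passing src=0 (a global rank outside the group) raises
    # "Global rank 0 is not part of group ...".
    # Map group-local rank 0 to its global rank first.
    # (torch.distributed.get_global_rank exists in recent PyTorch; older
    # versions expose distributed_c10d._get_global_rank instead.)
    src_global_rank = dist.get_global_rank(group, 0)
    dist.broadcast(param.data, src=src_global_rank, group=group)
```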
@tjruwase Hi, we found a bug in DeepSpeed: when enabling tensor parallel = 2 for Megatron-DeepSpeed 20B on 4 nodes, we hit the error below:

RuntimeError: Global rank 0 is not part of group <torch.distributed.ProcessGroupCCL object at 0x14e99ef52c30>
raise RuntimeError(f"Global rank {global_rank} is not part of group {group}")

In this case the ds_process_group instances are [0, 2, 4, ...] and [1, 3, 5, ...], so one of them does not contain global rank 0. We believe this bcast should use local rank 0 inside the current self.ds_process_group rather than global rank 0. After this fix, parameters are broadcast within each ds_process_group from local rank 0 to the remaining ranks.
Do you have any comments on this PR?
@YizhouZ, thanks for this PR. Apologies for the delay as we resolve some CI issues. We plan to merge soon.
@tjruwase Thanks!
@YizhouZ, do you know why this is not a problem for zero stage 1 or 2?
Hi @tjruwase, only stage 3 triggers this post_init_method; the other stages do not reach this code path, as far as my test results show.
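For readers following along, a rough sketch of why only stage 3 reaches this path (my understanding, not the exact DeepSpeed internals): the broadcast lives in _post_init_method, which only runs when model construction happens under the ZeRO-3 zero.Init context.

```python
import torch
import deepspeed

# Run under a distributed launcher (e.g. deepspeed or torchrun).
# zero.Init patches nn.Module so that each submodule's constructor is followed
# by _post_init_method(), which partitions the new parameters and broadcasts
# them across the ZeRO process group. Stages 1/2 never enter this context, so
# the problematic broadcast is never executed there.
with deepspeed.zero.Init():
    model = torch.nn.Linear(1024, 1024)  # parameters partitioned at construction
```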
@YizhouZ, thanks for confirmation. That makes sense since TP>1 is not very well tested with ZeRO stage 3. This certainly shows a gap in our unit tests.
I have one request: could you please add a TODO here for integrated testing of TP and ZeRO 1/2/3? Thanks!
Added.
@tjruwase Could you please help me trigger the CI? My CLA was reviewed and passed today. Thank you!
@tjruwase It seems like the post_init causes an issue with stage 1 as well (after some tests).
@abhilash1910, I don't think that is possible since this code path is only for zero stage 3. Can you please share more details of what you are seeing, such as a stack trace?
I think so too; this should only happen in stage 3. However, I do sometimes see a hang in stage 1 (not the same trace or crash, maybe a separate issue). I will revalidate and let you know, @tjruwase.
@tjruwase Fixed the failing CI case. Please help check it. Thank you!
Hi @tjruwase, it seems the current CI failure is not triggered by my changes: the previous check passed but the latest one failed, and the only difference between them is the README file. The error message also does not look related to my changes:
______________________ TestPipeCifar10.test[topo_config0] ______________________
[gw3] linux -- Python 3.8.8 /tmp/actions-runner/_work/DeepSpeed/DeepSpeed/unit-test-venv/bin/python
Worker 0 hung.
----------------------------- Captured stdout call -----------------------------
[2023-05-08 20:11:57,287] [INFO] [comm.py:622:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
----------------------------- Captured stderr call -----------------------------
[W ProcessGroupNCCL.cpp:1569] Rank 0 using best-guess GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[W ProcessGroupNCCL.cpp:1569] Rank 3 using best-guess GPU 3 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[W ProcessGroupNCCL.cpp:1569] Rank 1 using best-guess GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[W ProcessGroupNCCL.cpp:1569] Rank 2 using best-guess GPU 2 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
Process Process-4:
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/tmp/actions-runner/_work/DeepSpeed/DeepSpeed/tests/unit/common.py", line 195, in _dist_init
dist.barrier()
File "/tmp/actions-runner/_work/DeepSpeed/DeepSpeed/unit-test-venv/lib/python3.8/site-packages/deepspeed/comm/comm.py", line 120, in log_wrapper
return func(*args, **kwargs)
File "/tmp/actions-runner/_work/DeepSpeed/DeepSpeed/unit-test-venv/lib/python3.8/site-packages/deepspeed/comm/comm.py", line 395, in barrier
return cdb.barrier(group=group, async_op=async_op, device_ids=device_ids)
File "/tmp/actions-runner/_work/DeepSpeed/DeepSpeed/unit-test-venv/lib/python3.8/site-packages/deepspeed/comm/torch.py", line 214, in barrier
return torch.distributed.barrier(group=group, async_op=async_op, device_ids=device_ids)
File "/tmp/actions-runner/_work/DeepSpeed/DeepSpeed/unit-test-venv/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2526, in barrier
work = group.barrier(opts=opts)
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Could you please check it? Thank you!
------update------ Tried it on a CUDA device and cannot reproduce the errors; all 3 failed tests passed.
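As an aside on the "best-guess GPU" warning in that log (a general workaround for the test harness, not something this PR changes): pinning each process to its local GPU before the first collective, and passing device_ids to barrier(), makes the rank-to-device mapping explicit and avoids this class of hang.

```python
import os
import torch
import torch.distributed as dist

# Pin this process to its local GPU before any NCCL collective so the
# rank-to-device mapping is explicit rather than "best-guess".
local_rank = int(os.environ.get("LOCAL_RANK", "0"))
torch.cuda.set_device(local_rank)

dist.init_process_group(backend="nccl")

# Passing device_ids forces the barrier to use the intended device.
dist.barrier(device_ids=[local_rank])
```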
@YizhouZ, apologies for the merging delay. I am confident that the CI issues are not due to your PR but due to infrastructure problems. I will ensure this PR is merged, so no need to worry about it. Sorry once again for the delay; we really appreciate your contribution.
Hello, I have another question. In partition_parameters.py, the function apply_with_gather() contains similar code, dist.broadcast(param.data, 0, group=param.ds_process_group). Is this okay as-is?