DeepSpeed
Fix broadcast error on multi-node training with ZeroStage3 and TensorParallel=2
Hi, while enabling TensorParallel=2 and ZeRO Stage 3 for multi-node Megatron-DeepSpeed training, I encountered an error on this bcast:

RuntimeError: Global rank 0 is not part of group <torch.distributed.ProcessGroupCCL object at 0x14e99ef52c30>
raise RuntimeError(f"Global rank {global_rank} is not part of group {group}")

With TensorParallel=2, the ds_process_group instances are [0, 2, 4, ...] and [1, 3, 5, ...], and the second group does not contain global rank 0. So I believe this bcast should use local rank 0 inside the current self.ds_process_group.
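As a minimal sketch of the pattern proposed here (illustrative names, not the exact DeepSpeed diff): resolve the group-local rank 0 of the ZeRO process group to its global rank before broadcasting, so the broadcast source is always a member of the group.

```python
import torch.distributed as dist

def broadcast_from_group_rank0(param, group):
    # With TP=2 the ZeRO data-parallel group can be e.g. global ranks [1, 3, 5, ...],
    # so passing src=0 (a global rank outside the group) raises
    # "Global rank 0 is not part of group ...".
    # Map group-local rank 0 to its global rank first.
    # (torch.distributed.get_global_rank exists in recent PyTorch; older
    # versions expose distributed_c10d._get_global_rank instead.)
    src_global_rank = dist.get_global_rank(group, 0)
    dist.broadcast(param.data, src=src_global_rank, group=group)
```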
@tjruwase Hi, we found a bug in DeepSpeed: when enabling tensor parallel = 2 for Megatron-DeepSpeed 20B on 4 nodes, we hit the error below:

RuntimeError: Global rank 0 is not part of group <torch.distributed.ProcessGroupCCL object at 0x14e99ef52c30>
raise RuntimeError(f"Global rank {global_rank} is not part of group {group}")

In this case the ds_process_group instances are [0, 2, 4, ...] and [1, 3, 5, ...], so one of them does not contain global rank 0. We believe this bcast should use local rank 0 inside the current self.ds_process_group rather than global rank 0. After this fix, parameters are broadcast within each ds_process_group from local rank 0 to the remaining ranks.
Do you have any comments on this PR?
@YizhouZ, thanks for this PR. Apologies for the delay as we resolve some CI issues. We plan to merge soon.
@tjruwase Thanks!
@YizhouZ, do you know why this is not a problem for zero stage 1 or 2?
Hi @tjruwase, only stage 3 triggers this post_init_method; the other stages do not reach this code path, as far as my test results show.
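For readers following along, a rough sketch of why only stage 3 reaches this path (my understanding, not the exact DeepSpeed internals): the broadcast lives in _post_init_method, which only runs when model construction happens under the ZeRO-3 zero.Init context.

```python
import torch
import deepspeed

# Run under a distributed launcher (e.g. deepspeed or torchrun).
# zero.Init patches nn.Module so that each submodule's constructor is followed
# by _post_init_method(), which partitions the new parameters and broadcasts
# them across the ZeRO process group. Stages 1/2 never enter this context, so
# the problematic broadcast is never executed there.
with deepspeed.zero.Init():
    model = torch.nn.Linear(1024, 1024)  # parameters partitioned at construction
```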
@YizhouZ, thanks for confirmation. That makes sense since TP>1 is not very well tested with ZeRO stage 3. This certainly shows a gap in our unit tests.
I have one request: could you please add a TODO here for integrated testing of TP and ZeRO 1/2/3? Thanks!
Added.
@tjruwase Could you please help me trigger the CI? My CLA was reviewed and passed today. Thank you!
@tjruwase It seems like the post_init causes an issue with stage 1 as well (after some tests).
@abhilash1910, I don't think that is possible since this code path is only for zero stage 3. Can you please share more details of what you are seeing, such as a stack trace?
I think so too; this should only happen in stage 3. However, I do sometimes see a hang in stage 1 (not the same trace or crash, maybe a separate issue). I will revalidate and let you know, @tjruwase.
@tjruwase Fixed the failing CI case. Please help check it. Thank you!
Hi @tjruwase, it seems the current CI failure is not triggered by my changes: the previous check passed but the latest one failed, and the only difference between them is the README file. The error message also does not look related to my changes:
______________________ TestPipeCifar10.test[topo_config0] ______________________
[gw3] linux -- Python 3.8.8 /tmp/actions-runner/_work/DeepSpeed/DeepSpeed/unit-test-venv/bin/python
Worker 0 hung.
----------------------------- Captured stdout call -----------------------------
[2023-05-08 20:11:57,287] [INFO] [comm.py:622:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
----------------------------- Captured stderr call -----------------------------
[W ProcessGroupNCCL.cpp:1569] Rank 0 using best-guess GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[W ProcessGroupNCCL.cpp:1569] Rank 3 using best-guess GPU 3 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[W ProcessGroupNCCL.cpp:1569] Rank 1 using best-guess GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[W ProcessGroupNCCL.cpp:1569] Rank 2 using best-guess GPU 2 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
Process Process-4:
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/tmp/actions-runner/_work/DeepSpeed/DeepSpeed/tests/unit/common.py", line 195, in _dist_init
dist.barrier()
File "/tmp/actions-runner/_work/DeepSpeed/DeepSpeed/unit-test-venv/lib/python3.8/site-packages/deepspeed/comm/comm.py", line 120, in log_wrapper
return func(*args, **kwargs)
File "/tmp/actions-runner/_work/DeepSpeed/DeepSpeed/unit-test-venv/lib/python3.8/site-packages/deepspeed/comm/comm.py", line 395, in barrier
return cdb.barrier(group=group, async_op=async_op, device_ids=device_ids)
File "/tmp/actions-runner/_work/DeepSpeed/DeepSpeed/unit-test-venv/lib/python3.8/site-packages/deepspeed/comm/torch.py", line 214, in barrier
return torch.distributed.barrier(group=group, async_op=async_op, device_ids=device_ids)
File "/tmp/actions-runner/_work/DeepSpeed/DeepSpeed/unit-test-venv/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2526, in barrier
work = group.barrier(opts=opts)
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Could you please check it? Thank you!
------update------ Tried it on a CUDA device and cannot reproduce the errors; all 3 failed tests passed.
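As an aside on the "best-guess GPU" warning in that log (a general workaround for the test harness, not something this PR changes): pinning each process to its local GPU before the first collective, and passing device_ids to barrier(), makes the rank-to-device mapping explicit and avoids this class of hang.

```python
import os
import torch
import torch.distributed as dist

# Pin this process to its local GPU before any NCCL collective so the
# rank-to-device mapping is explicit rather than "best-guess".
local_rank = int(os.environ.get("LOCAL_RANK", "0"))
torch.cuda.set_device(local_rank)

dist.init_process_group(backend="nccl")

# Passing device_ids forces the barrier to use the intended device.
dist.barrier(device_ids=[local_rank])
```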
@YizhouZ, apologies for the merging delay. I am confident that the CI issues are not due to your PR but due to infrastructure problems. I will ensure this PR is merged, so no need to worry about it. Sorry once again for the delay; we really appreciate your contribution.
Hello, I have another question. In partition_parameters.py, the function apply_with_gather() contains similar code, dist.broadcast(param.data, 0, group=param.ds_process_group). Is this okay as-is?