DeepSpeed
[BUG] DeepSpeed training with Stage 3 hangs and then fails
Describe the bug
DeepSpeed training with ZeRO Stage 3 hangs randomly, with zero GPU utilization on certain workers, and then later fails with a runtime error:
  File "/usr/local/lib/python3.8/dist-packages/accelerate/data_loader.py", line 369, in __iter__
    synchronize_rng_states(self.rng_types, self.synchronized_generator)
  File "/usr/local/lib/python3.8/dist-packages/accelerate/utils/random.py", line 89, in synchronize_rng_states
    synchronize_rng_state(RNGType(rng_type), generator=generator)
  File "/usr/local/lib/python3.8/dist-packages/accelerate/utils/random.py", line 84, in synchronize_rng_state
    generator.set_state(rng_state)
RuntimeError: Invalid mt19937 state
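
For context: accelerate's synchronize_rng_states broadcasts the process-0 torch.Generator state to every rank and replays it with generator.set_state, and "Invalid mt19937 state" means the byte tensor a rank received is not a valid Mersenne Twister engine state (for example because a desynchronized collective delivered a garbage buffer). A minimal standalone sketch, purely illustrative and not taken from the training job, that triggers the same RuntimeError by handing set_state a corrupted state:

import torch

gen = torch.Generator()                    # CPU generator, backed by MT19937
valid_state = gen.get_state()              # uint8 tensor holding the engine state
corrupted = torch.zeros_like(valid_state)  # right size and dtype, but not a valid engine state
gen.set_state(corrupted)                   # raises RuntimeError: Invalid mt19937 state
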
To Reproduce
Steps to reproduce the behavior:
Train a Hugging Face GPT-NeoX model (https://huggingface.co/docs/transformers/model_doc/gpt_neox) on 2 P4 nodes with 8 A100 GPUs each. A hypothetical sketch of that kind of setup is shown below.
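
For reference, here is a minimal, hypothetical sketch of such a setup (accelerate wrapping a Hugging Face GPT-NeoX-family causal LM with DeepSpeed ZeRO Stage 3). The checkpoint name, dummy data, and hyperparameters are placeholders rather than the reporter's actual values, and the script has to be launched on all ranks (e.g. with accelerate launch, or via Ray as discussed later in this thread):

import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator, DeepSpeedPlugin
from transformers import AutoModelForCausalLM

# Build a default ZeRO Stage 3 config through accelerate's DeepSpeed plugin.
ds_plugin = DeepSpeedPlugin(zero_stage=3, gradient_accumulation_steps=1)
accelerator = Accelerator(deepspeed_plugin=ds_plugin)

# Small GPT-NeoX-architecture checkpoint used as a stand-in for the real model.
model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-70m")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Dummy token ids stand in for the real training data.
dataset = TensorDataset(torch.randint(0, 1000, (64, 128)))
dataloader = DataLoader(dataset, batch_size=1, shuffle=True)

# prepare() wraps the dataloader; its __iter__ is where the
# synchronize_rng_states() call from the traceback above runs each epoch.
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

model.train()
for (batch,) in dataloader:
    loss = model(input_ids=batch, labels=batch).loss
    accelerator.backward(loss)
    optimizer.step()
    optimizer.zero_grad()
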
Expected behavior
Training runs to completion without hanging or crashing.
ds_report output:
op name ................ installed .. compatible
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/usr/local/lib/python3.8/dist-packages/torch']
torch version .................... 1.13.1+cu117
deepspeed install path ........... ['/usr/local/lib/python3.8/dist-packages/deepspeed']
deepspeed info ................... 0.9.2, unknown, unknown
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.7
deepspeed wheel compiled w. ...... torch 1.13, cuda 11.7
System info (please complete the following information):
- OS: [e.g. Ubuntu 18.04]
- GPU count and types: two machines (P4 nodes) with 8x A100 each
- Interconnects (if applicable) [e.g., two machines connected with 100 Gbps IB]
- Python version: 3.8
- Any other relevant info about your setup
Launcher context: the job is launched via Ray AIR (https://docs.ray.io/en/latest/ray-air/getting-started.html), which orchestrates the workers (see the discussion below).
Same issue.
@shaowei-su and @stgzr - taking a look now
Hi @shaowei-su - a number of fixes have gone in, could you re-run and confirm that you still have this issue? Apologies for taking so long to get to this.
@shaowei-su or @stgzr - I don't have access to any P4 GPUs; do you know if this will repro with just the A100s?
Thanks @loadams for helping on this issue - I did some investigation over the last few weeks, and I think the root cause is that the orchestrator (Ray AIR: https://docs.ray.io/en/latest/ray-air/getting-started.html) sets ACCELERATE_TORCH_DEVICE and LOCAL_RANK to values that occasionally do not match what DeepSpeed expects.
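
As a concrete illustration, a per-worker sanity check for that kind of mismatch could look like the following (a hypothetical diagnostic, not code from the actual job; DeepSpeed derives its device from LOCAL_RANK, so the two values have to agree):

import os

# Hypothetical check: fail fast if the device Ray/accelerate selected does not
# line up with the local rank DeepSpeed will use for torch.cuda.set_device().
local_rank = int(os.environ.get("LOCAL_RANK", "0"))
accel_device = os.environ.get("ACCELERATE_TORCH_DEVICE", f"cuda:{local_rank}")

if accel_device != f"cuda:{local_rank}":
    raise RuntimeError(
        f"Rank/device mismatch: LOCAL_RANK={local_rank} implies cuda:{local_rank}, "
        f"but ACCELERATE_TORCH_DEVICE={accel_device}. Mismatched devices can "
        "desynchronize collectives (hangs) and corrupt the broadcast RNG state."
    )
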
Could you point me to the fixes you mentioned above? Also happy to close this issue since it's not directly related to DeepSpeed.
Thanks! Apologies, I was confusing this with a fix for a different GH issue. It's still good to know whether it reproduces with the latest DeepSpeed for debugging, though it seems like the issue is still there.
I think it's worth doing a bit more digging, or at least trying to understand what is going on here.
But do you know if this can repro with just a subset of your devices, since I don't have both P4s and A100s?
@shaowei-su - have you tried reaching out to the accelerate folks on this, since their data loader is what is throwing the error? I may not be able to repro it without the hardware.
Hi @loadams, reproducing the failure requires a multi-node, multi-GPU Ray cluster setup (https://github.com/ray-project/ray). The issue usually occurs with > 16 GPUs (A100 or A10G; it is not specific to a GPU model). Closing this issue since it's not rooted in DeepSpeed, thanks!
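
For readers who want to reproduce this, a hypothetical sketch of that kind of Ray Train/AIR launch (worker count and names are illustrative, not taken from the reporter's setup; the train loop would contain the accelerate + DeepSpeed ZeRO Stage 3 code sketched under "To Reproduce" above):

from ray.train.torch import TorchTrainer
from ray.air.config import ScalingConfig

def train_loop_per_worker():
    # accelerate + DeepSpeed ZeRO Stage 3 training step goes here
    # (see the sketch under "To Reproduce" above).
    ...

# 16 GPU workers spread over two 8-GPU nodes, matching the reported scale.
trainer = TorchTrainer(
    train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=16, use_gpu=True),
)
result = trainer.fit()
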