
[BUG] DeepSpeed training with Stage 3 hangs and then fails


Describe the bug

DeepSpeed training with Stage 3 leads to the job hanging randomly, with zero GPU utilization on certain workers:

[Screenshot 2023-05-07 at 10:10:55: per-worker GPU usage]

and then later fails with a runtime error:

  File "/usr/local/lib/python3.8/dist-packages/accelerate/data_loader.py", line 369, in __iter__
    synchronize_rng_states(self.rng_types, self.synchronized_generator)
  File "/usr/local/lib/python3.8/dist-packages/accelerate/utils/random.py", line 89, in synchronize_rng_states
    synchronize_rng_state(RNGType(rng_type), generator=generator)
  File "/usr/local/lib/python3.8/dist-packages/accelerate/utils/random.py", line 84, in synchronize_rng_state
    generator.set_state(rng_state)
RuntimeError: Invalid mt19937 state
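
For reference, the Invalid mt19937 state error is raised by torch.Generator.set_state() when the provided state tensor does not decode to a valid Mersenne Twister engine state; in this traceback that tensor is the RNG state Accelerate synchronizes across processes. A minimal sketch (not Accelerate's code path, just an illustration of when PyTorch raises this exact error):

```python
import torch

g = torch.Generator()

state = g.get_state()                # valid MT19937 state, stored as a uint8 tensor
g.set_state(state)                   # round-trips fine

corrupted = torch.zeros_like(state)  # same size, but the contents are not a valid engine state
g.set_state(corrupted)               # raises RuntimeError: Invalid mt19937 state
```

So the error is consistent with a rank receiving an RNG state tensor that was corrupted or out of sync during the synchronization step.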

To Reproduce

Train a Hugging Face GPT-NeoX model (https://huggingface.co/docs/transformers/model_doc/gpt_neox) on 2 P4 nodes with 8 A100 GPUs each.
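
The training code follows the usual Accelerate + DeepSpeed ZeRO Stage 3 pattern. The sketch below is an illustrative stand-in (checkpoint name, dummy data, and hyperparameters are placeholders, not the actual job), run across 16 processes (2 nodes x 8 GPUs) with a distributed launcher:

```python
# Illustrative repro sketch: Accelerate + DeepSpeed ZeRO Stage 3.
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin
from transformers import AutoModelForCausalLM

# ZeRO Stage 3 via Accelerate's DeepSpeed integration.
ds_plugin = DeepSpeedPlugin(zero_stage=3, gradient_accumulation_steps=1)
accelerator = Accelerator(deepspeed_plugin=ds_plugin)

model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neox-20b")  # placeholder checkpoint
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Random token ids standing in for the real dataset; shuffle=True exercises the
# synchronize_rng_states path shown in the traceback above.
input_ids = torch.randint(0, model.config.vocab_size, (64, 128))
loader = DataLoader(TensorDataset(input_ids), batch_size=2, shuffle=True)

model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

model.train()
for (batch,) in loader:
    outputs = model(input_ids=batch, labels=batch)
    accelerator.backward(outputs.loss)
    optimizer.step()
    optimizer.zero_grad()
```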


ds_report output:

async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/usr/local/lib/python3.8/dist-packages/torch']
torch version .................... 1.13.1+cu117
deepspeed install path ........... ['/usr/local/lib/python3.8/dist-packages/deepspeed']
deepspeed info ................... 0.9.2, unknown, unknown
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.7
deepspeed wheel compiled w. ...... torch 1.13, cuda 11.7


shaowei-su opened this issue May 07 '23 22:05

Same issue.

stgzr commented May 16 '23 02:05

@shaowei-su and @stgzr - taking a look now

loadams commented Jun 06 '23 17:06

Hi @shaowei-su - a number of fixes have gone in; could you re-run and confirm whether you still hit this issue? Apologies for taking so long to get to this.

loadams commented Jul 07 '23 22:07

@shaowei-su or @stgzr - I don't have access to any P4 GPUs; do you know if this will repro with just the A100s?

loadams commented Jul 07 '23 23:07

Thanks @loadams for helping with this issue - I did some investigation over the last few weeks, and I think the root cause is the orchestrator (Ray AIR: https://docs.ray.io/en/latest/ray-air/getting-started.html) setting ACCELERATE_TORCH_DEVICE and LOCAL_RANK to values that occasionally do not match what DeepSpeed expects.
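
A small diagnostic along these lines makes the mismatch visible in the per-worker logs; the env var names are the ones above, and the rest is an illustrative sketch rather than the actual training code:

```python
import os

import torch
import torch.distributed as dist


def log_rank_env() -> None:
    """Print the orchestrator-provided env vars next to what torch actually sees,
    so a device/rank mismatch shows up in the worker logs."""
    env_local_rank = os.environ.get("LOCAL_RANK")
    env_device = os.environ.get("ACCELERATE_TORCH_DEVICE")  # set by Ray AIR, typically "cuda:<idx>"
    dist_rank = dist.get_rank() if dist.is_initialized() else None
    cuda_device = torch.cuda.current_device() if torch.cuda.is_available() else None
    print(
        f"[rank-check] LOCAL_RANK={env_local_rank} "
        f"ACCELERATE_TORCH_DEVICE={env_device} "
        f"dist.get_rank()={dist_rank} "
        f"torch.cuda.current_device()={cuda_device}",
        flush=True,
    )
    # Heuristic: with one process per GPU, LOCAL_RANK should normally match the
    # CUDA device index that ACCELERATE_TORCH_DEVICE points at.
    if env_local_rank is not None and env_device is not None and env_device.startswith("cuda:"):
        if int(env_local_rank) != int(env_device.split(":")[1]):
            print("[rank-check] WARNING: LOCAL_RANK does not match ACCELERATE_TORCH_DEVICE", flush=True)
```

Calling this in each worker right after distributed init shows whether the orchestrator-assigned device disagrees with the rank DeepSpeed ends up using.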

shaowei-su commented Jul 07 '23 23:07

Could you point me to the fixes you mentioned above? Also happy to close this issue since it's not directly related to DeepSpeed.

shaowei-su commented Jul 07 '23 23:07

Thanks! Apologies, I was confusing this with a fix for a different GH issue. It's still good to know whether it reproduces with the latest DeepSpeed for debugging, though it sounds like the problem is still there.

I think it's worth doing a bit more digging, or at least trying to understand what is going on here.

But do you know if this can repro with just a subset of your devices, since I don't have both P4s and A100s?

loadams commented Jul 10 '23 16:07

@shaowei-su - have you tried reaching out to the Accelerate folks on this, since their data loader is what is throwing the error? I may not be able to repro it without the hardware.

loadams commented Jul 10 '23 19:07

Hi @loadams, reproducing the failure requires a multi-node, multi-GPU Ray cluster setup (https://github.com/ray-project/ray). The issue usually occurs with more than 16 GPUs (A100 or A10G; it is not specific to the GPU model). Closing this issue since it's not rooted in DeepSpeed, thanks!
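
Roughly, the cluster-level setup looks like the sketch below (Ray AIR-era APIs; the worker count and the training function body are illustrative placeholders, not the actual code):

```python
# Sketch only: Ray AIR / Ray Train (~2.x) wrapping the per-worker training loop.
from ray.air.config import ScalingConfig
from ray.train.torch import TorchTrainer


def train_loop_per_worker():
    # ... the Accelerate + DeepSpeed ZeRO-3 training loop runs here on each worker ...
    pass


trainer = TorchTrainer(
    train_loop_per_worker=train_loop_per_worker,
    # 16 workers = 2 nodes x 8 GPUs, matching the scale at which the issue appears.
    scaling_config=ScalingConfig(num_workers=16, use_gpu=True),
)
result = trainer.fit()
```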

shaowei-su commented Jul 16 '23 20:07