
[BUG] DeepCompile: Training hang on random-sized inputs

Open eternalNight opened this issue 2 months ago • 0 comments

Description

Real-life training data may not be the same size on every rank or at every iteration. When DeepCompile is active, training on variable-length data can hang because DeepCompile requires communication among the ranks during profiling, but:

  1. The compute graph may not be identical across ranks (e.g. some ranks apply padding while others don't).
  2. Guard failures (due to tensor size changes) may occur at different iterations on different ranks.
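To illustrate the first point, here is a minimal toy model (not DeepSpeed code) of why divergent graphs deadlock: each simulated rank records the sequence of collective ops its compiled graph would issue during profiling. The function name, op names, and the `bucket` parameter are all hypothetical, chosen only to show that a shape-dependent padding branch changes the communication schedule.

```python
def collectives_for_rank(seq_len: int, bucket: int = 8) -> list:
    """Collective ops a hypothetical profiler would issue for one rank."""
    ops = ["allgather_params"]
    # Ranks whose input length is not a multiple of `bucket` get an extra
    # padding node in their graph, adding a collective the other ranks
    # never issue (hypothetical op names, for illustration only).
    if seq_len % bucket != 0:
        ops.append("allreduce_pad_meta")
    ops.append("reduce_scatter_grads")
    return ops

# Rank 0 receives a bucket-aligned batch; rank 1 does not.
rank0 = collectives_for_rank(16)
rank1 = collectives_for_rank(13)

# The op sequences differ, so the second collective on rank 0 would pair
# with a different op on rank 1 -- in a real job, both block forever.
print(rank0 == rank1)  # False: the mismatch behind the hang
```

In a real run the mismatch is not observable as a boolean; each rank simply blocks inside its next collective waiting for peers that never call it.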

To Reproduce

  1. Download https://gist.github.com/eternalNight/3c2cf8c703f1e9e7742d3b7f9e1edae3
  2. Execute: deepspeed --num_gpus=N openvla-like.py -c -r

Expected behavior

DeepCompile trains to completion (no hang) on variable-length training data.

eternalNight avatar Sep 30 '25 09:09 eternalNight