[BUG] DeepCompile: Training hangs on random-sized inputs
Description
Real-world training data often varies in size across ranks and across iterations. When DeepCompile is active, training on variable-length data can hang because DeepCompile requires communication among the ranks during profiling, but:
- The compute graph may not be identical across ranks (e.g., some ranks pad their inputs while others don't).
- Guard failures (due to tensor size changes) may occur at different iterations on different ranks.
Either condition leaves the ranks issuing mismatched collectives during profiling, so training deadlocks; a minimal sketch of the triggering pattern follows.
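For illustration only, here is a minimal sketch of the pattern that triggers the hang; it is not the code from the gist linked below. Each rank draws an independent random sequence length every step, so compiled graphs and recompilation points diverge across ranks. The toy model and the `get_batch` helper are assumptions; the `"compile": {"deepcompile": true}` config key and the `engine.compile()` call follow the DeepCompile documentation, but the exact settings may differ from the real reproducer.

```python
# Hypothetical reproducer sketch (assumed names/config, not the gist's code).
# Launch with: deepspeed --num_gpus=N this_script.py
import torch
import torch.distributed as dist
import deepspeed

def get_batch(rank, step, vocab_size=1000, batch_size=2):
    # Each rank seeds its own generator, so sequence lengths differ both
    # across ranks and across iterations -- the condition described above.
    g = torch.Generator().manual_seed(rank * 10_000 + step)
    seq_len = int(torch.randint(64, 512, (1,), generator=g))
    return torch.randint(0, vocab_size, (batch_size, seq_len), device="cuda")

# Toy stand-in for the OpenVLA-like model in the gist.
model = torch.nn.Sequential(
    torch.nn.Embedding(1000, 256),
    torch.nn.Linear(256, 1000),
)

engine, _, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config={
        "train_micro_batch_size_per_gpu": 2,
        "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
        "zero_optimization": {"stage": 1},
        "compile": {"deepcompile": True},  # enable DeepCompile
    },
)
engine.compile()  # apply torch.compile with DeepCompile's passes

for step in range(100):
    batch = get_batch(dist.get_rank(), step)
    loss = engine(batch).float().mean()
    # With input sizes diverging per rank, guard failures and profiling
    # collectives no longer line up across ranks and training can hang here.
    engine.backward(loss)
    engine.step()
```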
To Reproduce
- Download https://gist.github.com/eternalNight/3c2cf8c703f1e9e7742d3b7f9e1edae3
- Execute `deepspeed --num_gpus=N openvla-like.py -c -r`
Expected behavior
DeepCompile works for variable-length training data.