DeepSpeed
Fix sequence parallel (Ulysses) grad scale for zero0
Use dp_world_size instead of seq_dp_world_size for grad reduction. Currently, for zero0, only sparse tensors use the correct world_size.
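Grad norm test on a tiny model with sp=4. Without this patch, the zero0 grad norm is about 1/4 of the zero1 value (a factor of 1/sp); with this patch, zero0 matches zero1: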
| grad_norm | step1 | step2 | step3 | step4 | step5 | step100 |
|---|---|---|---|---|---|---|
| zero1 | 15.825 | 16.646 | 15.853 | 16.159 | 17.333 | 15.555 |
| zero0 | 3.956 | 4.161 | 3.963 | 4.040 | 4.333 | 3.889 |
| zero0(this patch) | 15.825 | 16.646 | 15.853 | 16.159 | 17.333 | 15.554 |
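
As a quick sanity check on the table above, the sketch below recomputes the zero1/zero0 ratio per step and compares the patched zero0 values against zero1. The numbers are copied from the table; the variable names are illustrative only and are not DeepSpeed identifiers.

```python
# Values copied from the grad_norm table (step1..step5, step100).
sp = 4  # sequence parallel size used in the test

zero1 = [15.825, 16.646, 15.853, 16.159, 17.333, 15.555]
zero0_before = [3.956, 4.161, 3.963, 4.040, 4.333, 3.889]
zero0_patched = [15.825, 16.646, 15.853, 16.159, 17.333, 15.554]

# Ratio of zero1 to unpatched zero0 at each step; expected to be close to sp.
ratios = [a / b for a, b in zip(zero1, zero0_before)]
max_ratio_error = max(abs(r - sp) for r in ratios)

# Largest absolute difference between zero1 and patched zero0.
max_patched_diff = max(abs(a - b) for a, b in zip(zero1, zero0_patched))

print("zero1 / zero0 (before patch):", [round(r, 3) for r in ratios])
print("max |ratio - sp|:", max_ratio_error)
print("max |zero1 - zero0(this patch)|:", max_patched_diff)
```

Every ratio is within rounding of sp=4, which is consistent with the unpatched zero0 gradients being scaled down by the sequence parallel size. The only remaining difference after the patch is in the last printed digit at step100.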