DeepSpeed

Fix sequence parallel (Ulysses) grad scale for zero0

Open inkcherry opened this issue 1 year ago • 0 comments

Use dp_world_size for grad reduction instead of seq_dp_world_size. Currently, with zero0, only sparse tensors are reduced with the correct world size; dense gradients are divided by seq_dp_world_size and end up scaled down.
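A toy sketch of why the divisor matters, in plain Python rather than DeepSpeed code. It assumes gradients are summed over every rank in the combined sequence/data-parallel group and then divided by a world size, and that seq_dp_world_size equals dp_world_size times the sequence-parallel degree. The helper names and the choice of dp_world_size = 2 are hypothetical and only for illustration.

```python
import math
import random


def reduce_grads(per_rank_grads, divisor):
    """Sum gradients elementwise across ranks, then divide by `divisor`."""
    summed = [sum(vals) for vals in zip(*per_rank_grads)]
    return [g / divisor for g in summed]


def l2_norm(vec):
    """Plain L2 norm of a flat gradient vector."""
    return math.sqrt(sum(x * x for x in vec))


random.seed(0)
dp_world_size = 2  # assumption: illustrative data-parallel degree
sp_world_size = 4  # matches sp=4 in the test reported here
seq_dp_world_size = dp_world_size * sp_world_size  # assumption: combined group size

# One fake gradient vector per rank in the combined group.
per_rank_grads = [
    [random.gauss(0.0, 1.0) for _ in range(16)]
    for _ in range(seq_dp_world_size)
]

norm_seq_dp = l2_norm(reduce_grads(per_rank_grads, seq_dp_world_size))
norm_dp = l2_norm(reduce_grads(per_rank_grads, dp_world_size))

# The two norms differ by exactly the sequence-parallel degree.
ratio = norm_dp / norm_seq_dp
print(f"norm ratio (dp divisor / seq_dp divisor): {ratio:.6f}")
```

Under these assumptions, dividing by seq_dp_world_size shrinks the gradient norm by a factor equal to the sequence-parallel degree, whatever the gradient values are.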

Grad norm test on a tiny model with sp=4:

| grad_norm | step1 | step2 | step3 | step4 | step5 | step100 |
|---|---|---|---|---|---|---|
| zero1 | 15.825 | 16.646 | 15.853 | 16.159 | 17.333 | 15.555 |
| zero0 | 3.956 | 4.161 | 3.963 | 4.040 | 4.333 | 3.889 |
| zero0 (this patch) | 15.825 | 16.646 | 15.853 | 16.159 | 17.333 | 15.554 |
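As a quick sanity check on the numbers above, the snippet below divides the reported zero1 grad norms by the unpatched zero0 ones. The values are copied from the test results; the variable names are only illustrative.

```python
# Grad norms reported for step1..step5 and step100.
zero1 = [15.825, 16.646, 15.853, 16.159, 17.333, 15.555]
zero0 = [3.956, 4.161, 3.963, 4.040, 4.333, 3.889]
zero0_patched = [15.825, 16.646, 15.853, 16.159, 17.333, 15.554]

sp = 4  # sequence-parallel degree used in the test

# Ratio of the zero1 norm to the unpatched zero0 norm at each step.
ratios = [a / b for a, b in zip(zero1, zero0)]

# Largest absolute gap between zero1 and the patched zero0 run.
max_gap = max(abs(a - b) for a, b in zip(zero1, zero0_patched))

for r in ratios:
    print(f"{r:.4f}")
print(f"max |zero1 - zero0 patched| = {max_gap:.3f}")
```

Every ratio lands within rounding of 4, which matches sp=4 and is consistent with the unpatched zero0 path dividing by a world size that is too large by the sequence-parallel degree. With the patch, zero0 tracks zero1 to within 0.001 at the reported steps.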

inkcherry, May 21 '24 07:05