Fix sequence parallel (Ulysses) grad scale for zero0

Open inkcherry opened this issue 1 year ago • 0 comments

Use dp_world_size instead of seq_dp_world_size for grad reduction. Currently, for zero0, only sparse tensors use the correct world size.

Grad norm test on a tiny model with sp=4:

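| grad_norm | step1 | step2 | step3 | step4 | step5 | step100 |
| --- | --- | --- | --- | --- | --- | --- |
| zero1 | 15.825 | 16.646 | 15.853 | 16.159 | 17.333 | 15.555 |
| zero0 | 3.956 | 4.161 | 3.963 | 4.040 | 4.333 | 3.889 |
| zero0 (this patch) | 15.825 | 16.646 | 15.853 | 16.159 | 17.333 | 15.554 |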
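
The unpatched zero0 norms are roughly 1/4 of the zero1 norms, which matches sp=4. A minimal sketch of why the divisor matters, assuming gradients are summed over all dp * sp ranks and then divided by a world size. This is a pure-Python illustration, not the DeepSpeed code path; `reduce_grads`, `l2_norm`, and the gradient values are hypothetical, and only `dp_world_size` and `seq_dp_world_size` are names taken from the description above.

```python
import math


def reduce_grads(partial_grads, divisor):
    """Sum per-rank gradient vectors elementwise, then divide by `divisor`."""
    summed = [sum(vals) for vals in zip(*partial_grads)]
    return [v / divisor for v in summed]


def l2_norm(vec):
    """L2 norm of a flat gradient vector."""
    return math.sqrt(sum(v * v for v in vec))


sp_world_size = 4  # sequence parallel degree, as in the test above
dp_world_size = 2  # assumed data parallel degree, for illustration only
seq_dp_world_size = sp_world_size * dp_world_size

# Made-up partial gradients, one vector per rank in the combined group.
partial_grads = [
    [0.5 * (rank + 1), -0.25 * (rank + 1), 1.0]
    for rank in range(seq_dp_world_size)
]

# Dividing by dp_world_size (the behavior this patch describes).
norm_dp = l2_norm(reduce_grads(partial_grads, dp_world_size))
# Dividing by seq_dp_world_size (the unpatched zero0 behavior).
norm_seq_dp = l2_norm(reduce_grads(partial_grads, seq_dp_world_size))

# The two norms differ by a factor of sp_world_size.
ratio = norm_dp / norm_seq_dp
print(f"norm with dp divisor:     {norm_dp:.3f}")
print(f"norm with seq_dp divisor: {norm_seq_dp:.3f}")
print(f"ratio: {ratio:.3f}")
```

Under these assumptions the ratio depends only on the two divisors, so it equals sp regardless of the gradient values, consistent with the gap between the zero0 and zero1 rows.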

May 21 '24 07:05 inkcherry