
[BUG] Expert gradient scaling problem with ZeRO optimizer

Open · wyooyw opened this issue 5 months ago · 1 comment

Describe the bug

When training an MoE model with the ZeRO optimizer, the gradient of the expert weights is ep_size times larger than the true gradient.

Related issue & PR

Issue #5618 describes this bug (the second bug listed in that issue), but it has since been closed, so I am creating a new issue here. PR #5259 fixed the bug in the bf16 optimizer; the ZeRO optimizer still needs the same fix.

To Reproduce

1. Prepare two models (model1 & model2) using the same input data and initial parameters, both with the ZeRO stage 1 (or 2) optimizer. Model1 uses ep=1, model2 uses ep=2.
2. Perform one forward and backward pass on both models.
3. Dump the gradients of the expert weights from both models.
4. Observe that the expert-weight gradients in model2 are ep_size times those in model1 (a minimal reproduction sketch follows this list).
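
For concreteness, here is a minimal sketch of the steps above. It assumes a toy MoE model built with deepspeed.moe.layer.MoE; names such as ExpertMLP, ToyMoEModel, and dump_expert_grads are illustrative helpers, not DeepSpeed APIs, and the exact way gradients are read back may differ depending on the ZeRO stage and configuration.

```python
# Illustrative reproduction sketch (not a verified script).
# Run once with ep_size=1 and once with ep_size=2
# (e.g. `deepspeed --num_gpus 2 repro.py`) and compare the dumped expert grads.
import torch
import deepspeed
from deepspeed.moe.layer import MoE
from deepspeed.utils import safe_get_full_grad

class ExpertMLP(torch.nn.Module):
    def __init__(self, hidden):
        super().__init__()
        self.fc = torch.nn.Linear(hidden, hidden)

    def forward(self, x):
        return self.fc(x)

class ToyMoEModel(torch.nn.Module):
    def __init__(self, hidden=8, num_experts=2, ep_size=1):
        super().__init__()
        self.moe = MoE(hidden_size=hidden,
                       expert=ExpertMLP(hidden),
                       num_experts=num_experts,
                       ep_size=ep_size,
                       k=1)

    def forward(self, x):
        out, _, _ = self.moe(x)            # MoE returns (output, l_aux, exp_counts)
        return out.float().pow(2).mean()   # arbitrary scalar loss

ds_config = {
    "train_batch_size": 4,
    "train_micro_batch_size_per_gpu": 2,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
    "zero_optimization": {"stage": 1},     # same issue expected with stage 2
}

def dump_expert_grads(ep_size, seed=0):
    torch.manual_seed(seed)                # same initial parameters in both runs
    model = ToyMoEModel(ep_size=ep_size)
    engine, _, _, _ = deepspeed.initialize(model=model,
                                           config=ds_config,
                                           model_parameters=model.parameters())
    torch.manual_seed(seed)                # same input data in both runs
    x = torch.randn(2, 8, device=engine.device)
    loss = engine(x)
    engine.backward(loss)
    # Read back expert gradients before optimizer.step(); filtering by
    # "experts" in the parameter name is a heuristic.
    grads = {}
    for name, p in engine.named_parameters():
        if "experts" not in name:
            continue
        g = safe_get_full_grad(p)          # may return None depending on stage/timing
        if g is not None:
            grads[name] = g.clone()
    return grads
```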

Expected behavior

The expert-weight gradients should be identical regardless of ep_size.
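
As a hypothetical check on top of the sketch above (grads_ep1 and grads_ep2 are the dictionaries returned by dump_expert_grads for the two runs, assuming the dumped parameter names can be matched across runs):

```python
# After a fix, the two dumps should agree; today the ep_size=2 gradients
# come out roughly ep_size times larger.
for name, g1 in grads_ep1.items():
    g2 = grads_ep2[name]
    assert torch.allclose(g1, g2, rtol=1e-5, atol=1e-6), \
        f"expert gradient mismatch for {name}"
```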

wyooyw · Sep 17 '24 15:09