wyooyw

Results 4 issues of wyooyw

MultiSTGCnet 18182672 WangYiou

Fix [#6545]

Work:
- expert gradient average: divide by dp_world_size instead of edp_world_size
- unit test: make sure models with different dp/ep sizes have the same expert gradient

**Describe the bug** When training an MoE model with the ZeRO optimizer, the gradient of the expert weights is **ep_size times larger than** the true gradient. **Related issue & PR** Issue [#5618]...
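The scaling error described above can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not the library's actual code: it assumes `dp_world_size = ep_size * edp_world_size`, that each rank's loss is a mean over its local batch, and that the true gradient is the mean of the per-rank contributions over all data-parallel ranks. The function `averaged_expert_grad` and its arguments are hypothetical names introduced for illustration.

```python
def averaged_expert_grad(per_rank_grads, ep_size, divisor):
    """Toy model of how one expert weight's gradient is reduced.

    per_rank_grads[i] is the (scalar) gradient that rank i's local loss
    contributes to the expert weight. All names here are hypothetical.
    """
    dp_world_size = len(per_rank_grads)
    if dp_world_size % ep_size != 0:
        raise ValueError("dp_world_size must be a multiple of ep_size")
    # Assumption: number of replicas of each expert (expert-data-parallel size).
    edp_world_size = dp_world_size // ep_size

    # Step 1: tokens from every rank in an expert-parallel group are routed
    # to the expert, so its local gradient already sums ep_size contributions.
    group_sums = [
        sum(per_rank_grads[g * ep_size:(g + 1) * ep_size])
        for g in range(edp_world_size)
    ]

    # Step 2: a sum all-reduce across the replicas of the expert.
    total = sum(group_sums)

    # Step 3: the averaging step; the choice of divisor is the point at issue.
    if divisor == "edp":
        return total / edp_world_size
    if divisor == "dp":
        return total / dp_world_size
    raise ValueError("divisor must be 'edp' or 'dp'")


grads = [1.0, 2.0, 3.0, 4.0]          # four data-parallel ranks
true_grad = sum(grads) / len(grads)   # 2.5

print(averaged_expert_grad(grads, ep_size=2, divisor="edp"))  # 5.0, ep_size times too large
print(averaged_expert_grad(grads, ep_size=2, divisor="dp"))   # 2.5, matches true_grad
```

In this toy model, dividing by `edp_world_size` leaves the result `ep_size` times larger than the mean over all ranks, while dividing by `dp_world_size` recovers it regardless of how the ranks are split between dp and ep.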

bug
training