Results: 4 issues by wyooyw
MultiSTGCnet 18182672 WangYiou
Fix [#6545]

Work:
- Expert gradient averaging: divide by dp_world_size instead of edp_world_size.
- Unit test: make sure models with different dp/ep sizes produce the same expert gradient.
**Describe the bug**
When training an MoE model with the ZeRO optimizer, the gradient of the expert weights is **ep_size times larger than** the true gradient.

**Related issue & pr**
Issue [#5618]...
bug
training
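
The scaling error described in the issue above can be illustrated with a toy calculation. This is only a sketch under stated assumptions, not DeepSpeed code: it assumes a single expert replicated on edp_world_size = dp_world_size / ep_size ranks, that every token is routed to that expert, and that the loss is averaged over all data-parallel ranks. The function `reduce_expert_grad` and the sample gradient values are hypothetical.

```python
# Toy model of expert-gradient reduction (hypothetical, not DeepSpeed code).
# per_rank_grads[r] is the gradient contribution of rank r's tokens
# to the expert weight, computed from rank r's local loss.

def reduce_expert_grad(per_rank_grads, ep_size, divisor):
    """Sum token contributions per expert replica, then reduce over replicas."""
    dp_world_size = len(per_rank_grads)
    # Each replica receives tokens from the ep_size ranks of its ep group,
    # so its local gradient is already a sum over ep_size source ranks.
    replica_grads = [
        sum(per_rank_grads[i:i + ep_size])
        for i in range(0, dp_world_size, ep_size)
    ]
    # Reduction across the edp_world_size replicas, followed by averaging.
    return sum(replica_grads) / divisor


per_rank_grads = [1.0, 2.0, 3.0, 4.0]        # made-up values
dp_world_size = len(per_rank_grads)           # 4
ep_size = 2
edp_world_size = dp_world_size // ep_size     # 2

# Reference: gradient of the loss averaged over all dp ranks.
true_grad = sum(per_rank_grads) / dp_world_size

# Dividing by edp_world_size (behaviour reported in the issue).
buggy = reduce_expert_grad(per_rank_grads, ep_size, edp_world_size)

# Dividing by dp_world_size (behaviour after the fix).
fixed = reduce_expert_grad(per_rank_grads, ep_size, dp_world_size)

print(true_grad, buggy, fixed)
```

In this toy setting, dividing by edp_world_size gives a result ep_size times larger than the reference, while dividing by dp_world_size matches it for any ep_size that divides dp_world_size, which is the property the unit test in the fix is described as checking.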