DeepSpeed
Fix expert grad scaling problem with ZeRO optimizer
Fixes #6545.
Changes:
- Expert gradient averaging: divide by dp_world_size instead of edp_world_size.
- Unit test: verify that a model run with different dp/ep settings produces the same expert gradients.
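
The effect of the divisor can be illustrated with a small toy model. This is only a sketch, not DeepSpeed code: `reduce_expert_grad` and its arguments are hypothetical names, and it assumes that each expert replica's local gradient already sums the contributions of every rank in its expert-parallel group, and that `edp_world_size = dp_world_size / ep_size`.

```python
def reduce_expert_grad(per_rank_grads, ep_size, divisor):
    """Toy model of expert gradient averaging (hypothetical helper, not DeepSpeed API).

    per_rank_grads: gradient contribution of each data-parallel rank's local
                    loss to one expert weight (scalars, for simplicity).
    ep_size:        expert-parallel group size.
    divisor:        "dp" to divide by dp_world_size, "edp" to divide by edp_world_size.
    """
    dp_world_size = len(per_rank_grads)
    assert dp_world_size % ep_size == 0
    edp_world_size = dp_world_size // ep_size

    # Assumption: one expert replica per expert-parallel group, and its local
    # gradient is the sum of the contributions from the ranks in that group.
    replica_grads = [
        sum(per_rank_grads[i * ep_size:(i + 1) * ep_size])
        for i in range(edp_world_size)
    ]

    # Sum-reduce across the replicas, then average with the chosen divisor.
    total = sum(replica_grads)
    denom = dp_world_size if divisor == "dp" else edp_world_size
    return total / denom


if __name__ == "__main__":
    grads = [1.0, 2.0, 3.0, 4.0]  # dp_world_size = 4
    for ep in (1, 2, 4):
        print(ep, reduce_expert_grad(grads, ep, "dp"), reduce_expert_grad(grads, ep, "edp"))
```

Under these assumptions, dividing by dp_world_size yields the same averaged gradient for every ep setting, while dividing by edp_world_size yields a value that grows with ep_size. That is the invariant the unit test checks.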