[BUG]: Gradient clip not reducing gradient norm w/ MoE expert parallel
Hi,
When using MoE with gradient clipping, the experts' gradient norms are supposed to be summed across the expert-parallel group, since expert parallelism is another kind of model parallelism. Otherwise different ranks would compute different gradient norms, and therefore apply different clip coefficients and end up with different gradients after clipping.
But I don't see any special handling in the following gradient clipping code:
https://github.com/hpcaitech/ColossalAI/blob/04a200573cac8900c8d682104b0e3bc2ee7ce857/colossalai/utils/common.py#L374-L383
I think this is a bug that may lead to convergence issues when using MoE with gradient clipping.
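For concreteness, here is a minimal sketch of what I have in mind, in plain PyTorch with `torch.distributed` already initialized. `is_moe_tensor` and `moe_group` are hypothetical stand-ins for a predicate that identifies expert parameters and for the expert-parallel process group; they are not the actual ColossalAI API.

```python
import torch
import torch.distributed as dist


def clip_grad_norm_moe(parameters, max_norm, is_moe_tensor, moe_group):
    """Clip gradients to max_norm, reducing the expert gradient norms across
    the expert-parallel group so that every rank sees the same global norm
    and applies the same clip coefficient."""
    params = [p for p in parameters if p.grad is not None]
    device = params[0].grad.device

    replicated_sq = torch.zeros(1, device=device)  # non-expert params
    expert_sq = torch.zeros(1, device=device)      # expert params on this rank
    for p in params:
        sq = p.grad.detach().float().norm(2) ** 2
        if is_moe_tensor(p):
            expert_sq += sq
        else:
            replicated_sq += sq

    # Each rank only holds a slice of the experts, so the expert squared norms
    # must be summed over the expert-parallel group. Replicated parameters have
    # identical gradients on every rank and need no reduction.
    dist.all_reduce(expert_sq, op=dist.ReduceOp.SUM, group=moe_group)

    total_norm = torch.sqrt(replicated_sq + expert_sq).item()
    clip_coef = max_norm / (total_norm + 1e-6)
    if clip_coef < 1.0:
        for p in params:
            p.grad.detach().mul_(clip_coef)
    return total_norm
```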
Looking forward to your reply, thank you!
Hi, @zhouyizhuang-megvii
You're right. We will add the reduction of the experts' gradient norms to the gradient clipping later.
The codebase has been updated a lot since then. This issue is being closed due to inactivity. Thanks.