[BUG]: Gradient clip not reducing gradient norm w/ MoE expert parallel
Hi,
When using MoE with gradient clipping, the experts' gradient norms are supposed to be summed across the expert-parallel group, since expert parallelism is another kind of model parallelism. Otherwise different ranks would compute different gradient norms, and therefore apply different clip coefficients and end up with different gradients after clipping.
But I don't see any special handling in the following gradient clipping code:
https://github.com/hpcaitech/ColossalAI/blob/04a200573cac8900c8d682104b0e3bc2ee7ce857/colossalai/utils/common.py#L374-L383
I think this is a bug that may lead to convergence issues when using MoE with gradient clipping.
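For concreteness, here is a minimal sketch of what I have in mind, in plain PyTorch with `torch.distributed` already initialized. `is_moe_tensor` and `moe_group` are hypothetical stand-ins for a predicate that identifies expert parameters and for the expert-parallel process group; they are not the actual ColossalAI API.

```python
import torch
import torch.distributed as dist


def clip_grad_norm_moe(parameters, max_norm, is_moe_tensor, moe_group):
    """Clip gradients to max_norm, reducing the expert gradient norms across
    the expert-parallel group so that every rank sees the same global norm
    and applies the same clip coefficient."""
    params = [p for p in parameters if p.grad is not None]
    device = params[0].grad.device

    replicated_sq = torch.zeros(1, device=device)  # non-expert params
    expert_sq = torch.zeros(1, device=device)      # expert params on this rank
    for p in params:
        sq = p.grad.detach().float().norm(2) ** 2
        if is_moe_tensor(p):
            expert_sq += sq
        else:
            replicated_sq += sq

    # Each rank only holds a slice of the experts, so the expert squared norms
    # must be summed over the expert-parallel group. Replicated parameters have
    # identical gradients on every rank and need no reduction.
    dist.all_reduce(expert_sq, op=dist.ReduceOp.SUM, group=moe_group)

    total_norm = torch.sqrt(replicated_sq + expert_sq).item()
    clip_coef = max_norm / (total_norm + 1e-6)
    if clip_coef < 1.0:
        for p in params:
            p.grad.detach().mul_(clip_coef)
    return total_norm
```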
Looking forward to your reply, thank you!
Hi, @zhouyizhuang-megvii
You're right. We will add the reduction of the experts' gradient norms to the gradient clipping later.
The codebase has been updated a lot since then. This issue is being closed due to inactivity. Thanks.