
MoE L2 norm reduce in Megatron

Open blankde opened this issue 1 year ago • 3 comments

I notice that the L2 norm for the experts is reduced twice across the model parallel group; please see: https://github.com/laekov/fastmoe/blob/cd8372b3a8a5e73d46d2b463ec30995631cfc181/examples/megatron/clip-grad-v2.2.patch#L44C2-L44C2. It is a good idea to add up the squared gradients of all the experts, but why reduce them across the model parallel group here instead of the data parallel group? What are the considerations?
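For context, the patched clipping path amounts to roughly the sketch below; `clip_grad_norm_with_experts`, the `is_expert_param` flag, and the `expert_reduce_group` argument are illustrative placeholders, not the exact names used in the patch. The question in this issue is which process group to pass for the expert all-reduce.

```python
import torch
import torch.distributed as dist


def clip_grad_norm_with_experts(parameters, max_norm, expert_reduce_group=None):
    """Sketch of clipping with a separate all-reduce for expert gradients.

    `is_expert_param` is a hypothetical attribute marking FastMoE expert
    parameters; the real patch relies on FastMoE's own flag on the parameter.
    """
    parameters = [p for p in parameters if p.grad is not None]
    device = parameters[0].grad.device if parameters else torch.device('cpu')
    expert_sq = torch.zeros(1, device=device)
    dense_sq = torch.zeros(1, device=device)
    for p in parameters:
        g_sq = p.grad.detach().float().norm(2) ** 2
        if getattr(p, 'is_expert_param', False):   # hypothetical flag
            expert_sq += g_sq
        else:
            dense_sq += g_sq
    # The line under discussion: sum the expert part across a process group.
    if dist.is_initialized() and expert_reduce_group is not None:
        dist.all_reduce(expert_sq, op=dist.ReduceOp.SUM, group=expert_reduce_group)
    total_norm = torch.sqrt(expert_sq + dense_sq).item()
    clip_coef = max_norm / (total_norm + 1e-6)
    if clip_coef < 1.0:
        for p in parameters:
            p.grad.detach().mul_(clip_coef)
    return total_norm
```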

Thanks.

blankde avatar Aug 10 '23 04:08 blankde

This is because the gradients are already synchronized across the DP group, so they are identical on every DP rank. Meanwhile, a parameter tensor is partitioned across the MP group, so its norm contribution has to be collected from the whole MP group.
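A toy, single-process illustration of this point (made-up tensor size, no real MP/DP groups): summing per-rank squared norms, which is what an all-reduce does, recovers the full norm for a tensor that is sharded across ranks, but overcounts it for a tensor that is replicated across ranks.

```python
import torch

torch.manual_seed(0)
full_grad = torch.randn(8)
true_sq_norm = full_grad.norm(2) ** 2

# MP-style: each of 2 ranks holds a distinct shard, so summing the per-rank
# squared norms reconstructs the true value.
shards = full_grad.chunk(2)
print(sum(s.norm(2) ** 2 for s in shards) / true_sq_norm)   # ~1.0

# DP-style: each rank holds an identical, already-synchronized copy, so the
# same sum would inflate the value by the group size.
copies = [full_grad.clone() for _ in range(2)]
print(sum(c.norm(2) ** 2 for c in copies) / true_sq_norm)   # 2.0
```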

laekov avatar Aug 10 '23 04:08 laekov

Thanks for your reply. But Megatron already reduces the total norm across the MP group; see: https://github.com/NVIDIA/Megatron-LM/blob/8aa4619f2b2a57b5725026a50ebd2b15e8121482/megatron/optimizer/clip_grads.py#L105

Why do we do that for the MoE gradients individually as well? Will this cause double counting?
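A toy arithmetic sketch of the double-counting concern (2 MP ranks, made-up numbers; this only models the two all-reduces, not the real code paths):

```python
mp_size = 2

# Squared expert-gradient norms held on each MP rank before any reduction.
expert_sq_per_rank = [3.0, 5.0]
# Dense contribution each rank adds to the total (assumed already
# deduplicated by Megatron's own bookkeeping in this toy).
dense_sq_per_rank = [1.0, 1.0]

# Step 1 (the patched line): all-reduce the expert part over the MP group,
# so every MP rank now holds the same summed value.
expert_sq_reduced = sum(expert_sq_per_rank)                        # 8.0 on every rank

# Step 2 (Megatron's clip_grads.py): all-reduce the TOTAL squared norm
# over the MP group again.
total_sq = sum(expert_sq_reduced + d for d in dense_sq_per_rank)   # 18.0

# The expert contribution has now been summed mp_size times (16.0 of the
# 18.0), which is the double counting being asked about here.
print(total_sq)
```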

blankde avatar Aug 10 '23 05:08 blankde

The key point is that the experts are different across a DP group of Megatron-LM (and also across the MP group in previous versions of FastMoE), so we have to reduce them. So I suppose the group should be the DP group instead of the MP group here.
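In terms of the `clip_grad_norm_with_experts` sketch earlier in this thread, this suggestion amounts to passing the data-parallel group instead of the model-parallel group (assuming a `model` is in scope; the `mpu` group helpers are those of the Megatron-LM versions these patches target, and names may differ in newer releases):

```python
from megatron import mpu  # group helpers in Megatron-LM v2.x

# Experts differ across DP ranks, so sum their squared norms over the DP group...
total_norm = clip_grad_norm_with_experts(
    model.parameters(), max_norm=1.0,
    expert_reduce_group=mpu.get_data_parallel_group())

# ...rather than over the MP group as in the current patch:
# expert_reduce_group=mpu.get_model_parallel_group()
```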

The blame shows that @zms1999 changed this code from the world comm to the MP comm. Can you please recall our intuition for doing so when making the change 9 months ago?

laekov avatar Aug 17 '23 01:08 laekov