Wenwen Qu
Thanks for your reply. However, Megatron already reduces the total norm across the MP group; see: https://github.com/NVIDIA/Megatron-LM/blob/8aa4619f2b2a57b5725026a50ebd2b15e8121482/megatron/optimizer/clip_grads.py#L105 Why do we reduce the MoE grads separately? Will this cause double counting?
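To make the concern concrete, here is a minimal sketch of what I mean by double counting. It is not the code under review: it simulates a 2-rank MP group in plain Python, assumes the reduction is a sum of squared L2 norms, and uses made-up numbers. `all_reduce_sum`, `dense_sq`, and `moe_sq` are hypothetical names for illustration only.

```python
import math

def all_reduce_sum(values):
    """Simulate an all-reduce SUM: every rank ends up holding the same total."""
    total = sum(values)
    return [total for _ in values]

# Hypothetical per-rank squared L2 norms for a 2-rank MP group.
dense_sq = [9.0, 16.0]  # non-MoE grads, sharded across the two ranks
moe_sq = [4.0, 5.0]     # MoE grads held locally on each rank

# Reference: each gradient element counted exactly once.
true_norm = math.sqrt(sum(dense_sq) + sum(moe_sq))

# Case A: add local terms, then reduce once across the MP group.
single = all_reduce_sum([d + m for d, m in zip(dense_sq, moe_sq)])
single_norm = math.sqrt(single[0])

# Case B: reduce the MoE term on its own first, add it to the local
# dense term, then reduce the sum across the MP group a second time.
moe_reduced = all_reduce_sum(moe_sq)
double = all_reduce_sum([d + m for d, m in zip(dense_sq, moe_reduced)])
double_norm = math.sqrt(double[0])

print(true_norm, single_norm, double_norm)
```

In Case B the already-reduced MoE term is summed once per rank, so it enters the total as many times as there are ranks in the group. Whether the PR actually hits this depends on whether the MoE norm is added before or after the MP-group reduction, which is what I am asking about.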