Megatron-LM
[QUESTION] mtp_loss is scaled incorrectly
Your question
When `calculate_per_token_loss` is enabled, `finalize_model_grads` scales the gradients by `num_tokens` (the total number of tokens in the iteration). However, I observed that the number of tokens (the loss-mask sum) corresponding to `mtp_loss` can be inconsistent with the `num_tokens` used in `finalize_model_grads`, so the MTP loss ends up divided by the wrong count. I think this is a bug.
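
To make the suspected mismatch concrete, here is a minimal, self-contained sketch, not Megatron-LM code: the token counts are made up and the single parameter `w` stands in for the model. It only illustrates what happens when gradients from two summed per-token losses with different loss-mask counts are divided by one global `num_tokens`:

```python
import torch

# Hypothetical masked token counts for each loss term.
num_tokens_main = 4096  # loss_mask.sum() for the main LM loss
num_tokens_mtp = 4064   # loss_mask.sum() for mtp_loss (differs, e.g. if the
                        # MTP heads have fewer valid target positions)

# A single parameter standing in for the model; pretend every valid token
# contributes a per-token loss of w, so the summed (unreduced) losses are:
w = torch.tensor(1.0, requires_grad=True)
main_loss = w * num_tokens_main
mtp_loss = w * num_tokens_mtp

# With calculate_per_token_loss, backward runs on the raw sums and the
# averaging is deferred to the gradient-finalization step.
(main_loss + mtp_loss).backward()

# finalize_model_grads then divides every gradient by one global count,
# which here tracks only the main loss's tokens.
num_tokens_global = num_tokens_main
w.grad /= num_tokens_global

# If each term were averaged over its own mask count, the gradient would
# be 1.0 + 1.0 = 2.0; instead the mtp contribution is off by 4064/4096.
print(w.grad.item())  # 1.9921875
```

In this toy setup the `mtp_loss` contribution is effectively under-weighted by the ratio of the two mask counts; the discrepancy grows as the counts diverge.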