Megatron-LM
[QUESTION] mtp_loss is scaled incorrectly
Your question
When `calculate_per_token_loss` is enabled, `finalize_model_grads` scales the gradients by `num_tokens` (the total number of tokens in the iteration). However, I observed that the number of tokens (the loss-mask sum) corresponding to `mtp_loss` can be inconsistent with the `num_tokens` used in `finalize_model_grads`, so the MTP loss ends up divided by the wrong count. I think this is a bug.
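
To make the suspected mismatch concrete, here is a minimal, self-contained sketch, not Megatron-LM code: the token counts are made up and the single parameter `w` stands in for the model. It only illustrates what happens when gradients from two summed per-token losses with different loss-mask counts are divided by one global `num_tokens`:

```python
import torch

# Hypothetical masked token counts for each loss term.
num_tokens_main = 4096  # loss_mask.sum() for the main LM loss
num_tokens_mtp = 4064   # loss_mask.sum() for mtp_loss (differs, e.g. if the
                        # MTP heads have fewer valid target positions)

# A single parameter standing in for the model; pretend every valid token
# contributes a per-token loss of w, so the summed (unreduced) losses are:
w = torch.tensor(1.0, requires_grad=True)
main_loss = w * num_tokens_main
mtp_loss = w * num_tokens_mtp

# With calculate_per_token_loss, backward runs on the raw sums and the
# averaging is deferred to the gradient-finalization step.
(main_loss + mtp_loss).backward()

# finalize_model_grads then divides every gradient by one global count,
# which here tracks only the main loss's tokens.
num_tokens_global = num_tokens_main
w.grad /= num_tokens_global

# If each term were averaged over its own mask count, the gradient would
# be 1.0 + 1.0 = 2.0; instead the mtp contribution is off by 4064/4096.
print(w.grad.item())  # 1.9921875
```

In this toy setup the `mtp_loss` contribution is effectively under-weighted by the ratio of the two mask counts; the discrepancy grows as the counts diverge.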