chaoyang
### Checklist

- [ ] 1. I have searched related issues but cannot get the expected help.
- [ ] 2. The bug has not been fixed in the latest...

**Your question**

When `calculate_per_token_loss` is enabled, `finalize_model_grads` scales the gradients by the number of tokens (the total token count for the iteration). However, I observed...
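To make the scaling being asked about concrete, here is a toy sketch in plain Python. It is not Megatron-LM's code: the function names `per_token_scale` and `per_microbatch_mean` and the numbers are hypothetical, and gradients are reduced to single floats for illustration. It only shows how dividing the summed gradient by the iteration's total token count can differ from averaging per-microbatch means.

```python
# Toy illustration only; these helpers are hypothetical and are NOT Megatron-LM APIs.

def per_token_scale(grad_sums, token_counts):
    """Sum gradients over microbatches, then divide by the total token count."""
    total_tokens = sum(token_counts)
    return sum(grad_sums) / total_tokens


def per_microbatch_mean(grad_sums, token_counts):
    """Average each microbatch over its own tokens, then average the microbatches."""
    means = [g / n for g, n in zip(grad_sums, token_counts)]
    return sum(means) / len(means)


# Two microbatches with unequal token counts (made-up values).
grad_sums = [8.0, 3.0]   # gradient summed over the tokens of each microbatch
token_counts = [4, 1]    # number of tokens in each microbatch

per_token = per_token_scale(grad_sums, token_counts)
per_mb = per_microbatch_mean(grad_sums, token_counts)
print(per_token, per_mb)
```

With unequal token counts the two results differ, because the per-token form weights every token equally while the per-microbatch form weights every microbatch equally.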