
Fix Megatron losses that were incorrect for micro_batch_size > 1

Open · odelalleau opened this pull request 5 months ago

What does this PR do ?

Fixes the losses of several Megatron models that were incorrect for micro_batch_size > 1.

Collection: nlp, multimodal

Changelog

  • Fix losses of several Megatron models (MegatronGPTModel, MegatronNevaModel, MegatronBertModel, MegatronLMEncoderDecoderModel) that were incorrect for micro_batch_size > 1
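
To make the averaging question concrete, here is a minimal pure-Python sketch of the per-sample averaging described in this PR (the function name is hypothetical, and the real code operates on [micro_batch_size, seq_len] PyTorch tensors, not lists):

```python
def masked_loss_per_sample(token_losses, loss_mask):
    """Per-sample average of masked token losses, then mean over samples.

    token_losses, loss_mask: one list of per-token values per sample.
    A fully masked sample contributes zero loss instead of producing NaN.
    """
    per_sample = []
    for losses, mask in zip(token_losses, loss_mask):
        n_valid = sum(mask)
        masked_sum = sum(l * m for l, m in zip(losses, mask))
        # max(n_valid, 1) avoids 0/0 for a fully masked sample;
        # its numerator is already zero, so its loss is zero.
        per_sample.append(masked_sum / max(n_valid, 1))
    return sum(per_sample) / len(per_sample)
```

With micro_batch_size=1 this agrees with a per-token average; with micro_batch_size > 1 the two diverge whenever samples have different numbers of valid tokens. For example, token losses [1, 2, 3] with mask [1, 1, 0] and [4, 4, 4] with mask [1, 1, 1] give a per-sample average of (1.5 + 4.0) / 2 = 2.75, whereas a per-token average over the whole micro-batch would give (1 + 2 + 4 + 4 + 4) / 5 = 3.0.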

What still needs to be done

  • [ ] Some new tests are probably needed. I first want to check which existing tests fail, to identify potential gaps in test coverage.
  • [ ] There is a "# TODO does this also need fixing?" in MegatronBertModel, as I'm not familiar with the Sentence Order Prediction loss. It would be helpful if someone could double-check it, or at least tell me the shapes of sop_logits and sentence_order so I can figure it out.
  • [ ] Check whether the handling of context_parallel_size > 1 is correct in MegatronGPTModel

Other breaking changes

While fixing this problem I ran into some related issues that forced me to make some design choices leading to "collateral" breaking changes:

  1. The validation loss in MegatronGPTModel with validation_drop_last=False was inconsistent with drop_last=True: it was computing a per-token average across the global batch (at least I believe that's what's happening here), while with drop_last=True it was using the (incorrect) per-token average across the micro-batch. I changed it so that the behavior is now a consistent per-sample average across valid samples.

  2. The loss could be NaN when all samples in a micro-batch were fully masked (which typically would only happen with micro_batch_size=1). This also led to inconsistency between different micro-batch sizes. I changed it so that there is no NaN anymore; instead, fully masked samples have a zero loss. The alternative would be to trigger a NaN as soon as one sample is fully masked, but I decided against it because (i) in order to handle validation with validation_drop_last=False we would need to make the loss function depend on this flag, which would make the code more complex (and potentially inconsistent in the presence of fully masked samples in the validation set), and (ii) NaN losses due to fully masked samples are a recurring pain point currently (they force users to either go through a cumbersome model-dependent data filtering step, or hack the NeMo codebase to work around them).

  3. (Follow-up to 2, split out for readability) That being said, I think that silently ignoring some samples is also potentially dangerous, so I added a warning when the loss mask is all zeros for some samples. I decided to do it at the dataset level rather than in the model so as to avoid forcing a CPU-GPU sync point on such a check. We could add a flag in the datasets to control whether this is an error or a warning, but I thought I'd get some feedback first.
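
The dataset-level check in point 3 could look roughly like the sketch below (function name and message are illustrative, not NeMo's actual API; the point is that the check runs on CPU-side data, so the model's GPU loss computation never needs a host sync just to detect fully masked samples):

```python
import warnings

def validate_loss_mask(loss_mask, sample_idx):
    """Warn if a sample's loss mask is all zeros.

    Intended to run in the dataset (on CPU), where no GPU
    synchronization is required to inspect the mask.
    """
    if not any(loss_mask):
        warnings.warn(
            f"Sample {sample_idx} has an all-zero loss mask; "
            "it will contribute a zero loss during training."
        )
```

A flag on the dataset could later promote this warning to a hard error, per the open question above.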

Before your PR is "Ready for review"

Pre checks:

  • [X] Make sure you read and followed Contributor guidelines
  • [ ] Did you write any new necessary tests?
  • [ ] Did you add or update any necessary documentation?
  • [ ] Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • [ ] Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • [ ] New Feature
  • [X] Bugfix
  • [ ] Documentation

Who can review?

Anyone in the community is free to review the PR once the checks have passed. The Contributor guidelines list specific people who can review PRs to various areas.

Additional Information

Fixes #8343

Minor remark: I removed some "# TODO: add nemo version here" comments that seem to date back several years and no longer make sense to me.

odelalleau avatar Feb 06 '24 16:02 odelalleau

jenkins

odelalleau avatar Feb 06 '24 16:02 odelalleau

This PR is stale because it has been open for 14 days with no activity. Remove stale label or comment or update or this will be closed in 7 days.

github-actions[bot] avatar Feb 22 '24 01:02 github-actions[bot]

jenkins

odelalleau avatar Feb 23 '24 19:02 odelalleau

This PR is stale because it has been open for 14 days with no activity. Remove stale label or comment or update or this will be closed in 7 days.

github-actions[bot] avatar Mar 10 '24 01:03 github-actions[bot]

This PR was closed because it has been inactive for 7 days since being marked as stale.

github-actions[bot] avatar Mar 18 '24 01:03 github-actions[bot]