setting sync_dist to true for validation metrics
What does this PR do?
Sets `sync_dist=True` for metrics logged in `on_validation_epoch_end` in `ModelPT.py`.
This is in response to the recent slack discussion on "wiping out of checkpoints mid training and starting from scratch" in Canary training.
Collection: [all] - The change is in ModelPT.py
Changelog
- Set `sync_dist=True` in `on_validation_epoch_end`
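For context, here is a minimal pure-Python sketch of what `sync_dist=True` changes when a validation metric is logged. The real behavior is a distributed all-reduce inside PyTorch Lightning's `self.log`; the function and values below are illustrative stand-ins, not NeMo or Lightning code:

```python
# Illustrative only: simulates Lightning's sync_dist mean reduction
# with a local list instead of a real distributed all-reduce.

def log_val_loss(per_rank_losses, sync_dist):
    """Return the val_loss that rank 0's checkpoint callback would see."""
    if sync_dist:
        # sync_dist=True: all ranks agree on the mean across ranks
        return sum(per_rank_losses) / len(per_rank_losses)
    # sync_dist=False: rank 0 sees only its own local value
    return per_rank_losses[0]

# Ranks saw different validation shards, so local losses differ.
losses = [1.0, 0.5, 0.75, 0.75]
print(log_val_loss(losses, sync_dist=False))  # 1.0  (rank-local value)
print(log_val_loss(losses, sync_dist=True))   # 0.75 (global mean)
```

Checkpoint callbacks that rank on `val_loss` (e.g. save-top-k) compare the logged value across epochs; if each rank logs a different local value, checkpoint rankings can disagree between ranks, which matches the checkpoint-wiping symptom that motivated this PR.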
GitHub Actions CI
The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.
The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR. To re-run CI remove and add the label again. To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".
Before your PR is "Ready for review"
Pre checks:
- [ ] Make sure you read and followed Contributor guidelines
- [ ] Did you write any new necessary tests?
- [ ] Did you add or update any necessary documentation?
- [ ] Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
- [ ] Reviewer: Does the PR have correct import guards for all optional libraries?
PR Type:
- [ ] New Feature
- [X] Bugfix
- [ ] Documentation
If you haven't finished some of the above items, you can still open a "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed. The Contributor guidelines contain specific people who can review PRs to various areas.
Additional Information
- Related to # (issue)
Looks good.
This can cause hangs, but for now let's try it.
@titu1994 can you elaborate? Do you expect an uneven number of validation batches per node (and if so, why), or something else?
Yes, the reason to avoid sync is twofold. First, I think I remember random hangs years ago due to a similar flag; I don't know if that's still an issue now. Second, it causes a global sync across all ranks on every single step, which causes a large slowdown once you scale up nodes.
This is on validation epoch end, so it is only a single sync per validation loop. That should be OK both hang-wise and speed-wise.
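The trade-off being discussed can be made concrete with a toy count of collective operations. The numbers and the one-all-reduce-per-synced-log assumption are simplifications for illustration:

```python
# Illustrative count of cross-rank syncs per validation loop,
# assuming one all-reduce per self.log(..., sync_dist=True) call.

def syncs_per_validation(num_batches, sync_every_step):
    if sync_every_step:
        # Logging with sync_dist=True inside validation_step:
        # one collective per batch.
        return num_batches
    # Logging with sync_dist=True only in on_validation_epoch_end:
    # one collective per validation loop.
    return 1

print(syncs_per_validation(1000, sync_every_step=True))   # 1000
print(syncs_per_validation(1000, sync_every_step=False))  # 1
```

This is why a per-step sync can dominate at large node counts, while a single epoch-end sync stays cheap regardless of the number of validation batches.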
Makes sense
Another note, from my experience: sync_dist=True is tricky with parallelisms, especially pipeline parallelism. If some metrics are non-existent on some ranks (e.g. val_loss exists only on the last PP stage, so we have to sync it), this can cause a hang. If the metric is 0.0 on some ranks, the mean reduction over ranks can be wrong or not produce the intended value.
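The zero-filled-metric failure mode described above can be shown with a toy computation; the rank count and values are made up for illustration:

```python
# Toy example: 4 pipeline-parallel ranks, but val_loss is only computed
# on the last stage. If missing ranks contribute 0.0, a mean reduction
# over all ranks is skewed toward zero.

# Per-rank val_loss; None means the metric was not computed on that rank.
per_rank = [None, None, None, 0.8]

# Naive: fill missing values with 0.0 and average over every rank.
naive = sum(v if v is not None else 0.0 for v in per_rank) / len(per_rank)

# Correct: average only over ranks that actually have the metric.
present = [v for v in per_rank if v is not None]
correct = sum(present) / len(present)

print(naive)    # 0.2 -- skewed toward zero
print(correct)  # 0.8
```

A blanket `sync_dist=True` performs the naive reduction; handling this properly requires knowing which ranks hold the metric (or syncing over a restricted process group).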
This PR is stale because it has been open for 14 days with no activity. Remove the stale label, leave a comment, or update the PR, or it will be closed in 7 days.
This PR was closed because it has been inactive for 7 days since being marked as stale.
@krishnacpuvvada @titu1994 what's the verdict on this one? Do we need a Canary-specific mechanism to synchronize the validation metrics if it can't be done at the modelPT level?
(1) Don't push this to modelPT. (2) If Canary needs it, it can override this function specifically and copy paste the code (for now).
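A hedged sketch of option (2), overriding at the model level rather than changing ModelPT. The class and method names mirror the discussion, but the bodies are pure-Python stand-ins, not actual NeMo code:

```python
# Stand-in stubs: the real ModelPT and Canary model classes live in NeMo,
# and the real self.log is provided by PyTorch Lightning.

class ModelPT:
    def log(self, name, value, sync_dist=False):
        # Logging stub that just records the sync_dist choice.
        self.logged = (name, value, sync_dist)

    def on_validation_epoch_end(self):
        # Base behavior left untouched: no cross-rank sync.
        self.log("val_loss", 0.7, sync_dist=False)

class CanaryModel(ModelPT):
    def on_validation_epoch_end(self):
        # Model-specific override: a copy of the base logic, but with
        # sync_dist=True so validation metrics agree across ranks.
        self.log("val_loss", 0.7, sync_dist=True)

m = CanaryModel()
m.on_validation_epoch_end()
print(m.logged)  # ('val_loss', 0.7, True)
```

The override keeps the sync confined to the one model family that needs it, at the cost of duplicating the epoch-end logic until a shared mechanism exists.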
Do note: @nithinraok plans to remove ModelPT-level support for multiple validation and test data loaders, which basically means every model will have to handle this manually anyway.