setting sync_dist to true for validation metrics
What does this PR do?
Sets `sync_dist=True` for metrics logged in `on_validation_epoch_end` in `ModelPT.py`.
This is in response to the recent slack discussion on "wiping out of checkpoints mid training and starting from scratch" in Canary training.
Collection: [all] - The change is in ModelPT.py
Changelog
- Set `sync_dist=True` in `on_validation_epoch_end`
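For context, here is a minimal pure-Python sketch of what `sync_dist=True` changes when a validation metric is logged. The real behavior is a distributed all-reduce inside PyTorch Lightning's `self.log`; the function and values below are illustrative stand-ins, not NeMo or Lightning code:

```python
# Illustrative only: simulates Lightning's sync_dist mean reduction
# with a local list instead of a real distributed all-reduce.

def log_val_loss(per_rank_losses, sync_dist):
    """Return the val_loss that rank 0's checkpoint callback would see."""
    if sync_dist:
        # sync_dist=True: all ranks agree on the mean across ranks
        return sum(per_rank_losses) / len(per_rank_losses)
    # sync_dist=False: rank 0 sees only its own local value
    return per_rank_losses[0]

# Ranks saw different validation shards, so local losses differ.
losses = [1.0, 0.5, 0.75, 0.75]
print(log_val_loss(losses, sync_dist=False))  # 1.0  (rank-local value)
print(log_val_loss(losses, sync_dist=True))   # 0.75 (global mean)
```

Checkpoint callbacks that rank on `val_loss` (e.g. save-top-k) compare the logged value across epochs; if each rank logs a different local value, checkpoint rankings can disagree between ranks, which matches the checkpoint-wiping symptom that motivated this PR.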
GitHub Actions CI
The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.
The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR. To re-run CI remove and add the label again. To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".
Before your PR is "Ready for review"
Pre checks:
- [ ] Make sure you read and followed Contributor guidelines
- [ ] Did you write any new necessary tests?
- [ ] Did you add or update any necessary documentation?
- [ ] Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
- [ ] Reviewer: Does the PR have correct import guards for all optional libraries?
PR Type:
- [ ] New Feature
- [X] Bugfix
- [ ] Documentation
If you haven't finished some of the above items, you can still open a "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed. The Contributor guidelines contain specific people who can review PRs to various areas.
Additional Information
- Related to # (issue)
Looks good.
This can cause hangs, but for now let's try it.
@titu1994 can you elaborate? Do you expect an uneven number of validation batches per node (and if so, why), or something else?
Yes, the reason to avoid sync is twofold. First, I think I remember random hangs years ago due to a similar flag; I don't know if that's still an issue now. Second, it causes a global sync across all ranks on every single step, which causes a large slowdown once you scale up nodes.
This is on validation epoch end, so it is only a single sync per validation loop. That should be OK both hang-wise and speed-wise.
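The trade-off being discussed can be made concrete with a toy count of collective operations. The numbers and the one-all-reduce-per-synced-log assumption are simplifications for illustration:

```python
# Illustrative count of cross-rank syncs per validation loop,
# assuming one all-reduce per self.log(..., sync_dist=True) call.

def syncs_per_validation(num_batches, sync_every_step):
    if sync_every_step:
        # Logging with sync_dist=True inside validation_step:
        # one collective per batch.
        return num_batches
    # Logging with sync_dist=True only in on_validation_epoch_end:
    # one collective per validation loop.
    return 1

print(syncs_per_validation(1000, sync_every_step=True))   # 1000
print(syncs_per_validation(1000, sync_every_step=False))  # 1
```

This is why a per-step sync can dominate at large node counts, while a single epoch-end sync stays cheap regardless of the number of validation batches.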
Makes sense
Another note, from my experience: sync_dist=True is tricky with parallelisms, especially pipeline parallelism. If some metrics are non-existent on some ranks (e.g. val_loss exists only on the last PP stage, so we have to sync it), this can cause a hang. If the metric is 0.0 on some ranks, the mean reduction over ranks can be wrong or not produce the intended value.
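The zero-filled-metric failure mode described above can be shown with a toy computation; the rank count and values are made up for illustration:

```python
# Toy example: 4 pipeline-parallel ranks, but val_loss is only computed
# on the last stage. If missing ranks contribute 0.0, a mean reduction
# over all ranks is skewed toward zero.

# Per-rank val_loss; None means the metric was not computed on that rank.
per_rank = [None, None, None, 0.8]

# Naive: fill missing values with 0.0 and average over every rank.
naive = sum(v if v is not None else 0.0 for v in per_rank) / len(per_rank)

# Correct: average only over ranks that actually have the metric.
present = [v for v in per_rank if v is not None]
correct = sum(present) / len(present)

print(naive)    # 0.2 -- skewed toward zero
print(correct)  # 0.8
```

A blanket `sync_dist=True` performs the naive reduction; handling this properly requires knowing which ranks hold the metric (or syncing over a restricted process group).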
This PR is stale because it has been open for 14 days with no activity. Remove the stale label, leave a comment, or update the PR, or it will be closed in 7 days.
This PR was closed because it has been inactive for 7 days since being marked as stale.
@krishnacpuvvada @titu1994 what's the verdict on this one? Do we need a Canary-specific mechanism to synchronize the validation metrics if it can't be done at the modelPT level?
(1) Don't push this to modelPT. (2) If Canary needs it, it can override this function specifically and copy paste the code (for now).
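A hedged sketch of option (2), overriding at the model level rather than changing ModelPT. The class and method names mirror the discussion, but the bodies are pure-Python stand-ins, not actual NeMo code:

```python
# Stand-in stubs: the real ModelPT and Canary model classes live in NeMo,
# and the real self.log is provided by PyTorch Lightning.

class ModelPT:
    def log(self, name, value, sync_dist=False):
        # Logging stub that just records the sync_dist choice.
        self.logged = (name, value, sync_dist)

    def on_validation_epoch_end(self):
        # Base behavior left untouched: no cross-rank sync.
        self.log("val_loss", 0.7, sync_dist=False)

class CanaryModel(ModelPT):
    def on_validation_epoch_end(self):
        # Model-specific override: a copy of the base logic, but with
        # sync_dist=True so validation metrics agree across ranks.
        self.log("val_loss", 0.7, sync_dist=True)

m = CanaryModel()
m.on_validation_epoch_end()
print(m.logged)  # ('val_loss', 0.7, True)
```

The override keeps the sync confined to the one model family that needs it, at the cost of duplicating the epoch-end logic until a shared mechanism exists.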
Do note: @nithinraok plans to remove ModelPT-level support for multiple validation and test data loaders, which basically means every model will have to handle this manually anyway.