NeMo
Handle float limit_val_batches
What does this PR do?
- When the user passes a float `limit_val_batches`, the PR casts it to an equivalent int value and makes sure the cast `limit_val_batches` is a multiple of the number of micro batches, as required by PTL >= 2.0.
- Calls `self._reconfigure_val_batches()` after the datasets are set up, so that the original value of `limit_val_batches` is used to build the dataset and not the reconfigured value.
- Returns `len(dataloader)` in terms of the number of micro batches instead of the number of global batches. This is required for two reasons:
  - a) Since `limit_val_batches` is reconfigured to be in terms of micro batches, a `len(dataloader)` expressed in global batches can lead to situations where `len(dataloader) < num_micro_batches`. One of the ranks can then hit `StopIteration` in the middle of a global batch, which causes a hang under pipeline parallelism (PP) if a different rank is waiting for the output of the first rank.
  - b) Ideally, `len(dataloader)` should be returned at the granularity of the batch size in which data is fetched from `dataloader_iter`. Since Megatron models fetch one micro batch each time `next(dataloader_iter)` is called, `len(dataloader)` should be returned in the same unit. Also, PTL's progress bar increments the epoch number once the number of batches extracted reaches `len(dataloader)`. So if `len(dataloader)` is x in terms of global batches, the epoch is incorrectly incremented after x micro batches are extracted, even though the dataloader still has micro batches left and is not empty. This can be very misleading to end users.
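The arithmetic described above can be illustrated with a minimal sketch. This is not the code from the PR: the function names `dataloader_len_in_micro_batches` and `float_limit_to_micro_batches` are hypothetical, and the choices to round down to a multiple of the micro batch count and to keep at least one full global batch for a non-zero fraction are assumptions made for illustration.

```python
def dataloader_len_in_micro_batches(num_global_batches: int, num_micro_batches: int) -> int:
    """Length of the dataloader counted in micro batches (hypothetical helper).

    Each global batch is consumed as `num_micro_batches` calls to
    next(dataloader_iter), so the length is reported in that unit.
    """
    return num_global_batches * num_micro_batches


def float_limit_to_micro_batches(
    limit_val_batches: float,
    val_len_in_micro_batches: int,
    num_micro_batches: int,
) -> int:
    """Cast a float limit_val_batches to an int count of micro batches.

    The result is always a multiple of `num_micro_batches`, so validation
    never stops in the middle of a global batch. Hypothetical helper; the
    rounding policy below is an assumption, not the PR's exact behaviour.
    """
    if not 0.0 <= limit_val_batches <= 1.0:
        raise ValueError("a float limit_val_batches must lie in [0.0, 1.0]")
    if val_len_in_micro_batches % num_micro_batches != 0:
        raise ValueError("dataloader length must be a whole number of global batches")
    if limit_val_batches == 0.0 or val_len_in_micro_batches == 0:
        return 0  # validation disabled or nothing to validate on

    # Fraction of the validation set, counted in micro batches.
    limit = int(val_len_in_micro_batches * limit_val_batches)
    # Round down to a whole number of global batches.
    limit -= limit % num_micro_batches
    # Assumption: a non-zero fraction keeps at least one full global batch.
    return max(limit, num_micro_batches)


if __name__ == "__main__":
    n_micro = 4
    val_len = dataloader_len_in_micro_batches(num_global_batches=10, num_micro_batches=n_micro)
    for fraction in (0.0, 0.05, 0.33, 0.5, 1.0):
        print(fraction, float_limit_to_micro_batches(fraction, val_len, n_micro))
```

With 10 global batches of 4 micro batches each, the dataloader length is 40 micro batches, and every limit the sketch returns is a multiple of 4, so all ranks stop on a global batch boundary.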
Add a one line overview of what this PR aims to accomplish.
Collection: [Note which collection this PR will affect]
Changelog
- Add specific line by line info of high level changes in this PR.
Usage
- You can potentially add a usage example below
# Add a code snippet demonstrating how to use this
Jenkins CI
To run Jenkins, a NeMo user with write access must comment `jenkins` on the PR.
Before your PR is "Ready for review"
Pre checks:
- [ ] Make sure you read and followed Contributor guidelines
- [ ] Did you write any new necessary tests?
- [ ] Did you add or update any necessary documentation?
- [ ] Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
- [ ] Reviewer: Does the PR have correct import guards for all optional libraries?
PR Type:
- [ ] New Feature
- [ ] Bugfix
- [ ] Documentation
If you haven't finished some of the above items, you can still open a "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed. The Contributor guidelines list specific people who can review PRs in various areas.
Additional Information
- Related to # (issue)
jenkins
LGTM