
Handle float limit_val_batches

Open athitten opened this issue 1 year ago • 10 comments

What does this PR do?

  1. When the user passes a float limit_val_batches, the PR casts it to an equivalent int value and ensures that the cast value is a multiple of the number of micro batches, as required by PTL >= 2.0.

  2. Calls self._reconfigure_val_batches() after the datasets are set up, so that the original value of limit_val_batches, not the reconfigured one, is used to build the dataset.

  3. Returns len(dataloader) as a number of micro batches instead of a number of global batches. This is required for two reasons:

  • a) Since limit_val_batches is reconfigured to be in terms of micro batches, a len(dataloader) expressed in global batches can end up smaller than num_micro_batches. One rank can then hit StopIteration partway through a global batch, and with PP this leads to a hang if a different rank is waiting for the output of that rank.

  • b) Ideally, len(dataloader) should be expressed at the same granularity as the batches fetched from dataloader_iter. Since megatron models fetch one micro batch each time next(dataloader_iter) is called, len(dataloader) should be returned in the same unit. Also, PTL's progress bar increments the epoch number once the number of batches extracted reaches len(dataloader). So if len(dataloader) is x in terms of global batches, the epoch is incorrectly incremented after x micro batches are extracted, even though the dataloader still has micro batches left and is not empty. This can be very misleading to end users.
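The arithmetic behind points 1 and 3 can be sketched as below. This is an illustrative sketch only, not the code in this PR: the function names `reconfigure_limit_val_batches` and `dataloader_len_in_micro_batches`, their parameters, and the choice to round down to whole global batches while keeping at least one are all assumptions made for the example.

```python
def dataloader_len_in_micro_batches(num_samples, global_batch_size, num_micro_batches):
    """Length of the dataloader in micro batches (hypothetical helper).

    Each full global batch is consumed as num_micro_batches calls to
    next(dataloader_iter), so the length is reported in that unit.
    """
    num_global_batches = num_samples // global_batch_size
    return num_global_batches * num_micro_batches


def reconfigure_limit_val_batches(limit_val_batches, dataloader_len_micro, num_micro_batches):
    """Express limit_val_batches as an int number of micro batches (hypothetical helper).

    Assumption: an int counts global batches, a float is a fraction of the
    validation dataloader. The result is always a multiple of num_micro_batches.
    """
    if isinstance(limit_val_batches, int):
        # An int counts global batches; convert it to micro batches.
        return limit_val_batches * num_micro_batches
    if limit_val_batches == 0.0:
        # Validation is disabled; nothing to reconfigure.
        return 0
    limit_micro = int(dataloader_len_micro * limit_val_batches)
    # Round down to a whole number of global batches ...
    limit_micro -= limit_micro % num_micro_batches
    # ... but never below one global batch (an assumption of this sketch).
    return max(limit_micro, num_micro_batches)


# 320 validation samples, global batch size 32, 4 micro batches per global batch.
val_len = dataloader_len_in_micro_batches(320, 32, 4)
print(val_len)
print(reconfigure_limit_val_batches(0.25, val_len, 4))
print(reconfigure_limit_val_batches(2, val_len, 4))
```

Because both the limit and the dataloader length are in micro batches and the limit is a multiple of num_micro_batches, every rank stops on a global-batch boundary rather than partway through one.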

Add a one line overview of what this PR aims to accomplish.

Collection: [Note which collection this PR will affect]

Changelog

  • Add specific line by line info of high level changes in this PR.

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this 

Jenkins CI

To run Jenkins, a NeMo User with write access must comment jenkins on the PR.

Before your PR is "Ready for review"

Pre checks:

  • [ ] Make sure you read and followed Contributor guidelines
  • [ ] Did you write any new necessary tests?
  • [ ] Did you add or update any necessary documentation?
  • [ ] Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • [ ] Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • [ ] New Feature
  • [ ] Bugfix
  • [ ] Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed. Contributor guidelines contains specific people who can review PRs to various areas.

Additional Information

  • Related to # (issue)

athitten avatar Feb 14 '24 23:02 athitten

jenkins

athitten avatar Feb 14 '24 23:02 athitten

jenkins

athitten avatar Feb 14 '24 23:02 athitten

jenkins

athitten avatar Feb 15 '24 00:02 athitten

jenkins

athitten avatar Feb 15 '24 00:02 athitten

jenkins

athitten avatar Feb 15 '24 02:02 athitten

jenkins

athitten avatar Feb 15 '24 07:02 athitten

jenkins

athitten avatar Feb 16 '24 01:02 athitten

jenkins

athitten avatar Feb 16 '24 02:02 athitten

jenkins

athitten avatar Feb 16 '24 23:02 athitten

jenkins

athitten avatar Feb 16 '24 23:02 athitten

jenkins

athitten avatar Feb 20 '24 03:02 athitten

jenkins

athitten avatar Feb 20 '24 04:02 athitten

jenkins

athitten avatar Feb 21 '24 03:02 athitten

jenkins

athitten avatar Feb 22 '24 19:02 athitten

jenkins

athitten avatar Feb 22 '24 19:02 athitten

jenkins

athitten avatar Feb 22 '24 19:02 athitten

jenkins

athitten avatar Feb 22 '24 19:02 athitten

jenkins

athitten avatar Feb 22 '24 22:02 athitten

jenkins

athitten avatar Feb 22 '24 23:02 athitten

jenkins

athitten avatar Feb 23 '24 00:02 athitten

LGTM

jbaczek avatar Feb 23 '24 13:02 jbaczek

jenkins

jbaczek avatar Feb 23 '24 14:02 jbaczek