Fix seeding in neural machine translation examples

sarthmit opened this pull request 3 years ago • 10 comments

Signed-off-by: sarthmit [email protected]

What does this PR do ?

Fixes the seed for everything (training, initialization, etc.) during training of the NLP machine translation systems, to aid reproducibility.

Collection: examples/nlp/machine_translation

Changelog

  • Use pytorch_lightning's seed_everything function to fix the seed within each machine translation training file in the nlp collection.
  • Update the config files to use a default seed of 0.

Usage

  • Usage is unchanged: the seed can be overridden by adding seed=[seed] on the command line to replace the default of 0 in the configs (see the sketch below).
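For illustration, a minimal sketch of the intended wiring, following the usual NeMo example-script pattern (the entry point and config name here are assumptions for illustration, not the exact PR diff):

```python
import pytorch_lightning as pl
from omegaconf import DictConfig

from nemo.core.config import hydra_runner


@hydra_runner(config_path="conf", config_name="aayn_base")
def main(cfg: DictConfig) -> None:
    # Seed Python, NumPy, and PyTorch RNGs in one call; workers=True
    # also seeds DataLoader worker processes.
    pl.seed_everything(cfg.seed, workers=True)
    ...  # build trainer, exp_manager, and model as before


if __name__ == "__main__":
    main()
```

With this in place, a run like python enc_dec_nmt.py seed=42 overrides the default of 0 from the config.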

Pre checks:

  • [x] Make sure you read and followed Contributor guidelines
  • [x] Did you write any new necessary tests?
  • [x] Did you add or update any necessary documentation?
  • [x] Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
  • [ ] Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • [x] New Feature
  • [ ] Bugfix
  • [ ] Documentation

Who can review?

@MaximumEntropy @okuchaiev

sarthmit avatar Jul 16 '22 19:07 sarthmit

Explicit seeding for NeMo NMT is actually a bad idea: assuming you don't finish an epoch before your job ends, you will end up seeing the same examples again after resuming, because re-seeding on restart replays the same shuffle order (illustrated below).
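A toy illustration of the failure mode (hypothetical code, not the PR itself): re-seeding on restart replays the epoch's shuffle order from the top instead of continuing where the sampler left off.

```python
import torch


def epoch_order(seed: int) -> torch.Tensor:
    # Shuffle a 10-example "dataset" the way a seeded sampler would.
    g = torch.Generator().manual_seed(seed)
    return torch.randperm(10, generator=g)


first = epoch_order(seed=0)    # job dies after consuming first[:5]
resumed = epoch_order(seed=0)  # restart re-seeds with the same value
assert torch.equal(first, resumed)  # the first 5 examples are seen again
```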

MaximumEntropy avatar Jul 18 '22 19:07 MaximumEntropy

You are right! Should I seed only if it is the start of training, and otherwise not?

sarthmit avatar Jul 18 '22 20:07 sarthmit

Yes, that is an option. But I think trainer.global_step or the is_being_restored flag only becomes available after trainer.fit(), so you'd need to seed inside the model instead of in the training script. Another option is to set seed: null by default in the config and seed only if the user explicitly sets one (see the sketch below).
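A sketch of that second option, assuming seed: null as the config default (this reflects the suggestion above, not the final diff):

```python
import pytorch_lightning as pl

# Inside main(cfg) from the earlier sketch. The config ships with
# `seed: null`; seed only when the user opts in, e.g.
# `python enc_dec_nmt.py seed=42`.
if cfg.get("seed") is not None:
    pl.seed_everything(cfg.seed, workers=True)
```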

MaximumEntropy avatar Jul 18 '22 20:07 MaximumEntropy

Ahh got it. Pushed code to make the seed optional instead, with the default being no seeding.

sarthmit avatar Jul 18 '22 21:07 sarthmit

Thanks for fixing that! Can we remove seeding for Megatron NMT? In Megatron NMT, if you use text_memmap or bin_memmap datasets, the dataloader order is already deterministic and so is model init. You will see a seed (1234) already present in cfg.model in aayn_base_megatron.yaml.

MaximumEntropy avatar Jul 19 '22 18:07 MaximumEntropy

Gotcha, done that.

sarthmit avatar Jul 19 '22 23:07 sarthmit

Were you able to figure out how to seed only if the model isn't being restored?

MaximumEntropy avatar Jul 21 '22 23:07 MaximumEntropy

So we generally have: Trainer creation --> MTDataPreproc --> experiment manager --> model creation --> training. Is there a particular reason why the experiment manager is called after trainer creation and data preprocessing? Do those two have any randomness that requires seeding?

sarthmit avatar Jul 26 '22 17:07 sarthmit

@sarthmit did you try running a few runs to check whether they all produce the same plots with the same seeds? In addition to Sandeep's point about data order across runs, I'm also not sure we'd be able to reproduce runs, since parts of the pipeline (like dataloading) depend on the OS and concurrency - some example runs would be really useful to check whether this is actually an issue.

aklife97 avatar Aug 02 '22 20:08 aklife97

Hey! I did a few runs locally as well as on draco, and they were reproducible with the seed set. Edit: Oops, closed the request by mistake.

sarthmit avatar Aug 04 '22 18:08 sarthmit

This PR was closed because it has been inactive for 7 days since being marked as stale.

github-actions[bot] avatar Oct 11 '22 02:10 github-actions[bot]