
Canary refactor for Riva

Open tbartley94 opened this issue 1 year ago • 3 comments

What does this PR do ?

  • Refactors Canary code for Riva training

Collection: [ASR]

Changelog

  • Expands Canary coverage for more languages. (Just keep the list of ISO and BCP tags updated in language_code.py.)
  • Introduces a BLEU metric for in-training monitoring of performance. This replaces the sacrebleu code in the validation pass.
  • Updates the WER metric to be compatible with multitask decoding.
  • ToDo: The WER and BLEU metrics can be combined into a single metric, but that involves moving all WER calls to dictionary outputs. Separate PR.
  • ToDo: The BLEU metric has to be expanded to switch between tokenizers while decoding. Separate PR.
  • Edits the Canary dataloader to support multiple paired translations. Paired text now just needs the text input to be a nested dict: text: {lang1: {text: ..., spl_tokens: ...}, lang2: {text: ..., spl_tokens: ...}, ...}
  • Changes the Canary tokenizer to create the spl_tokens class by default.
  • General refactoring of the multitask model. Validation and test steps are passed off to the ASR model and are more consistent in design with the CTC models.
  • Changes multitask_decoding in submodules to strip tokens by default. (This avoids an extra flag in the metrics logic.)
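
For illustration, the nested text layout from the dataloader item could look roughly like this. This is only a sketch: the language keys, sample strings, special-token strings, the audio_filepath field, and the select_translation helper are all assumptions made for the example, and spl_tokens is assumed to be a plain string.

```python
from typing import Any, Dict, Tuple

# Hypothetical manifest entry: one audio file with paired text in two languages.
# All values below (file name, sentences, token strings) are placeholders.
entry: Dict[str, Any] = {
    "audio_filepath": "sample.wav",
    "text": {
        "en": {"text": "hello world", "spl_tokens": "<|en|><|transcribe|>"},
        "de": {"text": "hallo welt", "spl_tokens": "<|en|><|translate|><|de|>"},
    },
}


def select_translation(entry: Dict[str, Any], lang: str) -> Tuple[str, str]:
    """Return (spl_tokens, text) for one target language of a nested entry.

    Hypothetical helper, not part of the PR.
    """
    paired = entry["text"]
    if lang not in paired:
        raise KeyError(f"no paired text for language {lang!r}")
    item = paired[lang]
    return item["spl_tokens"], item["text"]
```

With this layout, adding another translation target is just one more key under text; nothing else in the entry changes.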

Jenkins CI

To run Jenkins, a NeMo User with write access must comment jenkins on the PR.

Before your PR is "Ready for review"

Pre checks:

  • [Y] Make sure you read and followed Contributor guidelines
  • [N] Did you write any new necessary tests?
  • [N] Did you add or update any necessary documentation? TODO.
  • [N] Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • [ ] Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • [Y] New Feature
  • [Y] Bugfix
  • [Y] Documentation TODO

tbartley94 avatar Feb 07 '24 17:02 tbartley94

jenkins

tbartley94 avatar Feb 09 '24 20:02 tbartley94

jenkins

tbartley94 avatar Feb 09 '24 23:02 tbartley94

@pzelasko @krishnacpuvvada Revisited tokenizer and think we can meet both needs:

There's no reason for the tokenizer to enforce what is being done in the canary prompt format. So let's just generalize task token behavior. We can pass all task tokens through the build method, then search for only formatted prompted tokens in the instantiation phase. This allows the easy lookup while also allowing easy customization for novel setups.

Meanwhile, anything regarding canary prompts in and of themselves can just be managed by the lhotse prompt setup y'all have going on. Come time to expand for more wild prompts, you can just edit that setup while the tokenizer can stay as is.
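
Rough sketch of the idea, not the actual implementation: the class name, the build signature, and the `<|name|>` token format are assumptions for illustration only.

```python
import re
from typing import Dict, Iterable, List

# Assumed prompt-token format for this sketch: "<|name|>".
PROMPT_TOKEN = re.compile(r"^<\|(.+)\|>$")


class TaskTokenizer:
    """Hypothetical tokenizer wrapper; names are illustrative only."""

    def __init__(self, vocab: List[str]):
        self.vocab = vocab
        self.token_to_id: Dict[str, int] = {tok: i for i, tok in enumerate(vocab)}
        # At instantiation, index only the tokens that match the prompt format.
        # Everything else stays in the vocab but gets no task lookup entry.
        self.task_token_ids: Dict[str, int] = {}
        for tok, idx in self.token_to_id.items():
            match = PROMPT_TOKEN.match(tok)
            if match:
                self.task_token_ids[match.group(1)] = idx

    @classmethod
    def build(cls, base_vocab: Iterable[str], task_tokens: Iterable[str]) -> "TaskTokenizer":
        # The build step accepts any task tokens and enforces no prompt format.
        vocab = list(base_vocab)
        for tok in task_tokens:
            if tok not in vocab:
                vocab.append(tok)
        return cls(vocab)
```

So `TaskTokenizer.build(["a", "b"], ["<|en|>", "<|translate|>", "custom"])` keeps all three tokens in the vocab, but only en and translate end up in the task lookup. What a valid prompt looks like stays on the prompt-formatting side.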

tbartley94 avatar Feb 16 '24 01:02 tbartley94

jenkins

tbartley94 avatar Feb 21 '24 23:02 tbartley94

jenkins

tbartley94 avatar Feb 22 '24 00:02 tbartley94

jenkins

tbartley94 avatar Feb 22 '24 01:02 tbartley94

jenkins

tbartley94 avatar Feb 22 '24 01:02 tbartley94

jenkins

tbartley94 avatar Feb 22 '24 02:02 tbartley94

jenkins

krishnacpuvvada avatar Feb 22 '24 08:02 krishnacpuvvada

jenkins

tbartley94 avatar Feb 23 '24 00:02 tbartley94

jenkins

tbartley94 avatar Feb 23 '24 00:02 tbartley94

jenkins

tbartley94 avatar Feb 23 '24 00:02 tbartley94

jenkins

tbartley94 avatar Feb 23 '24 00:02 tbartley94

jenkins

pzelasko avatar Feb 23 '24 14:02 pzelasko

jenkins

tbartley94 avatar Feb 23 '24 16:02 tbartley94

jenkins

tbartley94 avatar Feb 23 '24 17:02 tbartley94

jenkins

tbartley94 avatar Feb 23 '24 17:02 tbartley94

jenkins

tbartley94 avatar Feb 23 '24 17:02 tbartley94

jenkins

tbartley94 avatar Feb 23 '24 20:02 tbartley94

jenkins

tbartley94 avatar Feb 23 '24 21:02 tbartley94

jenkins

tbartley94 avatar Feb 23 '24 21:02 tbartley94

jenkins

tbartley94 avatar Feb 23 '24 23:02 tbartley94

jenkins

tbartley94 avatar Feb 24 '24 00:02 tbartley94