Canary refactor for Riva
What does this PR do?
- Refactors Canary code for Riva training
Collection: [ASR]
Changelog
- Expands Canary coverage to more languages. (Just keep the list of ISO and BCP tags updated in `language_code.py`.)
- Introduces a BLEU metric for in-training monitoring of performance. This replaces the sacrebleu code in the validation pass.
- Updates the WER metric to be compatible with multitask decoding.
- ToDo: The WER and BLEU metrics can be combined into a single metric, but this involves moving all WER calls to dictionary outputs. Separate PR.
- ToDo: The BLEU metric still needs to be expanded to switch between tokenizers while decoding. Separate PR.
- Edits the Canary dataloader for multiple paired translations. Paired text now just needs the text input to be a nested dict of the form `{lang1: {"text": ..., "spl_tokens": ...}, lang2: {"text": ..., "spl_tokens": ...}, ...}`.
- Changes the Canary tokenizer to create the `spl_tokens` class by default.
- General refactoring of the multitask model. Validation and test steps pass off to the ASR model and are more consistent in design with the CTC models.
- Changes `multitask_decoding` in submodules to strip tokens by default. (Avoids an extra flag in the metrics logic.)
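For the in-training BLEU bullet above, the metric follows the usual accumulate-then-compute pattern rather than one-off sacrebleu calls in the validation pass. The sketch below is illustrative only (it is not the NeMo implementation): it uses unigram precision as a stand-in for real BLEU, which also involves higher-order n-gram precisions and a brevity penalty, just to show the `update()`/`compute()` shape of an in-training metric.

```python
class RunningTranslationMetric:
    """Illustrative stand-in for an in-training BLEU metric: accumulate
    statistics per batch with update(), aggregate once with compute().
    (Real BLEU uses n-gram precisions and a brevity penalty; unigram
    precision here just keeps the sketch self-contained.)"""

    def __init__(self):
        self.matched = 0
        self.total = 0

    def update(self, hypotheses, references):
        # Called once per validation batch; only counters are stored,
        # so nothing is recomputed from scratch each step.
        for hyp, ref in zip(hypotheses, references):
            ref_tokens = set(ref.split())
            hyp_tokens = hyp.split()
            self.matched += sum(t in ref_tokens for t in hyp_tokens)
            self.total += len(hyp_tokens)

    def compute(self):
        # Called once at epoch end to get the aggregated score.
        return self.matched / self.total if self.total else 0.0

metric = RunningTranslationMetric()
metric.update(["the cat sat"], ["the cat sat down"])
metric.update(["a dog"], ["the dog"])
score = metric.compute()  # 4 matched tokens out of 5 -> 0.8
```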
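To make the nested paired-text layout from the dataloader bullet concrete, here is a hypothetical record (language codes, text, and special-token strings are invented for illustration; the exact field values come from your manifest):

```python
# Hypothetical paired-translation record in the nested-dict text layout
# described above: one entry per language, each carrying its own text
# and special (task/prompt) tokens.
paired_text = {
    "en": {"text": "hello world", "spl_tokens": ["<|en|>", "<|transcribe|>"]},
    "de": {"text": "hallo welt", "spl_tokens": ["<|de|>", "<|translate|>"]},
}

# With every language entry self-describing, the dataloader can pick any
# source/target pair from a single record.
for lang, entry in paired_text.items():
    assert {"text", "spl_tokens"} <= entry.keys()
```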
Jenkins CI
To run Jenkins, a NeMo User with write access must comment `jenkins` on the PR.
Before your PR is "Ready for review"
Pre checks:
- [Y] Make sure you read and followed Contributor guidelines
- [N] Did you write any new necessary tests?
- [N] Did you add or update any necessary documentation? TODO.
- [N] Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc.)
- [ ] Reviewer: Does the PR have correct import guards for all optional libraries?
PR Type:
- [Y] New Feature
- [Y] Bugfix
- [Y] Documentation (TODO)
jenkins
@pzelasko @krishnacpuvvada Revisited tokenizer and think we can meet both needs:
There's no reason for the tokenizer to enforce what is being done in the Canary prompt format, so let's just generalize task-token behavior. We can pass all task tokens through the build method, then search for only the formatted prompt tokens in the instantiation phase. This allows the easy lookup while also allowing easy customization for novel setups.
Meanwhile, anything regarding the Canary prompts themselves can just be managed by the Lhotse prompt setup y'all have going on. Come time to expand to wilder prompts, you can just edit that setup while the tokenizer stays as is.
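A minimal sketch of the generalized task-token behavior proposed above, under stated assumptions: the class and method names are hypothetical (not the actual NeMo tokenizer API), and "formatted prompt tokens" are assumed to look like `<|task|>`.

```python
import re


class TaskTokenizer:
    """Hypothetical sketch: accept arbitrary task tokens at build time,
    then index only the formatted prompt tokens at instantiation."""

    # Assumption: formatted prompt tokens look like <|transcribe|>.
    PROMPT_TOKEN = re.compile(r"^<\|.+\|>$")

    def __init__(self, task_tokens):
        # Keep every task token passed through the build method...
        self.task_tokens = list(task_tokens)
        # ...but build the fast lookup only over formatted prompt tokens,
        # so novel, unformatted tokens can ride along without special-casing.
        self.prompt_index = {
            tok: i for i, tok in enumerate(self.task_tokens)
            if self.PROMPT_TOKEN.match(tok)
        }

    def prompt_id(self, token):
        return self.prompt_index[token]


tok = TaskTokenizer(["<|transcribe|>", "<|translate|>", "custom_tag"])
```

The point of the split is that prompt handling stays a lookup problem inside the tokenizer, while anything prompt-format-specific lives in the Lhotse prompt setup.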