Canary refactor for Riva
What does this PR do?
- Refactors Canary code for Riva training
Collection: [ASR]
Changelog
- Expands Canary coverage to more languages. (Just keep the list of ISO and BCP tags updated in `language_code.py`.)
- Introduces a BLEU metric for in-training monitoring of performance. This replaces the sacrebleu code in the validation pass.
- Updates the WER metric to be compatible with multitask decoding.
- ToDo: The WER and BLEU metrics can be combined into a single metric, but this involves moving all WER calls to dictionary outputs. Separate PR.
- ToDo: The BLEU metric still needs to be expanded to switch between tokenizers while decoding. Separate PR.
- Edits the Canary dataloader for multiple paired translations. Paired text now just needs the text input to be a nested dict of the form `{lang1: {"text": ..., "spl_tokens": ...}, lang2: {"text": ..., "spl_tokens": ...}, ...}`.
- Changes the Canary tokenizer to create the `spl_tokens` class by default.
- General refactoring of the multitask model. Validation and test steps pass off to the ASR model and are more consistent in design with the CTC models.
- Changes `multitask_decoding` in submodules to strip tokens by default. (Avoids an extra flag in the metrics logic.)
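For the in-training BLEU bullet above, the metric follows the usual accumulate-then-compute pattern rather than one-off sacrebleu calls in the validation pass. The sketch below is illustrative only (it is not the NeMo implementation): it uses unigram precision as a stand-in for real BLEU, which also involves higher-order n-gram precisions and a brevity penalty, just to show the `update()`/`compute()` shape of an in-training metric.

```python
class RunningTranslationMetric:
    """Illustrative stand-in for an in-training BLEU metric: accumulate
    statistics per batch with update(), aggregate once with compute().
    (Real BLEU uses n-gram precisions and a brevity penalty; unigram
    precision here just keeps the sketch self-contained.)"""

    def __init__(self):
        self.matched = 0
        self.total = 0

    def update(self, hypotheses, references):
        # Called once per validation batch; only counters are stored,
        # so nothing is recomputed from scratch each step.
        for hyp, ref in zip(hypotheses, references):
            ref_tokens = set(ref.split())
            hyp_tokens = hyp.split()
            self.matched += sum(t in ref_tokens for t in hyp_tokens)
            self.total += len(hyp_tokens)

    def compute(self):
        # Called once at epoch end to get the aggregated score.
        return self.matched / self.total if self.total else 0.0

metric = RunningTranslationMetric()
metric.update(["the cat sat"], ["the cat sat down"])
metric.update(["a dog"], ["the dog"])
score = metric.compute()  # 4 matched tokens out of 5 -> 0.8
```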
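To make the nested paired-text layout from the dataloader bullet concrete, here is a hypothetical record (language codes, text, and special-token strings are invented for illustration; the exact field values come from your manifest):

```python
# Hypothetical paired-translation record in the nested-dict text layout
# described above: one entry per language, each carrying its own text
# and special (task/prompt) tokens.
paired_text = {
    "en": {"text": "hello world", "spl_tokens": ["<|en|>", "<|transcribe|>"]},
    "de": {"text": "hallo welt", "spl_tokens": ["<|de|>", "<|translate|>"]},
}

# With every language entry self-describing, the dataloader can pick any
# source/target pair from a single record.
for lang, entry in paired_text.items():
    assert {"text", "spl_tokens"} <= entry.keys()
```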
Jenkins CI
To run Jenkins, a NeMo User with write access must comment `jenkins` on the PR.
Before your PR is "Ready for review"
Pre checks:
- [Y] Make sure you read and followed Contributor guidelines
- [N] Did you write any new necessary tests?
- [N] Did you add or update any necessary documentation? TODO.
- [N] Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc.)
- [ ] Reviewer: Does the PR have correct import guards for all optional libraries?
PR Type:
- [Y] New Feature
- [Y] Bugfix
- [Y] Documentation (TODO)
jenkins
@pzelasko @krishnacpuvvada Revisited tokenizer and think we can meet both needs:
There's no reason for the tokenizer to enforce what is being done in the Canary prompt format, so let's just generalize task-token behavior. We can pass all task tokens through the build method, then search for only the formatted prompt tokens in the instantiation phase. This allows the easy lookup while also allowing easy customization for novel setups.
Meanwhile, anything regarding the Canary prompts themselves can just be managed by the Lhotse prompt setup y'all have going on. Come time to expand to wilder prompts, you can just edit that setup while the tokenizer stays as is.
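A minimal sketch of the generalized task-token behavior proposed above, under stated assumptions: the class and method names are hypothetical (not the actual NeMo tokenizer API), and "formatted prompt tokens" are assumed to look like `<|task|>`.

```python
import re


class TaskTokenizer:
    """Hypothetical sketch: accept arbitrary task tokens at build time,
    then index only the formatted prompt tokens at instantiation."""

    # Assumption: formatted prompt tokens look like <|transcribe|>.
    PROMPT_TOKEN = re.compile(r"^<\|.+\|>$")

    def __init__(self, task_tokens):
        # Keep every task token passed through the build method...
        self.task_tokens = list(task_tokens)
        # ...but build the fast lookup only over formatted prompt tokens,
        # so novel, unformatted tokens can ride along without special-casing.
        self.prompt_index = {
            tok: i for i, tok in enumerate(self.task_tokens)
            if self.PROMPT_TOKEN.match(tok)
        }

    def prompt_id(self, token):
        return self.prompt_index[token]


tok = TaskTokenizer(["<|transcribe|>", "<|translate|>", "custom_tag"])
```

The point of the split is that prompt handling stays a lookup problem inside the tokenizer, while anything prompt-format-specific lives in the Lhotse prompt setup.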