fairseq
Fix MuST-C data preprocessing
Before submitting
- [ ] Was this discussed/approved via a GitHub issue? (no need for typos, doc improvements)
- [x] Did you read the contributor guideline?
- [ ] Did you make sure to update the docs?
- [ ] Did you write any new necessary tests?
What does this PR do?
- Fixed the calculation of input statistics. The default `--gcmvn-max-num` was too small, so the statistics were not calculated over all training samples.
- Supported waveform inputs with global CMVN.
- Fixed data filtering for dev and test sets. Previously, both dev and test sets were also filtered based on the input lengths (max=3000, min=5).
- Added a `src_text` column to manifests, which would be helpful for joint ASR+ST training.
- Supported waveform segmentation, which is used by SimulEval for streaming ASR.
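For reviewers, here is a minimal sketch of the two behavioural fixes above: global CMVN statistics with an optional cap on the number of utterances, and length filtering applied to the training split only. This is not the code in this PR. The function names, the `max_num` parameter (standing in for `--gcmvn-max-num`), the assumption that the cap simply truncates the utterance list, the assumption that the 5/3000 limits are counted in frames, and the `train` split prefix are all illustrative.

```python
import math
from typing import List, Optional, Sequence, Tuple

# One utterance is a sequence of frames; each frame is a sequence of floats.
Utterance = Sequence[Sequence[float]]


def global_cmvn_stats(
    utterances: Sequence[Utterance], max_num: Optional[int] = None
) -> Tuple[List[float], List[float]]:
    """Per-dimension mean and std over all frames of up to `max_num` utterances.

    `max_num=None` uses every utterance. A cap that is too small means the
    statistics only reflect a subset of the training data.
    """
    selected = utterances if max_num is None else utterances[:max_num]
    dim = len(selected[0][0])
    total = [0.0] * dim
    total_sq = [0.0] * dim
    n_frames = 0
    for utt in selected:
        for frame in utt:
            for d, x in enumerate(frame):
                total[d] += x
                total_sq[d] += x * x
            n_frames += 1
    mean = [s / n_frames for s in total]
    # Clamp at zero to guard against tiny negative values from rounding.
    std = [
        math.sqrt(max(sq / n_frames - m * m, 0.0))
        for sq, m in zip(total_sq, mean)
    ]
    return mean, std


def keep_sample(
    split: str, n_frames: int, min_frames: int = 5, max_frames: int = 3000
) -> bool:
    """Apply the length filter to training splits only (hypothetical helper).

    Dev and test samples are always kept so evaluation sets stay complete.
    """
    if not split.startswith("train"):
        return True
    return min_frames <= n_frames <= max_frames


if __name__ == "__main__":
    utts = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0]]]
    print(global_cmvn_stats(utts))             # statistics over all utterances
    print(global_cmvn_stats(utts, max_num=1))  # statistics over the first one only
    print(keep_sample("train", 10000), keep_sample("dev", 10000))
```

Comparing the capped and uncapped calls shows how a small cap shifts the mean and standard deviation away from the values for the full training set.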
PR review
Anyone in the community is free to review the PR once the tests have passed. If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.
Did you have fun?
Make sure you had fun coding 🙃