Fix Must-C data preprocessing

Open hirofumi0810 opened this issue 1 year ago • 0 comments

Before submitting

  • [ ] Was this discussed/approved via a GitHub issue? (no need for typos, doc improvements)
  • [x] Did you read the contributor guideline?
  • [ ] Did you make sure to update the docs?
  • [ ] Did you write any new necessary tests?

What does this PR do?

  • Fixed the calculation of input statistics. The default --gcmvn-max-num was too small, so the statistics were not computed over the whole training set.
  • Added support for waveform inputs combined with global CMVN.
  • Fixed data filtering for the dev and test sets. Previously, the dev and test sets were also filtered by input length (max=3000, min=5).
  • Added a src_text column to the manifests, which should help with joint ASR+ST training.
  • Added support for waveform segmentation, which is used with SimulEval for streaming ASR.
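
To illustrate the first and third points, here is a minimal sketch in plain Python. It is not the code from this PR or from fairseq: the function names `global_cmvn` and `keep_sample`, the `max_num` argument (standing in for --gcmvn-max-num), and the inclusive length bounds are all assumptions made for illustration. It shows how capping the number of utterances changes the global mean and standard deviation, and how length filtering can be restricted to training splits.

```python
import math


def global_cmvn(utterances, max_num=None):
    """Per-dimension (mean, std) over all frames of the utterances used.

    `utterances` is a list of utterances, each a list of feature frames
    (lists of floats). `max_num` is a hypothetical cap that mimics a
    too-small --gcmvn-max-num; None means use every utterance.
    """
    used = utterances if max_num is None else utterances[:max_num]
    dim = len(used[0][0])
    total = [0.0] * dim
    total_sq = [0.0] * dim
    n_frames = 0
    for frames in used:
        for frame in frames:
            for d, x in enumerate(frame):
                total[d] += x
                total_sq[d] += x * x
            n_frames += 1
    mean = [s / n_frames for s in total]
    # Clamp at zero to guard against tiny negative values from rounding.
    std = [
        math.sqrt(max(sq / n_frames - m * m, 0.0))
        for sq, m in zip(total_sq, mean)
    ]
    return mean, std


def keep_sample(split, n_frames, min_frames=5, max_frames=3000):
    """Apply the length filter to training splits only (assumed behavior).

    Treating the bounds as inclusive is an assumption; the PR text only
    gives max=3000 and min=5.
    """
    if not split.startswith("train"):
        return True  # dev/test samples are never dropped
    return min_frames <= n_frames <= max_frames


# Two toy utterances with 2-dimensional features.
utts = [
    [[1.0, 1.0], [1.0, 1.0]],
    [[3.0, 3.0], [3.0, 3.0]],
]
mean_all, std_all = global_cmvn(utts)
mean_capped, std_capped = global_cmvn(utts, max_num=1)
```

With the cap set to one utterance, the statistics reflect only the first utterance, which is the kind of bias a too-small limit introduces.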

PR review

Anyone in the community is free to review the PR once the tests have passed. If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

hirofumi0810, Oct 07 '22 20:10