Create audio-based language-id system
Kaldi has some existing recipes for audio-based language ID (see the egs/lre* directories), but their training datasets are inaccessible. It is probably most straightforward to build one ourselves using the language labels in Mozilla Common Voice, plus the labels implied by the per-language datasets here: https://github.com/google/language-resources/
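As a rough sketch of the data prep, something like the following could turn a Common Voice `validated.tsv` into a JSON-lines classification manifest, with the dataset's language code as the label. Column names assume the standard Common Voice TSV layout; the output format mimics NeMo-style manifests, and `duration` is left as a placeholder to be filled in later (e.g. with soundfile):

```python
import csv
import io
import json


def common_voice_to_manifest(tsv_text, clips_dir, lang_label):
    """Convert Common Voice TSV rows into JSON-lines manifest entries,
    labeled with the dataset's language code.

    Assumes the standard Common Voice columns (in particular `path`,
    the clip filename). `duration` is a placeholder to fill in later.
    """
    lines = []
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        lines.append(json.dumps({
            "audio_filepath": f"{clips_dir}/{row['path']}",
            "duration": None,  # compute later, e.g. with soundfile
            "label": lang_label,
        }))
    return "\n".join(lines)
```

One manifest per language, concatenated, would give us a single labeled training set across Common Voice locales.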
Building on top of the speech classification workflow in nemo seems like a reasonable first step: https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/speech_classification/intro.html
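For concreteness, the NeMo recipes are Hydra-configured, so adapting one would mostly mean overriding the label set and manifest paths. A minimal sketch of what that config fragment might look like (the key names follow NeMo's speech-classification configs, but the labels and paths here are illustrative, not a tested setup):

```yaml
model:
  # Illustrative label set; would match whatever languages we prepare manifests for.
  labels: ["en", "de", "fr", "es", "sw"]
  train_ds:
    manifest_filepath: /data/langid/train_manifest.json
    shuffle: true
  validation_ds:
    manifest_filepath: /data/langid/val_manifest.json
    shuffle: false
```

The actual field names should be checked against whichever NeMo recipe we start from.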
Data augmentation is probably a must, since our data is noisier than these source datasets. Start with SpecAugment.
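NeMo has SpecAugment built in, but for reference the core idea is just zeroing out random frequency bands and time spans on the spectrogram. A minimal NumPy sketch (mask counts and widths are illustrative defaults, not tuned values):

```python
import numpy as np


def spec_augment(spec, num_freq_masks=2, freq_mask_width=8,
                 num_time_masks=2, time_mask_width=20, rng=None):
    """Apply SpecAugment-style masking to a (freq_bins, time_steps)
    spectrogram. Returns a masked copy; the input is left untouched."""
    if rng is None:
        rng = np.random.default_rng()
    out = spec.copy()
    n_freq, n_time = out.shape
    # Zero out random horizontal bands (frequency masking).
    for _ in range(num_freq_masks):
        w = int(rng.integers(0, freq_mask_width + 1))
        f0 = int(rng.integers(0, max(1, n_freq - w + 1)))
        out[f0:f0 + w, :] = 0.0
    # Zero out random vertical spans (time masking).
    for _ in range(num_time_masks):
        w = int(rng.integers(0, time_mask_width + 1))
        t0 = int(rng.integers(0, max(1, n_time - w + 1)))
        out[:, t0:t0 + w] = 0.0
    return out
```

Time warping from the original paper is skipped here; masking alone is usually the part that matters most for robustness.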
Ideally the model shouldn't be very big. The goal is a reasonable estimate of our language breakdown from audio, not a highly accurate model.