Create audio-based language-id system
Kaldi has some existing recipes for audio-based language ID (see the egs/lre* directories), but their training datasets are inaccessible. It is probably most straightforward to build one ourselves using the language labels in Mozilla Common Voice, plus the labels implied by the per-language datasets here: https://github.com/google/language-resources/
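As a rough sketch of the data prep, something like the following could turn a Common Voice `validated.tsv` into a JSON-lines classification manifest, with the dataset's language code as the label. Column names assume the standard Common Voice TSV layout; the output format mimics NeMo-style manifests, and `duration` is left as a placeholder to be filled in later (e.g. with soundfile):

```python
import csv
import io
import json


def common_voice_to_manifest(tsv_text, clips_dir, lang_label):
    """Convert Common Voice TSV rows into JSON-lines manifest entries,
    labeled with the dataset's language code.

    Assumes the standard Common Voice columns (in particular `path`,
    the clip filename). `duration` is a placeholder to fill in later.
    """
    lines = []
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        lines.append(json.dumps({
            "audio_filepath": f"{clips_dir}/{row['path']}",
            "duration": None,  # compute later, e.g. with soundfile
            "label": lang_label,
        }))
    return "\n".join(lines)
```

One manifest per language, concatenated, would give us a single labeled training set across Common Voice locales.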
Building on top of the speech classification workflow in nemo seems like a reasonable first step: https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/speech_classification/intro.html
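For concreteness, the NeMo recipes are Hydra-configured, so adapting one would mostly mean overriding the label set and manifest paths. A minimal sketch of what that config fragment might look like (the key names follow NeMo's speech-classification configs, but the labels and paths here are illustrative, not a tested setup):

```yaml
model:
  # Illustrative label set; would match whatever languages we prepare manifests for.
  labels: ["en", "de", "fr", "es", "sw"]
  train_ds:
    manifest_filepath: /data/langid/train_manifest.json
    shuffle: true
  validation_ds:
    manifest_filepath: /data/langid/val_manifest.json
    shuffle: false
```

The actual field names should be checked against whichever NeMo recipe we start from.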
Data augmentation is probably a must, since our data is noisier than these source datasets. Start with SpecAugment.
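NeMo has SpecAugment built in, but for reference the core idea is just zeroing out random frequency bands and time spans on the spectrogram. A minimal NumPy sketch (mask counts and widths are illustrative defaults, not tuned values):

```python
import numpy as np


def spec_augment(spec, num_freq_masks=2, freq_mask_width=8,
                 num_time_masks=2, time_mask_width=20, rng=None):
    """Apply SpecAugment-style masking to a (freq_bins, time_steps)
    spectrogram. Returns a masked copy; the input is left untouched."""
    if rng is None:
        rng = np.random.default_rng()
    out = spec.copy()
    n_freq, n_time = out.shape
    # Zero out random horizontal bands (frequency masking).
    for _ in range(num_freq_masks):
        w = int(rng.integers(0, freq_mask_width + 1))
        f0 = int(rng.integers(0, max(1, n_freq - w + 1)))
        out[f0:f0 + w, :] = 0.0
    # Zero out random vertical spans (time masking).
    for _ in range(num_time_masks):
        w = int(rng.integers(0, time_mask_width + 1))
        t0 = int(rng.integers(0, max(1, n_time - w + 1)))
        out[:, t0:t0 + w] = 0.0
    return out
```

Time warping from the original paper is skipped here; masking alone is usually the part that matters most for robustness.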
Ideally the model shouldn't be very big. The goal is a reasonable estimate of our language breakdown from audio, not a highly accurate model.