
How can I generate a pronunciation dictionary for a new language?

Open DanojaDias opened this issue 2 years ago • 3 comments

I am trying to align a new language (Sinhalese) using MFA, but all I have is a speech corpus; I do not have a pronunciation dictionary. I was going through https://montreal-forced-aligner.readthedocs.io/en/latest/user_guide/workflows/dictionary_generating.html. According to that page, I need a pretrained G2P model for Sinhalese, which is not available, and I am not sure how to proceed from here. Could someone please help?

Basically, my question is: how can I generate a pronunciation dictionary for Sinhalese without a trained G2P model?

Thank you!

DanojaDias avatar Apr 24 '22 16:04 DanojaDias

It looks like you might be able to use https://github.com/CohenPr-XPF/XPF to generate transcriptions for Sinhalese (https://github.com/CohenPr-XPF/XPF/tree/master/Data/si_Sinhala). That's probably the easiest route. I haven't used XPF before, but it looks like it'd just be a matter of running a Python script that points to the si.rules file on the list of words in the corpus (@emilyahn correct me if I'm wrong here).

The alternative, if the orthography is pretty transparent, would be to just write your own rule-based G2P script (i.e. each orthographic symbol is mapped to its phones, such as the symbol for "ka" going to k a).
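As a rough illustration of the rule-based idea, here is a minimal Python sketch. Everything in it is an assumption for illustration: the mapping tables cover only a handful of Sinhala characters, the phone symbols are placeholders, and treating every bare consonant as carrying an inherent "a" is a simplification that a real script would need to refine.

```python
# Minimal rule-based G2P sketch for an abugida-style orthography.
# The tables below are a tiny, hypothetical subset; a usable script
# needs the full character inventory and language-specific rules.
CONSONANTS = {
    "\u0d9a": "k",  # SINHALA LETTER ALPAPRAANA KAYANNA
    "\u0db8": "m",  # SINHALA LETTER MAYANNA
    "\u0dbd": "l",  # SINHALA LETTER DANTAJA LAYANNA
    "\u0db1": "n",  # SINHALA LETTER DANTAJA NAYANNA
}
VOWEL_SIGNS = {
    "\u0dcf": "a:",  # dependent vowel sign AELA-PILLA (long a)
    "\u0dd2": "i",   # dependent vowel sign KETTI IS-PILLA
    "\u0dd4": "u",   # dependent vowel sign KETTI PAA-PILLA
}
INDEPENDENT_VOWELS = {
    "\u0d85": "a",  # SINHALA LETTER AYANNA
    "\u0d89": "i",  # SINHALA LETTER IYANNA
}
VIRAMA = "\u0dca"     # AL-LAKUNA: suppresses the inherent vowel
INHERENT_VOWEL = "a"  # simplification: always "a"


def g2p(word):
    """Return a list of phones for a word, one symbol at a time."""
    phones = []
    i = 0
    while i < len(word):
        ch = word[i]
        if ch in CONSONANTS:
            phones.append(CONSONANTS[ch])
            nxt = word[i + 1] if i + 1 < len(word) else ""
            if nxt == VIRAMA:
                i += 1  # consonant with no vowel
            elif nxt in VOWEL_SIGNS:
                phones.append(VOWEL_SIGNS[nxt])
                i += 1  # vowel sign replaces the inherent vowel
            else:
                phones.append(INHERENT_VOWEL)
        elif ch in INDEPENDENT_VOWELS:
            phones.append(INDEPENDENT_VOWELS[ch])
        else:
            # Fail loudly so gaps in the tables are noticed.
            raise ValueError(f"no rule for character {ch!r} in {word!r}")
        i += 1
    return phones


if __name__ == "__main__":
    print(" ".join(g2p("\u0d9a")))  # the "ka" symbol
```

Raising an error on unknown characters, rather than skipping them, makes it easier to find the symbols that the tables do not yet cover when running over a full word list.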

mmcauliffe avatar Apr 24 '22 18:04 mmcauliffe

Yes, we used XPF to generate the original pronunciations for the acoustic model and pronunciation dictionary. The tool has a nice web interface here: https://cohenpr-xpf.github.io/XPF/Convert-to-IPA.html and then it's just a small amount of post-processing to get it to match the MFA format.
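The post-processing step could look something like the sketch below. It assumes the transcriptions have already been split into individual phones, and it writes one entry per line as the word, a tab, and the phones separated by spaces, which is the plain dictionary layout MFA reads. The function name and the input shape are hypothetical, and the actual XPF output may need extra cleanup before this point.

```python
def format_mfa_dictionary(entries):
    """Turn (word, phones) pairs into MFA-style dictionary text.

    entries: iterable of (word, list_of_phone_strings); this input
    shape is an assumption, not something XPF produces directly.
    """
    lines = set()
    for word, phones in entries:
        word = word.strip()
        phones = [p.strip() for p in phones if p.strip()]
        if not word or not phones:
            continue  # skip entries with no usable pronunciation
        lines.add(word + "\t" + " ".join(phones))
    # Sort for a stable file; duplicates are already collapsed by the set.
    return "\n".join(sorted(lines)) + "\n"


if __name__ == "__main__":
    sample = [("b", ["b", "a"]), ("a", ["a"]), ("b", ["b", "a"])]
    with open("dictionary.txt", "w", encoding="utf-8") as f:
        f.write(format_mfa_dictionary(sample))
```

Writing the file as UTF-8 matters for a script like Sinhala, since the words themselves are non-ASCII.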

echodroff avatar Apr 26 '22 16:04 echodroff

Thank you for your guidance. Meanwhile, I found a pronunciation dictionary for Sinhala here: https://raw.githubusercontent.com/google/language-resources/master/si/data/lexicon.tsv

DanojaDias avatar May 03 '22 05:05 DanojaDias