Montreal-Forced-Aligner
How can I generate a pronunciation dictionary for a new language?
I am trying to align a new language (Sinhalese) using MFA, but all I have is a speech corpus; I do not have a pronunciation dictionary. I was going through https://montreal-forced-aligner.readthedocs.io/en/latest/user_guide/workflows/dictionary_generating.html, but according to that page I need a pretrained G2P model for Sinhalese, which is not available. I am not sure how to proceed from here. Could someone please help?
Basically, my question is: how can I generate a pronunciation dictionary for Sinhalese without a trained G2P model?
Thank you!
It looks like you might be able to use https://github.com/CohenPr-XPF/XPF to generate transcriptions for Sinhalese (https://github.com/CohenPr-XPF/XPF/tree/master/Data/si_Sinhala). That's probably the easiest route. I haven't used XPF before, but it looks like it'd just be a matter of running a Python script that points to the si.rules file on the list of words in the corpus (@emilyahn correct me if I'm wrong here).
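To get "the list of words in the corpus" in the first place, something like the following might work. It is only a sketch: it assumes the transcripts are plain-text .lab or .txt files next to the audio, and the function name collect_words is made up for illustration. It does not strip punctuation, so the transcripts may need cleaning first.

```python
from pathlib import Path


def collect_words(corpus_dir):
    """Return the sorted set of whitespace-separated tokens found in all
    .lab/.txt transcript files under corpus_dir (assumed UTF-8)."""
    words = set()
    for path in Path(corpus_dir).rglob("*"):
        if path.is_file() and path.suffix.lower() in {".lab", ".txt"}:
            words.update(path.read_text(encoding="utf-8").split())
    return sorted(words)
```

The result can be written one word per line and used as input for whichever G2P tool ends up being used.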
The alternative, if the orthography is pretty transparent, would be to just write your own rule-based G2P script (e.g. all ක symbols go to k a).
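A rough sketch of what such a rule-based script could look like is below. The mapping tables are a tiny illustrative subset, not a complete description of Sinhala orthography, and the phone symbols are assumptions chosen for the example. It also treats the inherent vowel as always "a", which real pronunciations may not follow, so treat the output as a starting point to be checked by someone who knows the language.

```python
# Illustrative subset only: a usable table must cover the whole script.
CONSONANTS = {
    "\u0D9A": "k",  # ක
    "\u0DB8": "m",  # ම
    "\u0DBD": "l",  # ල
    "\u0DB1": "n",  # න
}
VOWEL_SIGNS = {"\u0DCF": "aː", "\u0DD2": "i", "\u0DD4": "u"}  # dependent signs
INDEPENDENT_VOWELS = {"\u0D85": "a", "\u0D86": "aː"}  # අ, ආ
VIRAMA = "\u0DCA"  # suppresses the inherent vowel of the preceding consonant
INHERENT_VOWEL = "a"  # assumption: no vowel reduction is modelled


def g2p(word):
    """Map a Sinhala word to a list of phones using the tables above."""
    phones = []
    i = 0
    while i < len(word):
        ch = word[i]
        nxt = word[i + 1] if i + 1 < len(word) else ""
        if ch in CONSONANTS:
            phones.append(CONSONANTS[ch])
            if nxt == VIRAMA:
                i += 2  # bare consonant, no vowel
            elif nxt in VOWEL_SIGNS:
                phones.append(VOWEL_SIGNS[nxt])
                i += 2
            else:
                phones.append(INHERENT_VOWEL)
                i += 1
        elif ch in INDEPENDENT_VOWELS:
            phones.append(INDEPENDENT_VOWELS[ch])
            i += 1
        else:
            raise ValueError(f"no rule for {ch!r} in {word!r}")
    return phones


def dictionary_line(word):
    """Format one entry as: word, a tab, then space-separated phones."""
    return f"{word}\t{' '.join(g2p(word))}"
```

Raising on unknown characters is deliberate here: it surfaces every gap in the rule table instead of silently producing a partial pronunciation.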
Yes, we used XPF to generate the original pronunciations for the acoustic model and pronunciation dictionary. The tool has a nice web interface here: https://cohenpr-xpf.github.io/XPF/Convert-to-IPA.html, and then it's just a small amount of post-processing to get the output to match the MFA format.
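As a hedged illustration of that post-processing step, the sketch below takes lines of the form "word, tab, transcription" and emits dictionary lines with one word, a tab, and space-separated phones. The input layout is an assumption (the actual XPF output may differ and need extra handling), and to_mfa_lines is a hypothetical helper name.

```python
import unicodedata


def to_mfa_lines(raw_lines):
    """Normalize 'word<TAB>phones' lines: apply NFC, collapse whitespace
    between phones, drop empty or malformed lines, remove duplicates, sort."""
    entries = set()
    for line in raw_lines:
        line = unicodedata.normalize("NFC", line).strip()
        if "\t" not in line:
            continue  # blank line or entry with no transcription
        word, pron = line.split("\t", 1)
        phones = pron.split()
        if word.strip() and phones:
            entries.add((word.strip(), " ".join(phones)))
    return [f"{word}\t{pron}" for word, pron in sorted(entries)]
```

Writing the returned lines to a UTF-8 text file, one per line, gives a file that can then be checked against the dictionary format described in the MFA documentation.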
Thank you for your guidance. Meanwhile, I found a pronunciation dictionary for Sinhala here: https://raw.githubusercontent.com/google/language-resources/master/si/data/lexicon.tsv