mfa-models
Align transcript and speech (US + UK)
Hi all, thank you for this amazing repo, really nice work! We want to align transcripts and speech (UK and US English). What is the correct way to do it? If possible, we would prefer to use the ARPA phone set.
Thank you in advance! @yochaiye
Just to add: we tried Use Case 1 with the English MFA dictionary v2_0_0, which covers both UK and US English, but this dictionary does not cover many of the words in our dataset.
Ok, I've added a new Use Case 2 here: https://montreal-forced-aligner.readthedocs.io/en/latest/first_steps/index.html#use-cases, with some extra functionality for expanding pronunciation dictionaries in 2.2.3, so if you update to that and run through the steps there, you should be able to expand out any of the pretrained dictionaries.
I will say that I do not recommend ARPA for UK English, given that the ARPA model has only been trained on 1K hours of US English and ARPA only makes sense for US English, so I would not be surprised to see it struggle with r-lessness. The English MFA model has more UK English and world Englishes training data (though it is still slanted towards US dialects, which I'm hoping to address a bit in a new release soon-ish). The English MFA dictionary contains all pronunciations for all dialects, so if you want to constrain the pronunciation space to just UK English for your UK speakers and just US English for your US speakers, you can specify per-speaker dictionaries for use with the English MFA model: https://montreal-forced-aligner.readthedocs.io/en/latest/user_guide/dictionary.html#per-speaker-dictionaries
Hope that helps!
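For readers following along, the per-speaker dictionary idea above can be sketched as a small helper that writes the speaker-to-dictionary mapping. This is only an illustration under assumptions: the speaker names are hypothetical, `english_us_mfa` and `english_uk_mfa` are assumed dictionary names, and the `default` key plus one `speaker: dictionary` line per speaker is my reading of the linked documentation page, so please check that page for the exact format.

```python
def speaker_dictionary_yaml(speaker_dialects, dictionaries, default):
    """Build the text of a speaker-to-dictionary mapping file.

    speaker_dialects: speaker name -> dialect label (hypothetical labels)
    dictionaries: dialect label -> dictionary name or path (assumed values)
    default: dictionary used for speakers that are not listed
    """
    lines = [f"default: {default}", ""]
    # Sort speakers so the output is stable between runs.
    for speaker, dialect in sorted(speaker_dialects.items()):
        lines.append(f"{speaker}: {dictionaries[dialect]}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    mapping = speaker_dictionary_yaml(
        {"speaker_a": "uk", "speaker_b": "us"},  # hypothetical speakers
        {"us": "english_us_mfa", "uk": "english_uk_mfa"},  # assumed names
        default="english_us_mfa",
    )
    print(mapping)
```

The printed text would be saved to a file and given to MFA in place of a single dictionary, as described on the linked page. Note that this requires knowing each speaker's dialect up front.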
Thank you @mmcauliffe for your fast response.
For the step that creates the OOV file by running:
mfa g2p ~/mfa_data/my_corpus english_us_arpa ~/mfa_data/g2pped_oovs.txt --dictionary_path english_us_arpa
What is the expected structure for the corpus file? One unified .txt file with one line per sample? A single line with all the text? (If it's none of the above, please help us understand the correct structure.)
Thanks.
The general format for corpora in MFA is https://montreal-forced-aligner.readthedocs.io/en/latest/user_guide/corpus_structure.html, but for the g2p command it'll use any text files (.txt, .lab, and .TextGrid) you have in the corpus directory for constructing the word list to run G2P on.
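To make the word-list idea concrete, here is a rough sketch of gathering word types from the plain-text transcripts in a corpus directory. It is not MFA's implementation: the function name `collect_words` is hypothetical, the tokenization is a simple regular expression rather than MFA's own, and .TextGrid files are skipped because reading them needs a dedicated parser.

```python
import re
from pathlib import Path


def collect_words(corpus_dir):
    """Return the sorted, lowercased word types found in .txt and .lab files."""
    words = set()
    for path in Path(corpus_dir).rglob("*"):
        # Only plain-text transcripts are read; audio and other files are ignored.
        if not path.is_file() or path.suffix.lower() not in {".txt", ".lab"}:
            continue
        text = path.read_text(encoding="utf-8")
        # Simplified tokenization (an assumption): runs of letters and apostrophes.
        words.update(re.findall(r"[a-z']+", text.lower()))
    return sorted(words)


if __name__ == "__main__":
    import sys

    for word in collect_words(sys.argv[1]):
        print(word)
```

The point is that the layout of the transcripts (one file per utterance or several lines per file) does not matter for this step, since only the set of words is needed.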
Thank you for your response @mmcauliffe.
Our dataset contains a mix of US and UK English, without metadata indicating which samples are US and which are UK.
I guess there is no g2p that handles such a case, so I wonder which use case we should follow. Would that still be Use Case 2, or is Use Case 5 what we are looking for?