IMS-Toucan
Wrong parsing of AISHELL3 text
I see that on line https://github.com/DigitalPhonetics/IMS-Toucan/blob/v2.5/Utility/path_to_transcript_dicts.py#L478, you are treating the "%" characters in the AISHELL3 transcripts as if they were commas. However, they are actually word delimiters and do not indicate any pause between the words. (Chinese words typically consist of 1-4 characters.)
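For illustration, here is a minimal sketch of dropping the delimiters instead of mapping them to commas. It assumes the transcript text has already been read into a string; `strip_word_delimiters` is a hypothetical helper, not a function from IMS-Toucan, and the sample sentence is made up.

```python
def strip_word_delimiters(text: str, delimiter: str = "%") -> str:
    """Remove word delimiters without inserting any pause marker."""
    # Chinese text needs no separator between words, so the delimiter
    # is simply deleted rather than replaced by a comma or a space.
    return text.replace(delimiter, "")


# Hypothetical transcript text, for illustration only.
sample = "广州%女大学生%登山"
cleaned = strip_word_delimiters(sample)
print(cleaned)
```

Deleting the delimiter keeps the character sequence intact, so the model is not told to pause where the speaker did not.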
I see, thank you for pointing that out!
Do you know of any other way of approximating pause locations in the AISHELL3 transcripts?
It seems that breaks are simply not annotated in the AISHELL3 dataset. However, most of the sentences are fairly short and do in fact seem to contain no breaks.
If you use something like MFA (https://montreal-forced-aligner.readthedocs.io/en/latest/getting_started.html), you can probably work out where the breaks are. Or I suppose you could manually combine the shorter sentences so as to have a training dataset that contains breaks...
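As a rough sketch of the forced-alignment idea: once word-level intervals are available, a pause marker could be inserted wherever the gap between two consecutive words exceeds a threshold. The code below does not call MFA itself. It assumes the alignment has already been loaded into `(start, end, label)` tuples, with an empty label meaning silence; `mark_pauses`, the 0.2 s threshold and the sample timings are all hypothetical.

```python
from typing import List, Tuple

# (start_sec, end_sec, label); an empty label stands for silence.
Interval = Tuple[float, float, str]


def mark_pauses(intervals: List[Interval],
                min_pause: float = 0.2,
                pause_symbol: str = ",") -> str:
    """Join word labels, inserting pause_symbol at long gaps between words."""
    pieces: List[str] = []
    prev_end = None
    for start, end, label in sorted(intervals):
        if not label.strip():
            # Silence interval: skipped, the gap is measured between words.
            continue
        if prev_end is not None and start - prev_end >= min_pause:
            pieces.append(pause_symbol)
        pieces.append(label)
        prev_end = end
    return "".join(pieces)


# Made-up alignment for illustration only.
alignment = [
    (0.0, 0.1, ""),
    (0.1, 0.5, "广州"),
    (0.5, 0.9, "女大学生"),
    (0.9, 1.3, ""),
    (1.3, 1.7, "登山"),
]
print(mark_pauses(alignment))
```

Leading and trailing silence is ignored here, since only pauses inside the utterance are of interest. The threshold would need tuning on real alignments.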