
Wrong parsing of AISHELL3 text

Open tshmak opened this issue 1 year ago • 3 comments

I see that at https://github.com/DigitalPhonetics/IMS-Toucan/blob/v2.5/Utility/path_to_transcript_dicts.py#L478 you are treating the "%" characters in the AISHELL3 transcripts as if they were commas. However, they are actually word delimiters and do not indicate any pause between words. (Chinese words typically consist of 1-4 characters.)
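To illustrate the difference, here is a minimal sketch of treating "%" as a word delimiter rather than as a pause. It is not the code in `path_to_transcript_dicts.py`; the function names and the example string are hypothetical, and it assumes the transcript text has already been extracted from the label file as a plain string.

```python
from typing import List


def split_aishell3_words(transcript: str, delimiter: str = "%") -> List[str]:
    """Split a transcript on the word delimiter, dropping empty pieces."""
    return [word for word in transcript.split(delimiter) if word]


def strip_word_delimiters(transcript: str, delimiter: str = "%") -> str:
    """Remove the word delimiters without inserting any pause marker."""
    return "".join(split_aishell3_words(transcript, delimiter))


# Hypothetical transcript string, made up for illustration only.
example = "广州%女大学生%登山%失联"
print(split_aishell3_words(example))
print(strip_word_delimiters(example))
```

Joining the pieces with an empty string keeps the sentence free of artificial pauses, while the list of words remains available if word boundaries are useful downstream.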

tshmak avatar Aug 15 '23 07:08 tshmak

I see, thank you for pointing that out!

Do you know of any other way of approximating pause locations in the AISHELL3 transcripts?

Flux9665 avatar Aug 15 '23 17:08 Flux9665

It seems that breaks are simply not annotated in the AISHELL3 dataset. Most of the sentences are fairly short, though, and in fact seem to contain no breaks.

tshmak avatar Aug 16 '23 02:08 tshmak

If you use a forced aligner such as MFA (https://montreal-forced-aligner.readthedocs.io/en/latest/getting_started.html), you can probably work out where the breaks are. Alternatively, I suppose you could manually combine the shorter sentences so that the training dataset contains breaks...
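As a rough sketch of the aligner idea: once word-level alignments are available, pauses can be approximated as gaps between consecutive words. The code below does not call MFA; it assumes the alignment has already been read into `(start_sec, end_sec, label)` tuples, that silence shows up either as an empty label or as a gap in time, and that 0.15 s is a reasonable threshold. All three are assumptions, and `find_pauses` is a hypothetical helper.

```python
from typing import List, Tuple

# Hypothetical layout: (start_sec, end_sec, label), one tuple per aligned interval.
Interval = Tuple[float, float, str]


def find_pauses(intervals: List[Interval], min_pause: float = 0.15) -> List[Tuple[float, float]]:
    """Return (start, end) spans between words that last at least min_pause seconds."""
    pauses = []
    prev_end = None
    for start, end, label in sorted(intervals):
        if not label.strip():
            continue  # skip silence intervals; the gap between words is what we measure
        if prev_end is not None and start - prev_end >= min_pause:
            pauses.append((prev_end, start))
        prev_end = end
    return pauses


# Made-up alignment for illustration only.
words = [
    (0.00, 0.40, "广州"),
    (0.40, 0.95, "女大学生"),
    (0.95, 1.30, ""),  # assumed silence interval
    (1.30, 1.80, "登山"),
]
print(find_pauses(words))
```

Each detected span could then be mapped back to a pause marker between the two neighbouring words in the transcript. The threshold would need tuning against the actual recordings.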

tshmak avatar Aug 16 '23 02:08 tshmak