nnmnkwii
Document how to build a speech synthesis system for new languages
All you need is the following:
- Wav files
- Full-context labels
- HTS-style question file
With all of those prepared, it should be very straightforward to implement.
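For illustration, here is a minimal sketch of loading those three ingredients with nnmnkwii's HTS I/O helpers. The file paths are placeholders for your own corpus.

```python
# Minimal sketch: load the three ingredients. Paths below are placeholders.
from scipy.io import wavfile
from nnmnkwii.io import hts

# 1. Wav file
sr, wav = wavfile.read("data/wav/utt0001.wav")

# 2. Full-context label with phone- (or state-) level alignment
labels = hts.load("data/label_phone_align/utt0001.lab")

# 3. HTS-style question file, used to binarize the full-context labels
binary_dict, numeric_dict = hts.load_question_set("data/questions/questions.hed")

print(sr, len(labels), len(binary_dict), len(numeric_dict))
```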
Hi, thank you for this work. About this issue: is the above recipe complete in the case of tonal languages, especially those with rising and falling tones/pitch on vowels and nasal consonants?
Yes. I think you will need to annotate the accent information (rising/falling tones) in the HTS-style labels.
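For example, tone questions can be added to the question file like any other context. The sketch below is hypothetical: the `/T:` field and the five-tone inventory are assumptions about how you might encode tone in your own full-context labels, not something nnmnkwii prescribes.

```python
# Hypothetical tone questions for a five-tone inventory (e.g. Mandarin).
# The "/T:<tone>/" context field is an assumption about your own label
# format; nnmnkwii only matches whatever patterns you write here.
from nnmnkwii.io import hts

tone_questions = """\
QS "C-Tone==1" {*/T:1/*}
QS "C-Tone==2" {*/T:2/*}
QS "C-Tone==3" {*/T:3/*}
QS "C-Tone==4" {*/T:4/*}
QS "C-Tone==5" {*/T:5/*}
"""

# In practice these lines would be appended to your main question file.
with open("questions_tone.hed", "w") as f:
    f.write(tone_questions)

binary_dict, numeric_dict = hts.load_question_set("questions_tone.hed")
```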
Hi, Yamamoto-san. I'm trying to synthesize Mandarin using your tool. To my knowledge, I need to do forced alignment manually first, and then write a frontend adapted to the language to extract linguistic features. So does that mean I only need to replace the frontend part? And could I use other forced alignment tools such as "montreal" at another alignment level which is neither 'state' nor 'phone', for example, 'syllable'?
Hello, @attitudechunfeng!
So does that mean I only need to replace the frontend part?
Yes, you can reuse the other parts. You can also reuse a part of the frontend (https://r9y9.github.io/nnmnkwii/latest/references/frontend.html#frontend) to convert your linguistic features to their numeric representation at either phone, state or frame level if you use the HTS-style label format.
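As a sketch of that reuse (again with placeholder paths), the merlin-style frontend can turn the same labels and question set into either phone-level or frame-level numeric features:

```python
# Sketch: numeric linguistic features at phone level and frame level.
from nnmnkwii.io import hts
from nnmnkwii.frontend import merlin as fe

binary_dict, numeric_dict = hts.load_question_set("data/questions/questions.hed")
labels = hts.load("data/label_phone_align/utt0001.lab")

# One feature vector per label (e.g. for a duration model)
phone_feats = fe.linguistic_features(
    labels, binary_dict, numeric_dict, add_frame_features=False)

# One feature vector per frame (e.g. for an acoustic model)
frame_feats = fe.linguistic_features(
    labels, binary_dict, numeric_dict,
    add_frame_features=True, subphone_features="coarse_coding")

print(phone_feats.shape, frame_feats.shape)
```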
And could I use other forced alignment tools such as "montreal" at another alignment level which is neither 'state' nor 'phone', for example, 'syllable'?
You could, but then you cannot reuse https://r9y9.github.io/nnmnkwii/latest/references/frontend.html#frontend, since it assumes state or phone-level alignment.
Thank you very much! I'll try it.
Alternatively, you could consider an end-to-end approach, which requires neither alignment nor linguistic feature extraction (the hard part of TTS!). See https://github.com/r9y9/deepvoice3_pytorch if you are interested.
Thank you. In fact, I'm also following your other excellent TTS projects. However, I'm now working on offline usage; end-to-end models are not easy to port to mobile devices, and their speed on CPU can't be guaranteed either. So I have to use the traditional method.
I see. I hope you find something useful. Let me know if you find anything that should be improved.
Okay, if there's something interesting, I'll report it.
All you need is the following:
- Wav files
- Full-context labels
- HTS-style question file
With all of those prepared, it should be very straightforward to implement.
I have wav files for the Punjabi language. Please guide me on how to generate full-context labels and an HTS-style question file.