
Document how to build a speech synthesis system for new languages

Open r9y9 opened this issue 6 years ago • 11 comments

All you need is the following:

  • Wav files
  • Full-context labels
  • HTS-style question file

With all of those prepared, it should be very straightforward to implement.
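For reference, here is a minimal sketch (file paths are placeholders) of how the three ingredients could be loaded with nnmnkwii, assuming HTS-style full-context labels and a Merlin/HTS-style question file:

```python
from scipy.io import wavfile

from nnmnkwii.io import hts
from nnmnkwii.frontend import merlin as fe

# Placeholder paths; adjust to your own corpus layout.
wav_path = "data/wav/utt001.wav"
label_path = "data/label_phone_align/utt001.lab"
question_path = "data/questions.hed"

sr, waveform = wavfile.read(wav_path)  # audio for acoustic feature extraction
labels = hts.load(label_path)          # full-context labels
binary_dict, numeric_dict = hts.load_question_set(question_path)

# Phone-level linguistic features: one feature vector per label entry.
linguistic = fe.linguistic_features(labels, binary_dict, numeric_dict)
print(sr, waveform.shape, linguistic.shape)
```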

r9y9 avatar Aug 17 '17 17:08 r9y9

Hi, thank you for this work. About this issue: is the above recipe complete for tonal languages, especially those with rising and falling tones/pitch on vowels and nasal consonants?

ruohoruotsi avatar Dec 28 '17 19:12 ruohoruotsi

Yes. I think you will need to annotate the accent information (rising/falling tones) in the HTS-style labels.
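For illustration, a tone context could be encoded as an extra field in the full-context labels and then queried from the question file. This is only a hypothetical sketch: the "/T:" field and the tone symbols below are made up, and the actual scheme depends on the frontend you write.

```python
from nnmnkwii.io import hts

# Hypothetical QS entries matching a made-up "/T:<tone>_" context field
# that a custom frontend would write into each full-context label.
question_text = """\
QS "C-Tone_High" {*/T:H_*}
QS "C-Tone_Rising" {*/T:R_*}
QS "C-Tone_Falling" {*/T:F_*}
QS "C-Tone_Low" {*/T:L_*}
"""
with open("questions_tone.hed", "w") as f:
    f.write(question_text)

binary_dict, numeric_dict = hts.load_question_set("questions_tone.hed")
print(len(binary_dict), "binary (QS) tone questions loaded")
```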

r9y9 avatar Dec 29 '17 08:12 r9y9

Hi, Yamamoto-san. I'm trying to synthesize Mandarin using your tool. To my knowledge, I need to do forced alignment manually beforehand, and then write a frontend adapted to the language to extract linguistic features. So does that mean I only need to replace the frontend part? And could I use other forced alignment tools such as "montreal" at an alignment level that is neither 'state' nor 'phone', for example 'syllable'?

attitudechunfeng avatar Jan 03 '18 07:01 attitudechunfeng

Hello, @attitudechunfeng!

So does that mean I only need to replace the frontend part?

Yes, you can reuse the other parts. You can also reuse part of the frontend (https://r9y9.github.io/nnmnkwii/latest/references/frontend.html#frontend) to convert your linguistic features to a numeric representation at phone, state, or frame level if you use the HTS-style label format.
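As a sketch of what that reuse can look like (file names are placeholders), the same labels and question set can be turned into phone-level or frame-level input features, plus duration features:

```python
from nnmnkwii.io import hts
from nnmnkwii.frontend import merlin as fe

binary_dict, numeric_dict = hts.load_question_set("questions.hed")
labels = hts.load("utt001.lab")  # phone- or state-aligned full-context labels

# Phone-level input features: one vector per phone.
X_phone = fe.linguistic_features(labels, binary_dict, numeric_dict)

# Frame-level input features with subphone features
# (this needs state-level alignment in the label file).
X_frame = fe.linguistic_features(
    labels, binary_dict, numeric_dict,
    add_frame_features=True, subphone_features="full")

# Duration features for training a duration model.
durations = fe.duration_features(labels)
```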

And could I use other forced alignment tools such as "montreal" at an alignment level that is neither 'state' nor 'phone', for example 'syllable'?

You could, but then you cannot reuse https://r9y9.github.io/nnmnkwii/latest/references/frontend.html#frontend, since it assumes state or phone-level alignment.

r9y9 avatar Jan 03 '18 08:01 r9y9

Thank you very much! I'll try it.

attitudechunfeng avatar Jan 03 '18 08:01 attitudechunfeng

Alternatively, you could consider an end-to-end approach, which requires neither forced alignment nor linguistic feature extraction (the hard part of TTS!). See https://github.com/r9y9/deepvoice3_pytorch if you are interested.

r9y9 avatar Jan 03 '18 09:01 r9y9

Thank you. In fact, I'm also following your other excellent TTS projects. However, I'm now working on offline usage; end-to-end models are not convenient to deploy to mobile devices, and their speed on CPU can't be guaranteed either. So I have to use the traditional method.

attitudechunfeng avatar Jan 03 '18 09:01 attitudechunfeng

I see. I hope you find something useful. Let me know if you find something that should be improved.

r9y9 avatar Jan 03 '18 09:01 r9y9

Okay, if there's something interesting, I'll report it.

attitudechunfeng avatar Jan 03 '18 09:01 attitudechunfeng

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar May 30 '19 01:05 stale[bot]

All you need is the following:

  • Wav files
  • Full-context labels
  • HTS-style question file

With all of those prepared, it should be very straightforward to implement.

I have wav files for the Punjabi language. Please guide me on how to generate full-context labels and an HTS-style question file.

HarmanGhawaddi avatar Apr 10 '21 09:04 HarmanGhawaddi