QPPWG
Why is the synthesized speech not better than WORLD?
Hello, Yi-Chiao WU! I would appreciate it if you could read this issue and give me some feedback.
I used WORLD and your QPPWGaf_20 model (checkpoint 400000) as vocoders to synthesize speech from my own speech files (from the LJSpeech-1.1 corpus). The process is Speech→Extract feature→Synthesis→Speech, following the README.
However, the output of QPPWGaf_20 is neither better nor worse than the output of WORLD.
Is this simply because I didn't use the VCC corpus as input, or are there other reasons?
Hi Yang, Thanks for your question. There are several possible reasons.
First, since the provided pretrained model was trained on very limited data (VCC2018 only), mismatches between a new corpus and VCC2018, such as different channel effects, will cause performance degradation. Training a new model on LJSpeech may solve this problem.
Secondly, if you don't set a suitable F0 range for the LJ speaker, the extracted F0 will include many errors, and the QPPWG model is sensitive to F0 errors. (The details of how to find a suitable F0 range for a new speaker can be found in the referenced sprocket repo.)
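As a quick sanity check on this point, the sketch below counts how many voiced frames of an extracted F0 track fall outside a configured range; a large fraction usually indicates halving/doubling errors from a bad range setting. The bounds and the helper name here are illustrative assumptions, not values from the repo; the real range should come from the F0 histogram step described in sprocket.

```python
import numpy as np

# Hypothetical F0 range for the LJ speaker (assumed values for illustration;
# determine the real bounds from the F0 histogram, as sprocket describes).
MINF0, MAXF0 = 120.0, 350.0

def f0_range_error_rate(f0, minf0=MINF0, maxf0=MAXF0):
    """Fraction of voiced frames whose F0 falls outside [minf0, maxf0].

    A high fraction suggests the extractor produced halving/doubling
    errors, which QPPWG is sensitive to.
    """
    voiced = f0[f0 > 0]  # 0 Hz conventionally marks unvoiced frames
    if voiced.size == 0:
        return 0.0
    bad = np.sum((voiced < minf0) | (voiced > maxf0))
    return float(bad) / voiced.size

# Synthetic example: a ~200 Hz track with one spurious halved frame (100 Hz)
f0 = np.array([0.0, 200.0, 210.0, 100.0, 205.0, 0.0])
print(f0_range_error_rate(f0))  # → 0.25
```

In practice you would run this on the F0 tracks extracted during the feature-extraction step and re-tune the range if the error rate is high.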
By the way, if you can provide some of your generated samples, we may have more information about what is going on.
I have figured out the steps for creating the figure and manually setting the F0 range in the conf file. But could you tell me how to get the power threshold for my own corpus? ^_^ Thx!
I assume you also got a figure plotting the distribution of power (npowhistogram), right? The figure should show one peak above 0 dB corresponding to most of the speech frames and another peak around -20 to -40 dB corresponding to most of the silent frames. We usually set the lowest point between these two peaks as the power threshold. For example, the speech-frame peak for LJSpeech is around 2 dB and the silent-frame peak is around -38 dB, so the power threshold will be around -30 dB.
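The valley-picking procedure described above can be sketched numerically: histogram the per-frame powers and take the lowest bin between the silent-frame peak and the speech-frame peak. This is a minimal illustration with synthetic data, not sprocket's actual code; the function name and peak locations fed in are assumptions (the peak values imitate the LJSpeech numbers mentioned above).

```python
import numpy as np

def power_threshold(npow_db, speech_peak_db, silent_peak_db, nbins=100):
    """Pick the lowest histogram bin between the silent-frame peak and
    the speech-frame peak, i.e. the valley between the two modes."""
    hist, edges = np.histogram(npow_db, bins=nbins)
    centers = (edges[:-1] + edges[1:]) / 2
    between = (centers > silent_peak_db) & (centers < speech_peak_db)
    return float(centers[between][np.argmin(hist[between])])

# Synthetic frame powers imitating LJSpeech: speech frames near 2 dB,
# silent frames near -38 dB (peak locations from the answer above).
rng = np.random.default_rng(0)
npow = np.concatenate([rng.normal(2.0, 4.0, 5000),
                       rng.normal(-38.0, 4.0, 3000)])
thr = power_threshold(npow, speech_peak_db=2.0, silent_peak_db=-38.0)
print(thr)  # a value in the valley between the two peaks
```

On real data you would read the two peak locations off the npow histogram figure first, then set the resulting threshold in the conf file.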