
Tacotron-2 GST expected result

Open · raymond00000 opened this issue 6 years ago · 1 comment

Hi,

I downloaded the checkpoint from here: https://nvidia.github.io/OpenSeq2Seq/html/speech-synthesis.html#speech-synthesis and followed the tutorial to generate an example audio clip.

My understanding is that the checkpoint was trained on the M-AILABS dataset. According to Section 7.2 of the paper ("to synthesize with a specific speaker’s voice, we can simply feed audio from that speaker as a reference signal"), the GST becomes the speaker embedding. So at inference time, I should be able to supply a new English female audio clip to clone that speaker's voice.
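To make sure I'm reading the mechanism right, here is a minimal NumPy sketch of how I understand the GST conditioning. This is my own simplified single-head version, not the OpenSeq2Seq code; all shapes, names, and the random parameters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
num_tokens, token_dim, ref_dim = 10, 256, 128

# Learned parameters (random here; trained in the real model).
style_tokens = rng.standard_normal((num_tokens, token_dim))
query_proj = rng.standard_normal((ref_dim, token_dim))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def style_embedding(ref_encoding):
    # ref_encoding: fixed-size summary of the reference audio, e.g. the
    # final RNN state of the reference encoder run over its mel spectrogram.
    query = ref_encoding @ query_proj          # (token_dim,)
    weights = softmax(style_tokens @ query)    # attention over the style tokens
    return weights @ style_tokens              # weighted sum = style vector

# At inference, the style vector from a new reference clip conditions the
# decoder (e.g. it is broadcast to every text-encoder timestep).
ref = rng.standard_normal(ref_dim)
print(style_embedding(ref).shape)              # (256,)
```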

Here are my questions: (1) Is my understanding correct? (2) I supplied an English female audio clip, but the output is still a male voice. Is that because the female voice is an unseen speaker? (3) What is the difference between the generated "infer_mag.wav" and "infer.wav"?

Thanks!

raymond00000 avatar Oct 04 '19 02:10 raymond00000

  1. Our repo is a re-implementation of the paper, so I cannot speak to any claims made by the paper. In theory, that is how we hope Tacotron-2 GST works; in practice, it is very dependent on your training data.

  2. Yes, I highly doubt that our Tacotron-2 GST will generalize to speakers outside the training set.

  3. infer_mag.wav is the Griffin-Lim reconstruction of the linear/magnitude spectrogram. infer.wav is the Griffin-Lim reconstruction of the mel spectrogram, which is first converted to a linear spectrogram via a matmul with the mel basis. infer_mag.wav should in general sound better than infer.wav. See the sketch below.
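Roughly, the two reconstruction paths look like this (a librosa-based sketch, not our actual code; the STFT parameters and the pseudo-inverse of the mel basis are illustrative assumptions):

```python
import numpy as np
import librosa

sr, n_fft, n_mels = 22050, 1024, 80

# infer_mag.wav path: Griffin-Lim directly on the predicted
# linear/magnitude spectrogram (n_fft // 2 + 1 frequency bins).
def from_linear(mag):                      # mag: (513, T)
    return librosa.griffinlim(mag, n_iter=60)

# infer.wav path: first map the predicted mel spectrogram back to a
# linear spectrogram using the mel filter bank (pseudo-inverse here as an
# illustration), then run Griffin-Lim. The mel -> linear step is lossy,
# which is why infer_mag.wav usually sounds better.
mel_basis = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)  # (80, 513)

def from_mel(mel):                         # mel: (80, T)
    mag = np.maximum(1e-10, np.linalg.pinv(mel_basis) @ mel)
    return librosa.griffinlim(mag, n_iter=60)
```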

blisc avatar Oct 22 '19 18:10 blisc