
poor alignment when conditioned on reference audios

Open mohsinjuni opened this issue 6 years ago • 4 comments

First of all, thank you very much for taking the time to implement this. I have listened to the Audio Samples here and the results are amazing. However, I am unable to replicate them. Could you please help?

I have trained gst-tacotron for 200K steps on LJSpeech-1.1 with the default hyperparameters (SampleAudios.zip). The encoder-decoder alignment is good during training. However, during inference, when conditioned on an unseen reference audio (I used the 2nd target reference audio from here), the alignment does not hold.

The following is from training step 200000:

[alignment plot: step-200000-align]

However, when I evaluated the 203K checkpoint, conditioned on the reference audio discussed above, I get the following:

[alignment plot: eval-203000_ref-sample2-align]

Without conditioning (i.e., with random style weights):

[alignment plot: eval-203000_ref-randomweight-align]

Even style transfer in the voice does not make much difference.

Please find attached a zipped file of the voice samples.

My Questions:

  • Is there anything I can change to get better audio quality and alignment? Thanks in advance for your help.

  • Could you please share the pre-trained model you used to generate the Audio Samples here?

mohsinjuni avatar Sep 26 '18 23:09 mohsinjuni

Hi, I guess there is a mismatch between your reference audio and your training data. For my demo page, I trained the model on the Blizzard2011 data. Before training, I randomly selected 500 sentences as a test set, which provided the reference audio. So in your experiment the reference audio comes from the Blizzard2011 database, but your model was trained on LJSpeech data.
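(Not from the repository, just a suggestion:) one quick sanity check before conditioning is to confirm that the reference audio's sample rate matches the corpus the model was trained on; LJSpeech is 22050 Hz, and a reference recorded at a different rate distorts the mel spectrogram the reference encoder consumes. A minimal sketch using only the Python standard library (the function name is mine, not part of the codebase):

```python
import wave

def check_reference_compatibility(ref_wav_path, expected_sr=22050):
    """Return the reference WAV's sample rate and warn on a mismatch.

    LJSpeech is 22050 Hz; a reference audio from another corpus may
    use a different rate, which shifts the mel features the reference
    encoder sees and can degrade the style embedding and alignment.
    """
    with wave.open(ref_wav_path, "rb") as w:
        sr = w.getframerate()
    if sr != expected_sr:
        print(f"WARNING: reference is {sr} Hz but the model expects "
              f"{expected_sr} Hz; resample the reference audio first.")
    return sr
```

Resampling the reference to the training rate (e.g. with sox or librosa) removes at least this one source of mismatch, though the broader speaker/recording-condition gap between corpora remains.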

I'm sorry, but I'm now an intern at Tencent and no longer have the pre-trained model.

syang1993 avatar Sep 27 '18 03:09 syang1993

Hi, thanks for your quick response. I was under the impression that I could use any reference audio (as a style) and have the model generate new speech in that style. Does it matter which reference I use? Does it have to come from the same distribution as the training data? My assumption was that the model learns the training-data distribution automatically and generates new audio/wav files in whatever style the reference provides. Please correct me if I am wrong. Thanks again for your help.
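For intuition on why the reference distribution matters: in GST-Tacotron the reference encoder produces a query vector that attends over a small set of learned style tokens, and the style embedding is the attention-weighted sum of those tokens. A toy NumPy sketch of that mechanism (the names and shapes are illustrative, not the repository's actual variables):

```python
import numpy as np

def style_embedding(ref_query, token_keys, token_values):
    """Toy sketch of GST-style attention.

    The reference encoder output (ref_query) attends over the learned
    style tokens; the style embedding is the softmax-weighted sum of
    the token values. If the reference audio lies outside the training
    distribution, the query lands off the region the tokens were fit
    to, so the attention weights, and hence the style embedding the
    decoder is conditioned on, become unreliable.
    """
    scores = token_keys @ ref_query              # (num_tokens,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                     # softmax over tokens
    return weights @ token_values                # style embedding
```

Since the tokens are fit only to the training corpus, an out-of-distribution reference does not produce an error, but it yields an embedding the decoder was never trained to attend with, which is consistent with the broken alignments observed above.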

mohsinjuni avatar Sep 27 '18 21:09 mohsinjuni

@mohsinjuni Hi, I'm in the same situation as you. Have you figured out whether the reference audio can be any audio, or whether it must come from the same dataset? Thanks.

liangshuang1993 avatar Dec 17 '18 02:12 liangshuang1993

Hi, I have a similar question. Has anyone found a solution?

shrinidhin avatar Oct 22 '19 05:10 shrinidhin