tacotron

Good checkpoint wav, but bad generated wav

Open antontc opened this issue 6 years ago • 8 comments

Hello. I am using a small dataset and have trained for 92k steps. The checkpoint wav sounds very good, but when I try to generate a wav from text I get something like this: test.zip. The language is English and I use basic_cleaner. What am I doing wrong?

antontc, Aug 14 '18 14:08

Did you achieve alignment?

yoosif0, Aug 23 '18 11:08

I'm facing the same situation. How do I achieve alignment?

shartoo, Aug 29 '18 07:08

For a small dataset, please try my fork: https://github.com/begeekmyfriend/tacotron/tree/master. The master branch may achieve alignment.

begeekmyfriend, Aug 29 '18 07:08

I am having the same issue: the checkpoint audio sounds great and the alignment looks great, but the audio from eval and demo_server is just noise. I thought it could be overfitting, but I can't reproduce the results of step-xxxxxx-audio.wav even with the exact same input text used in training.

Regardless of the amount of training data or alignment, I would expect a model to produce the same result during training and inference, for a given input text. Am I missing something?

sfarina, Oct 26 '18 19:10

For reference, here's an alignment graph generated during training (step-100000-align), and the audio from training and evaluation: audio_clips.zip

The input text is the same for both. Perhaps it has to do with the discontinuity in the alignment...

sfarina, Oct 26 '18 19:10

In case anyone else is having similar issues: I re-read the paper, and I assume the author of this repo implemented the same thing:

During training, we always feed every r-th ground truth frame to the decoder. The input frame is passed to a pre-net as is done in the encoder.

So it makes sense for training and inference to have very different results for the same input text, as the training has the benefit of hearing every r-th (3rd) frame of ground truth audio.
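To make that concrete, here is a toy sketch of the difference between feeding ground truth frames (training) and feeding the model's own output (inference). It is not code from this repo. The ramp signal, the `bias` value, and all function names are made up for illustration. The only thing it borrows from the paper is the idea that each decoder step emits r frames and, during training, takes the last ground truth frame of the previous group as input.

```python
def make_truth(n_frames):
    # Hypothetical ground-truth "frames": a simple ramp 0.0, 1.0, 2.0, ...
    return [float(t) for t in range(n_frames)]


def model_step(prev_frame, r, bias=0.1):
    # Toy imperfect decoder step: emits r frames from one input frame.
    # The true increment per frame is 1.0; the model predicts 1.0 + bias.
    frames = []
    frame = prev_frame
    for _ in range(r):
        frame = frame + 1.0 + bias
        frames.append(frame)
    return frames


def decode(truth, r, teacher_forcing):
    # truth[0] is the initial input frame; truth[1:] are the targets.
    outputs = []
    prev = truth[0]
    t = 1  # index of the next frame to predict
    while t < len(truth):
        group = model_step(prev, r)
        outputs.extend(group)
        last = t + r - 1  # index of the last frame in this group
        if teacher_forcing and last < len(truth):
            # Training: the next input is every r-th ground-truth frame.
            prev = truth[last]
        else:
            # Inference: the next input is the model's own last output.
            prev = group[-1]
        t += r
    return outputs[: len(truth) - 1]


def max_abs_error(pred, truth):
    return max(abs(p - g) for p, g in zip(pred, truth[1:]))


truth = make_truth(31)  # 1 initial frame + 30 frames to predict
r = 3
forced = decode(truth, r, teacher_forcing=True)
free = decode(truth, r, teacher_forcing=False)
print("teacher-forced max error:", max_abs_error(forced, truth))
print("free-running max error:  ", max_abs_error(free, truth))
```

With ground truth fed in, the error is reset every r frames and never exceeds about r times the per-frame bias. Without it, the same small bias accumulates over the whole utterance. A real model is far more complicated, but this is the same mechanism that lets checkpoint audio sound better than eval audio for identical text.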

sfarina, Oct 29 '18 02:10

@sfarina Hi, are you training on English? I am facing the same issue while training on Hindi.

See #232

vaibhavthapliyal, Nov 11 '18 05:11

I have good alignment from 30k to 250k steps on an 18-hour English dataset. As mentioned in the issue, the audio generated at each step during training sounds good (natural). But I get a robotic voice from eval or synthesize with any checkpoint. Is there any possible solution?

Ruthvicp, Feb 25 '19 18:02