tacotron Good checkpoint wav, but bad generated wav

Hello. I am training on a small dataset and have reached 92k steps. The checkpoint wav sounds very good, but when I try to generate a wav from text I get something like this: test.zip. The language is English and I use basic_cleaner. What am I doing wrong?

Aug 14 '18 14:08 antontc

Did you achieve alignment?

Aug 23 '18 11:08 yoosif0

I'm facing the same situation. How do I achieve alignment?

Aug 29 '18 07:08 shartoo

For a small dataset, please try my fork: https://github.com/begeekmyfriend/tacotron/tree/master. The master branch may be able to reach alignment.

Aug 29 '18 07:08 begeekmyfriend

I am having the same issue, where the checkpoint audio sounds great, alignment looks great, but audio from eval and demo_server are just noise. I thought it could be overfitting, but I can't reproduce the results of step-xxxxxx-audio.wav for the exact same input text from training.

Regardless of the amount of training data or alignment, I would expect a model to produce the same result during training and inference, for a given input text. Am I missing something?

Oct 26 '18 19:10 sfarina

For reference, here's an alignment graph generated during training: step-100000-align

And the audio from training and evaluation: audio_clips.zip

The input text is the same for both. Perhaps it has to do with the discontinuity in the alignment...

Oct 26 '18 19:10 sfarina

In case anyone else is having similar issues, I re-read the paper and assume the author here did the same:

"During training, we always feed every r-th ground truth frame to the decoder. The input frame is passed to a pre-net as is done in the encoder."

So it makes sense for training and inference to have very different results for the same input text, as the training has the benefit of hearing every r-th (3rd) frame of ground truth audio.
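To make the difference concrete, here is a toy sketch in plain NumPy. It is not code from this repository: `decoder_step` is a hypothetical, deliberately imperfect one-step predictor, and for simplicity it emits one frame per step instead of r frames. With teacher forcing, each step starts from a ground truth frame, so the error stays small. In free-running mode the decoder consumes its own output, so the same per-step error accumulates.

```python
import numpy as np

def decoder_step(prev_frame):
    # Hypothetical imperfect predictor: the true signal grows by 1.0
    # per step, but this "model" only adds 0.9.
    return prev_frame + 0.9

def decode(ground_truth, teacher_forcing):
    outputs = []
    prev = 0.0  # stand-in for the initial <GO> frame
    for t in range(len(ground_truth)):
        out = decoder_step(prev)
        outputs.append(out)
        # Training: next input is the ground truth frame.
        # Inference: next input is the decoder's own prediction.
        prev = ground_truth[t] if teacher_forcing else out
    return np.array(outputs)

ground_truth = np.arange(1.0, 11.0)  # toy "frames" 1, 2, ..., 10

train_out = decode(ground_truth, teacher_forcing=True)
infer_out = decode(ground_truth, teacher_forcing=False)

train_err = np.abs(train_out - ground_truth).max()
infer_err = np.abs(infer_out - ground_truth).max()
print("max error with teacher forcing:", train_err)
print("max error free-running:", infer_err)
```

In this toy the teacher-forced error stays at the single-step error, while the free-running error grows with sequence length. That matches the symptom in this thread: the step-xxxxxx-audio.wav files are produced with ground truth frames available, whereas eval and demo_server are not.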

Oct 29 '18 02:10 sfarina

@sfarina Hi, are you training on English? I am facing the same issue while training on Hindi.

See #232

Nov 11 '18 05:11 vaibhavthapliyal

I have good alignment from 30k to 250k steps on an 18-hour English dataset. As mentioned in the issue, the audio generated at every step during training sounds good (natural), but I get a robotic voice from eval or synthesize with any checkpoint. Is there any possible solution?

Feb 25 '19 18:02 Ruthvicp
