Good checkpoint wav, but bad generated wav
Hello. I'm using a small dataset, trained to 92k steps. The checkpoint wav sounds good, but when I try to generate a wav from text I get something like this: test.zip. The language is English and I use basic_cleaner. What am I doing wrong?
Did you achieve alignment?
I'm facing the same situation. How do I achieve alignment?
For a small dataset, please try my fork: https://github.com/begeekmyfriend/tacotron/tree/master. On the master branch it may get alignment.
I am having the same issue: the checkpoint audio sounds great and the alignment looks great, but the audio from eval and demo_server is just noise. I thought it could be overfitting, but I can't reproduce the results of step-xxxxxx-audio.wav for the exact same input text used in training.
Regardless of the amount of training data or alignment, I would expect a model to produce the same result during training and inference, for a given input text. Am I missing something?
For reference, here's an alignment graph generated during training, and the audio from training and evaluation:
audio_clips.zip
The input text is the same for both. Perhaps it has to do with the discontinuity in the alignment...
In case anyone else is having similar issues, I re-read the paper and assume the author of this repo did the same:
During training, we always feed every r-th ground truth frame to the decoder. The input frame is passed to a pre-net as is done in the encoder.
So it makes sense that training and inference produce very different results for the same input text: training has the benefit of seeing every r-th (here, 3rd) ground-truth frame.
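To make that concrete, here is a minimal sketch of a Tacotron-style decoder loop under teacher forcing vs. free running. All names here (decoder_step, prenet, the frame shapes) are hypothetical stand-ins for illustration, not this repo's actual API:

```python
import numpy as np

r = 3        # reduction factor: frames predicted per decoder step
n_mels = 80  # mel channels per frame

def prenet(frame):
    # Stand-in for the pre-net (the real one is a small MLP with dropout).
    return frame

def decoder_step(prev_frame, state):
    # Stand-in for one attention + decoder-RNN step that emits r frames.
    out = np.tanh(prev_frame.sum()) * np.ones((r, n_mels))
    return out, state

def decode(ground_truth=None, n_steps=10):
    state = None
    prev = np.zeros(n_mels)  # <GO> frame
    outputs = []
    for t in range(n_steps):
        frames, state = decoder_step(prenet(prev), state)
        outputs.append(frames)
        if ground_truth is not None:
            # Training: feed the r-th *ground-truth* frame back in
            # (teacher forcing), so mistakes are corrected every step.
            prev = ground_truth[t * r + r - 1]
        else:
            # Inference (eval / demo_server): feed the model's *own* last
            # predicted frame back in, so errors compound over time.
            prev = frames[-1]
    return np.concatenate(outputs)

gt = np.random.randn(10 * r, n_mels)
train_like = decode(ground_truth=gt)    # what step-xxxxxx-audio.wav reflects
infer_like = decode(ground_truth=None)  # what eval produces
```

Under teacher forcing the decoder never has to recover from its own mistakes, which is why the step-xxxxxx-audio.wav clips can sound fine while free-running inference from the same checkpoint degrades.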
@sfarina Hi, are you training in English? I am facing the same issue while training in Hindi.
See #232
I have good alignment from 30k to 250k steps on an 18-hour English dataset. As mentioned in this issue, the audio generated at every step during training does sound good (natural), but I get a robotic voice from eval or synthesize with any checkpoint. Any possible solution?