
Tips for getting good results?

ErfolgreichCharismatisch opened this issue 5 years ago • 9 comments

I found that changing the batch size helps with out of memory errors. But I have a few other questions:

  1. Should the sources always end with a period, i.e., one sentence per audio file?
  2. How do I interpret the checkpoint plots, like step-1600-align? I heard they should form a clear diagonal line for a good model.
  3. People say the model should converge. Does that mean the loss should decrease? If it stays around 0.84500, does that mean the model is bad? What do I need to change?

In general, some best practices and guidance on how to interpret the plotted graphs and draw conclusions from them would be appreciated.

Attached are my current results: alignment plots for steps 1600, 1650, 1800, 1900, 2000, 2100, 2500, 3000, 3500, 4000, 4500, 5000, and 5500, plus the step-5500 mel spectrogram.

Train more steps.

The outputs_per_step parameter affects the quality of the Tacotron model. Reducing embedding_dim makes the model converge faster, but it loses a lot of linguistic information.
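For concreteness, here is roughly what those knobs look like in hparams.py. This is a hedged excerpt: the parameter names follow common Tacotron-2 forks (e.g. Rayhane-mamah's) and may differ in yours, and the values are illustrative, not recommendations:

```python
# Illustrative hparams.py excerpt -- check the names in your own fork.
outputs_per_step = 2      # decoder frames predicted per step; larger trains
                          # faster and uses less memory, but coarsens quality
embedding_dim = 512       # shrinking this (e.g. to 256) speeds up convergence
                          # at the cost of linguistic detail
tacotron_batch_size = 32  # lower this to dodge OOM errors; compensate by
                          # training for more steps
```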

Yeongtae · Sep 13 '18

  1. Should the sources always end with a period, i.e., one sentence per audio file?
  • Yes, you should, because there is some silence at the beginning and end of each sound file. But you do not have to.
  2. How do I interpret the checkpoint plots, like step-1600-align? I heard they should form a clear diagonal line for a good model.
  • Your model needs more training steps. (One rough way to put a number on alignment quality is sketched after this list.)
  3. People say the model should converge. Does that mean the loss should decrease? If it stays around 0.84500, does that mean the model is bad? What do I need to change?
  • In my experience, the loss value does not tell you how good your model is. But it should decrease at the beginning of training and then fluctuate within a small range once converged (a simple plateau check is sketched below, after the batch-size note).
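Rather than eyeballing the alignment plots (question 2), you can put a rough number on how diagonal they are. A minimal sketch, assuming the alignment is saved as a 2-D NumPy array of attention weights with shape (decoder steps, encoder steps); the function name and scoring rule are mine, not part of this repo:

```python
import numpy as np

def alignment_diagonalness(align):
    """Rough score in [0, 1]: 1.0 means every decoder step attends exactly
    where a perfect diagonal would predict. `align` holds attention weights
    with shape (decoder_steps, encoder_steps)."""
    dec_steps, enc_steps = align.shape
    expected = np.linspace(0, enc_steps - 1, dec_steps)  # ideal diagonal
    actual = align.argmax(axis=1)                        # attended encoder position
    # mean absolute deviation from the diagonal, normalized by encoder length
    return 1.0 - np.mean(np.abs(actual - expected)) / enc_steps

toy = np.eye(50)                     # perfectly diagonal toy alignment
print(alignment_diagonalness(toy))  # -> 1.0
```

A blurry, horizontally banded plot like the early checkpoints above scores low; as training progresses the score should climb toward 1.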

Of course you can save memory by reducing batch_size, but be careful about the information loss.
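And on question 3, you can check for a plateau mechanically instead of watching the numbers scroll by. A minimal sketch, assuming you keep the per-step loss values in a plain list; the window size and threshold are arbitrary examples, not values from this repo:

```python
import numpy as np

def has_converged(losses, window=500, rel_std=0.01):
    """Heuristic: treat training as converged once the last `window` loss
    values fluctuate within a small band relative to their mean."""
    if len(losses) < window:
        return False
    recent = np.asarray(losses[-window:])
    return recent.std() / recent.mean() < rel_std

# A loss stuck around 0.845 with tiny noise counts as converged:
stuck = list(0.845 + 0.001 * np.random.randn(1000))
print(has_converged(stuck))  # -> True
```

Converged is not the same as good, as noted above; it only tells you that more steps alone are unlikely to change much.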

Thien223 · Sep 13 '18

  • In my experience, the loss value does not tell you how good your model is. But it should decrease at the beginning of training and then fluctuate within a small range once converged.

Did it converge?

  • "your model need more training steps."

How many? I recently read that someone trained for a million steps and the generated voice still wasn't convincing.

Knowing how to "fail quicker" would help.

My Tacotron model produces a human-sounding (noisy) voice at around step 2,000 and is quite clear by step 20,000. Tacotron_results.zip

I don't know, but I think it depends on your data.

Thien223 · Sep 14 '18

I am envious. Please share:

  • The number of audio files and length span (e.g. 200 audio files from 5 to 20 seconds)
  • The total length of all audio files
  • The audio-relevant parts of your hparams.py

This is my mel spectrogram at step 1500:

step-1500-mel-spectrogram

In your case, you can see the clear differentiation between the horizontal lines.
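If you want a reference point for what "clear differentiation between horizontal lines" should look like, you can plot a ground-truth mel spectrogram of one of your own clips and compare it with the training plots. A minimal sketch using librosa; this is not part of the repo, and "sample.wav" is a placeholder path:

```python
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

# Compute a log-mel spectrogram of a real clip for visual comparison with
# the step-*-mel-spectrogram plots (Tacotron typically uses 80 mel bands).
y, sr = librosa.load("sample.wav", sr=22050)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80)
mel_db = librosa.power_to_db(mel, ref=np.max)  # dB scale, like the training plots
librosa.display.specshow(mel_db, sr=sr, x_axis="time", y_axis="mel")
plt.colorbar(format="%+2.0f dB")
plt.title("Ground-truth mel spectrogram")
plt.show()
```

Well-resolved harmonics show up as crisp horizontal bands; if the predicted spectrograms look smeared by comparison, the model usually needs more training or better data.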

Of course you can save memory by reducing batch_size, but be careful about the information loss.

How do I achieve high quality on a GPU with less memory, at the expense of training time?
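One general technique for exactly this trade-off is gradient accumulation: run several small batches that fit in memory and apply the summed gradients as if they were one large batch. Nothing in this thread says the Tacotron-2 repo supports it, so this is only a toy PyTorch sketch of the idea (all names here are made up):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)                       # stand-in for a real model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

accum_steps = 4                                # effective batch = 8 * 4 = 32
optimizer.zero_grad()
for i in range(100):
    x = torch.randn(8, 10)                     # micro-batch that fits in memory
    y = torch.randn(8, 1)
    loss = loss_fn(model(x), y) / accum_steps  # scale so gradients average out
    loss.backward()                            # grads accumulate across micro-batches
    if (i + 1) % accum_steps == 0:
        optimizer.step()                       # one update per 4 micro-batches
        optimizer.zero_grad()
```

The cost is wall-clock time: you do accum_steps forward/backward passes per weight update instead of one.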

Hi, I'm training Tacotron-2 without WaveNet on a different dataset and am facing the same problem: a robotic voice and disturbances. The maximum length of an audio clip in my dataset is 21 seconds, and I trained the model with a batch size of 16. I am not sure whether my alignments are as expected or whether I should train more steps.

Attached are the results: alignment plots for steps 10000, 20000, 30000, 40000, 50000, and 60000, and mel spectrograms for steps 30000, 40000, 50000, and 60000.

Can anyone suggest a way to get better results and less disturbance? Thanks!

helloworld691 · Jan 24 '20