
Tips for getting good results?

ErfolgreichCharismatisch opened this issue 5 years ago • 9 comments

I found that changing the batch size helps with out of memory errors. But I have a few other questions:

  1. Should the sources always end with a period, i.e., one sentence per audio file?
  2. How do I interpret the checkpoint plots, like step-1600-align? I heard they should form a clear diagonal line for a good model.
  3. People say the model should converge. Does that mean the loss should decrease? If it stays around 0.84500, does that mean the model is bad? What do I need to change?

In general, some best practices and guidance on how to interpret the plotted graphs and draw conclusions from them would be appreciated.

Attached are my current results: alignment plots for steps 1600, 1650, 1800, 1900, 2000, 2100, 2500, 3000, 3500, 4000, 4500, 5000, and 5500, plus the step-5500 mel spectrogram.

Train more steps.

The outputs_per_step parameter affects the quality of the Tacotron model. Reducing embedding_dim makes the model converge faster, but it loses a lot of linguistic information.
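For concreteness, here is roughly what those knobs look like in hparams.py. This is a hedged excerpt: the parameter names follow common Tacotron-2 forks (e.g. Rayhane-mamah's) and may differ in yours, and the values are illustrative, not recommendations:

```python
# Illustrative hparams.py excerpt -- check the names in your own fork.
outputs_per_step = 2      # decoder frames predicted per step; larger trains
                          # faster and uses less memory, but coarsens quality
embedding_dim = 512       # shrinking this (e.g. to 256) speeds up convergence
                          # at the cost of linguistic detail
tacotron_batch_size = 32  # lower this to dodge OOM errors; compensate by
                          # training for more steps
```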

Yeongtae · Sep 13 '18

  1. Should the sources always end with a period, i.e., one sentence per audio file?
  • Yes, you should, because there is some silence at the beginning and end of each sound file. But you do not have to.
  2. How do I interpret the checkpoint plots, like step-1600-align? I heard they should form a clear diagonal line for a good model.
  • Your model needs more training steps. (One rough way to put a number on alignment quality is sketched after this list.)
  3. People say the model should converge. Does that mean the loss should decrease? If it stays around 0.84500, does that mean the model is bad? What do I need to change?
  • In my experience, the loss value does not tell you how good your model is. But it should decrease at the beginning of training and then fluctuate within a small range once converged (a simple plateau check is sketched below, after the batch-size note).
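Rather than eyeballing the alignment plots (question 2), you can put a rough number on how diagonal they are. A minimal sketch, assuming the alignment is saved as a 2-D NumPy array of attention weights with shape (decoder steps, encoder steps); the function name and scoring rule are mine, not part of this repo:

```python
import numpy as np

def alignment_diagonalness(align):
    """Rough score in [0, 1]: 1.0 means every decoder step attends exactly
    where a perfect diagonal would predict. `align` holds attention weights
    with shape (decoder_steps, encoder_steps)."""
    dec_steps, enc_steps = align.shape
    expected = np.linspace(0, enc_steps - 1, dec_steps)  # ideal diagonal
    actual = align.argmax(axis=1)                        # attended encoder position
    # mean absolute deviation from the diagonal, normalized by encoder length
    return 1.0 - np.mean(np.abs(actual - expected)) / enc_steps

toy = np.eye(50)                     # perfectly diagonal toy alignment
print(alignment_diagonalness(toy))  # -> 1.0
```

A blurry, horizontally banded plot like the early checkpoints above scores low; as training progresses the score should climb toward 1.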

Of course you can save memory by reducing batch_size, but be careful about the information loss.
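And on question 3, you can check for a plateau mechanically instead of watching the numbers scroll by. A minimal sketch, assuming you keep the per-step loss values in a plain list; the window size and threshold are arbitrary examples, not values from this repo:

```python
import numpy as np

def has_converged(losses, window=500, rel_std=0.01):
    """Heuristic: treat training as converged once the last `window` loss
    values fluctuate within a small band relative to their mean."""
    if len(losses) < window:
        return False
    recent = np.asarray(losses[-window:])
    return recent.std() / recent.mean() < rel_std

# A loss stuck around 0.845 with tiny noise counts as converged:
stuck = list(0.845 + 0.001 * np.random.randn(1000))
print(has_converged(stuck))  # -> True
```

Converged is not the same as good, as noted above; it only tells you that more steps alone are unlikely to change much.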

Thien223 · Sep 13 '18

  • In my experience, the loss value does not tell you how good your model is. But it should decrease at the beginning of training and then fluctuate within a small range once converged.

Did it converge?

  • "your model need more training steps."

How many? I recently read that someone trained for a million steps and the generated voice still wasn't convincing.

Knowing how to "fail quicker" would help.

My Tacotron model produces a human-sounding (noisy) voice at around step 2,000 and is quite clear by step 20,000. Tacotron_results.zip

I don't know, but I think it depends on your data.

Thien223 · Sep 14 '18

I am envious. Please share:

  • The number of audio files and length span (e.g. 200 audio files from 5 to 20 seconds)
  • The total length of all audio files
  • The audio-relevant parts of your hparams.py

This is my mel spectrogram at step 1500:

step-1500-mel-spectrogram

In your case, you can see the clear differentiation between the horizontal lines.
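If you want a reference point for what "clear differentiation between horizontal lines" should look like, you can plot a ground-truth mel spectrogram of one of your own clips and compare it with the training plots. A minimal sketch using librosa; this is not part of the repo, and "sample.wav" is a placeholder path:

```python
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

# Compute a log-mel spectrogram of a real clip for visual comparison with
# the step-*-mel-spectrogram plots (Tacotron typically uses 80 mel bands).
y, sr = librosa.load("sample.wav", sr=22050)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80)
mel_db = librosa.power_to_db(mel, ref=np.max)  # dB scale, like the training plots
librosa.display.specshow(mel_db, sr=sr, x_axis="time", y_axis="mel")
plt.colorbar(format="%+2.0f dB")
plt.title("Ground-truth mel spectrogram")
plt.show()
```

Well-resolved harmonics show up as crisp horizontal bands; if the predicted spectrograms look smeared by comparison, the model usually needs more training or better data.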

Of course you can save memory by reducing batch_size, but be careful about the information loss.

How do I achieve high quality on a GPU with less memory, at the expense of training time?
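One general technique for exactly this trade-off is gradient accumulation: run several small batches that fit in memory and apply the summed gradients as if they were one large batch. Nothing in this thread says the Tacotron-2 repo supports it, so this is only a toy PyTorch sketch of the idea (all names here are made up):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)                       # stand-in for a real model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

accum_steps = 4                                # effective batch = 8 * 4 = 32
optimizer.zero_grad()
for i in range(100):
    x = torch.randn(8, 10)                     # micro-batch that fits in memory
    y = torch.randn(8, 1)
    loss = loss_fn(model(x), y) / accum_steps  # scale so gradients average out
    loss.backward()                            # grads accumulate across micro-batches
    if (i + 1) % accum_steps == 0:
        optimizer.step()                       # one update per 4 micro-batches
        optimizer.zero_grad()
```

The cost is wall-clock time: you do accum_steps forward/backward passes per weight update instead of one.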

Hi, I'm training Tacotron-2 without WaveNet on a different dataset and am facing the same problem: a robotic voice and disturbances. The maximum length of an audio clip in my dataset is 21 seconds, and I trained the model with a batch size of 16. I am not sure whether my alignments are as expected or whether I should train more steps.

Attached are the results: alignment plots for steps 10000, 20000, 30000, 40000, 50000, and 60000, and mel spectrograms for steps 30000, 40000, 50000, and 60000.

Can anyone suggest a way to get better results and less disturbance? Thanks!

helloworld691 · Jan 24 '20