
How many epochs?

Kyubyong opened this issue 8 years ago • 20 comments

Currently, I'm running train.py. The paper says "..., which starts from 0.001 and is reduced to 0.0005, 0.0003, and 0.0001 after 500K, 1M and 2M global steps". 2 million?! So I think I need to be patient. How many epochs do I have to run before human-like samples are generated? Have you tried training for more than 100 epochs?
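Since the schedule is just piecewise-constant on the global step, here is a minimal sketch in plain Python of what the paper describes (the function name is hypothetical; this is not this repo's code):

```python
# Sketch of the paper's learning-rate schedule: start at 0.001, then drop to
# 0.0005, 0.0003, and 0.0001 after 500K, 1M, and 2M global steps respectively.
def paper_learning_rate(global_step):
    boundaries = [500_000, 1_000_000, 2_000_000]
    values = [0.001, 0.0005, 0.0003, 0.0001]
    for boundary, value in zip(boundaries, values):
        if global_step < boundary:
            return value
    return values[-1]

print(paper_learning_rate(400_000))    # 0.001
print(paper_learning_rate(2_500_000))  # 0.0001
```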

Kyubyong avatar May 27 '17 00:05 Kyubyong

What error value have you achieved so far? Could you please post convergence plots? I might be wrong, but I think the current model needs some work before it will be able to synthesize speech from text. For a single wav file, one needs to get the error below 0.08 on average to hear good speech (reached after 400 epochs with 2400 total weight-update steps; I put 6 identical files in the list), and I had to change the default learning rate to achieve that. For two wav files, I was not able to train it to speak both; optimization got stuck near 0.10 despite all my efforts to find a good learning rate/optimizer. With text-to-text sequence-to-sequence models, if one can't get them to reproduce a few training samples exactly, that usually means they won't work on larger sets either, although there are some exceptions. So debugging the model on simple cases is probably needed. Maybe do what the paper describes as "ablation experiments": use a simple GRU encoder and see if it works.
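As a quick sanity check on those numbers (assuming batch size 1, i.e. one weight update per file per epoch; that is my assumption, not stated above):

```python
# 6 identical files, 400 epochs, one update per file per epoch (assumed
# batch size 1) reproduces the reported total of 2400 weight-update steps.
files_in_list = 6
epochs = 400
print(epochs * files_in_list)  # 2400
```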

Durham avatar May 27 '17 09:05 Durham

I trained on a single wav file (2 identical files in the list), changed dropout to 1.0 and the learning rate to 0.01. Trained for 1350k steps (I think it was more than 1000 epochs) and the loss came down to 0.057 (18 h 41 m on a GTX 1080).

[screenshot: training loss curve]

I tried to generate audio with the model using the same text: 3/5 of the file is silent, and the rest has some low-quality speech. I was trying to overfit the network and see how it would generate.

One thing I did not understand: while the loss is 0.057 on the training data, the evaluation script shows a loss of around 0.58 with the same text and wav. Can someone explain the difference between the losses?

onyedikilo avatar May 27 '17 12:05 onyedikilo

The fact that 3/5 of the generated file is silent looks fine, because we intended to reconstruct the zero paddings. The training curve looks good, too. When I was training on the whole data, the training curve looked messy; it just kept hanging around 0.2.
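To make the zero-padding point concrete, a small self-contained sketch (NumPy, illustrative shapes only, not this repo's data pipeline):

```python
import numpy as np

# Target spectrograms in a batch are zero-padded to a common length, so the
# model is also trained to reproduce those all-zero frames, which come out
# as silence in the generated audio.
specs = [np.random.rand(50, 80), np.random.rand(30, 80)]  # two utterances
max_len = max(s.shape[0] for s in specs)
padded = np.stack([np.pad(s, ((0, max_len - s.shape[0]), (0, 0))) for s in specs])
print(padded.shape)          # (2, 50, 80)
print(padded[1, 30:].sum())  # 0.0 -- the padded tail of the shorter item
```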

Kyubyong avatar May 28 '17 01:05 Kyubyong

The silence was at the beginning of the file, not at the end.

I believe your curve is messy because you are using a dropout of 0.5 and a learning rate of 0.0001; it should converge in time, and the spikes should gradually get smaller.

onyedikilo avatar May 28 '17 14:05 onyedikilo

I trained with a single file for about 2000 epochs and got this, where loss1 is the seq2seq loss and loss2 is the spectrogram loss. The total training loss was about 0.017.

[screenshot from 2017-05-29 17-41-05: loss1 and loss2 training curves]
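For readers wondering how a total like loss1 + loss2 is formed: a minimal NumPy sketch of the typical Tacotron formulation (an L1 loss on the mel decoder output plus an L1 loss on the linear spectrogram; this is my assumption, not necessarily this repo's exact code):

```python
import numpy as np

def combined_loss(mel_pred, mel_true, lin_pred, lin_true):
    loss1 = np.mean(np.abs(mel_pred - mel_true))  # seq2seq (mel) loss
    loss2 = np.mean(np.abs(lin_pred - lin_true))  # spectrogram (linear) loss
    return loss1, loss2, loss1 + loss2
```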

ghost avatar May 29 '17 05:05 ghost

I trained the model with the full data for about 130 epochs. The best loss I got was about 0.14. The loss figure is as follows:

[image: training loss curve]

Here is the synthesized audio: http://pan.baidu.com/s/1skMStGT

candlewill avatar Jun 01 '17 07:06 candlewill

Did you adjust the learning rate, and what was your batch_size?

Spotlight0xff avatar Jun 01 '17 07:06 Spotlight0xff

@Spotlight0xff I kept all the hyperparameters unchanged.

candlewill avatar Jun 01 '17 07:06 candlewill

@candlewill how long did it take your machine to reach 180k steps?

ghost avatar Jun 01 '17 22:06 ghost

@minsangkim142 It took about five days with two Tesla M40 24 GB GPUs (only one used for computation).
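For reference, 180k steps in five days works out to roughly 180,000 / (5 × 24 × 3,600 s) ≈ 0.42 steps per second.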

candlewill avatar Jun 02 '17 01:06 candlewill

New synthesized speech samples here: http://pan.baidu.com/s/1miohdVy

It was trained on a small dataset: just the Book of Revelation from the Bible. Epoch 2000. Best loss 0.53.

[image: training loss curve]

candlewill avatar Jun 05 '17 04:06 candlewill

Some human-like voice can be heard, though I can't make out what he(?) is saying. (I think that's natural, because the data is far from enough.) I've recently revised the code. When did you start training?

Kyubyong avatar Jun 05 '17 06:06 Kyubyong

Coming soon!

candlewill avatar Jun 05 '17 06:06 candlewill

@candlewill @Kyubyong any new updates ? Thanks!

xuerq avatar Jun 07 '17 01:06 xuerq

@xuerq I'm running a sanity-check test. I'll share with you as soon as it's done.

Kyubyong avatar Jun 07 '17 10:06 Kyubyong

Does it learn attention when you use only one sample for training? I'm worried it may just be memorizing the whole speech sample rather than predicting it from the text input.

root20 avatar Jun 08 '17 09:06 root20

@candlewill Hi, do you have any suggestions for training the model? I listened to the samples from http://pan.baidu.com/s/1miohdVy. Though the results are not good, they are less noisy than what I synthesized. I'd really appreciate your answer.

jpdz avatar Jul 31 '17 07:07 jpdz

[screenshot from 2017-08-21 10-27-21: training loss curve]

I trained the Tacotron model with 3 audio files, but the loss stays very high. The data is Vietnamese.

tuong-olli avatar Aug 21 '17 03:08 tuong-olli

What is the default number of epochs? And where is it set in the code?

ashupednekar avatar Aug 02 '19 05:08 ashupednekar

@ashupednekar Did you find it?

giridhar-pamisetty avatar Apr 04 '20 03:04 giridhar-pamisetty