
Training time on VCTK.

Open mudong0419 opened this issue 3 years ago • 5 comments

Thanks for your great work. I have been training a multi-speaker VITS model for 160,000 steps over 2 days on 8 V100 GPUs. The synthesized speech is clear but not that fluent. How many steps did you train on the VCTK dataset, and how long did it take? Thanks in advance.

mudong0419 avatar Sep 04 '21 13:09 mudong0419

What is the status now? Do you know the training time on LJ Speech? Thanks.

MaxMax2016 avatar Sep 16 '21 01:09 MaxMax2016

I've trained for 450,000 steps, and the synthesized speech is much better.

mudong0419 avatar Sep 17 '21 03:09 mudong0419

160k steps took 2 days of training even on 8 GPUs, and I only have 1 GPU; the training speed seems too slow. Any ideas to improve this?

HaiFengZeng avatar Jan 28 '22 01:01 HaiFengZeng
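One knob that is already present in the repo's JSON configs and can help on a single GPU is mixed-precision training via the `fp16_run` flag in the `train` section. A minimal sketch of toggling it (the keys follow the repo's configs; the values here are illustrative, not tuned recommendations):

```python
import json

# VITS training configs are JSON files with a "train" section.
# "fp16_run" enables mixed-precision training, which usually speeds up
# each step on a single GPU; a smaller batch size keeps memory in check.
cfg_text = """
{
  "train": {
    "fp16_run": false,
    "batch_size": 64,
    "learning_rate": 2e-4
  }
}
"""
cfg = json.loads(cfg_text)
cfg["train"]["fp16_run"] = True   # switch on mixed precision
cfg["train"]["batch_size"] = 32   # illustrative single-GPU batch size
print(json.dumps(cfg["train"], indent=2))
```

Lowering the batch size also shortens per-step time, but note that fewer samples per step generally means more steps to reach the same data coverage.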

> 160k steps took 2 days of training even on 8 GPUs, and I only have 1 GPU; the training speed seems too slow. Any ideas to improve this?

I imagine the slow convergence is largely due to the jointly optimized HiFi-GAN decoder, since older acoustic models that predict mel spectrograms seemed to be much easier to train. Maybe we can adapt to new data by starting from existing checkpoints? But how would one handle the multi-speaker conditioning in the decoder then?

sos1sos2Sixteen avatar Apr 18 '22 13:04 sos1sos2Sixteen
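On the conditioning question: in VITS-style models the speaker identity enters the decoder as a global conditioning vector from a learned embedding table. A minimal, hypothetical sketch of that pattern (the names `n_speakers` and `gin_channels` follow the repo's conventions, but this stand-in module is not the actual model code):

```python
import torch
import torch.nn as nn

class SpeakerConditionedDecoder(nn.Module):
    """Toy stand-in for a decoder conditioned on a speaker ID."""

    def __init__(self, n_speakers: int, gin_channels: int, hidden_channels: int):
        super().__init__()
        self.emb_g = nn.Embedding(n_speakers, gin_channels)        # one vector per speaker
        self.cond = nn.Conv1d(gin_channels, hidden_channels, 1)    # project g into decoder width
        self.net = nn.Conv1d(hidden_channels, hidden_channels, 3, padding=1)  # stand-in for the upsampling stacks

    def forward(self, z: torch.Tensor, sid: torch.Tensor) -> torch.Tensor:
        # z: (batch, hidden_channels, frames); sid: (batch,) speaker indices
        g = self.emb_g(sid).unsqueeze(-1)   # (batch, gin_channels, 1)
        h = z + self.cond(g)                # broadcast the conditioning over time
        return self.net(h)

dec = SpeakerConditionedDecoder(n_speakers=109, gin_channels=256, hidden_channels=192)
out = dec(torch.randn(2, 192, 50), torch.tensor([3, 7]))
print(out.shape)  # torch.Size([2, 192, 50])
```

So a fine-tuned checkpoint would need either the same embedding table (same speaker set) or a freshly initialized one for the new speakers; the rest of the decoder weights can be reused.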

Hey, can you tell me about the dataset format for multi-speaker training, especially the folder structure?

kin0303 avatar Apr 22 '22 06:04 kin0303
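For reference, the multi-speaker filelists shipped with the VITS repo (e.g. `filelists/vctk_audio_sid_text_train_filelist.txt`) use one pipe-separated line per utterance, `wav_path|speaker_id|text`, so the folder structure itself is flexible as long as the paths in the filelist are correct. A small parsing sketch (the example paths below are hypothetical):

```python
def parse_filelist(lines):
    """Split each 'path|sid|text' line into a (path, sid, text) tuple."""
    items = []
    for line in lines:
        path, sid, text = line.strip().split("|", 2)
        items.append((path, int(sid), text))
    return items

example = [
    "DUMMY2/p225/p225_001.wav|0|Please call Stella.",
    "DUMMY2/p226/p226_001.wav|1|Please call Stella.",
]
for path, sid, text in parse_filelist(example):
    print(sid, path)
```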