
How many steps should we train to get the best results?

Open futureaiengineeer opened this issue 3 years ago • 12 comments

I trained my custom 10-hour 44.1 kHz dataset to 400k steps, but the model does not seem to synthesize good results. How many steps should I train the model to get the best results?

futureaiengineeer avatar Oct 06 '22 03:10 futureaiengineeer

10 hours seems a little too long. I used a 2-hour single-speaker dataset and got good results at 270k steps (batch size 16).

Maybe the quality of your dataset is poor, or something else is wrong.

ZJ-CAI avatar Nov 21 '22 09:11 ZJ-CAI

Hello Everyone,

I'm training on a downloaded VCTK dataset (22050 Hz sampling rate) for the multi-speaker model. I have trained for 350,000 steps, yet the synthesis quality is not as good as the pre-trained models in the repo. How many steps will get a similar result?

I resampled the dataset myself from 48000 Hz to 22050 Hz.

Dataset Source : https://www.kaggle.com/datasets/showmik50/vctk-dataset

athenasaurav avatar Dec 12 '22 06:12 athenasaurav

One update: I noticed that most of the audio files in my dataset have initial silence (before downsampling), so it remains in the 22050 Hz data as well.

I used the set_frame_rate function of pydub's AudioSegment to downsample the audio files, but I didn't use librosa to trim silence. Is it necessary to trim silence at the start and end of every file for better results? A sketch of what I mean is below.
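For concreteness, a minimal sketch of the downsample-plus-trim combination I'm asking about (filenames and the top_db threshold are placeholders, not my actual pipeline):

```python
import librosa
import soundfile as sf
from pydub import AudioSegment

# Downsample 48000 Hz -> 22050 Hz with pydub (what I already do)
audio = AudioSegment.from_wav("p225_001_48k.wav")
audio.set_frame_rate(22050).export("p225_001_22k.wav", format="wav")

# Trim leading/trailing silence with librosa (the step I skipped)
y, sr = librosa.load("p225_001_22k.wav", sr=22050)
y_trimmed, _ = librosa.effects.trim(y, top_db=30)  # 30 dB threshold is a guess
sf.write("p225_001_22k_trimmed.wav", y_trimmed, sr)
```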

athenasaurav avatar Dec 12 '22 07:12 athenasaurav

The authors used 300k steps with batch size 64; start from that.

nikich340 avatar Jan 11 '23 05:01 nikich340

@athenasaurav, did you end up removing the silence? I got to 100k steps, and when generating I get only silence. I thought maybe the problem is that I didn't trim the silence in the dataset.

LanglyAdrian avatar Feb 26 '23 08:02 LanglyAdrian

@LanglyAdrian Yes, silence does cause some issues, but I started getting better results after more epochs.

athenasaurav avatar Feb 26 '23 11:02 athenasaurav

@athenasaurav, can you look at this? I'm starting to doubt that the problem is the presence of silence. After 130k steps there should be at least some sound, but here it's just silence.

LanglyAdrian avatar Feb 26 '23 11:02 LanglyAdrian

@LanglyAdrian I'm not sure what your problem could be. Can you share your inference code? Also, are you passing the speaker IDs as per the VCTK data in the filelists? The expected format is sketched below.
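For reference, the multi-speaker filelists in the repo use pipe-separated path|speaker_id|text lines, roughly like this (DUMMY2 is the repo's placeholder for the VCTK wav root; the speaker IDs here are illustrative):

```
DUMMY2/p225/p225_001.wav|0|Please call Stella.
DUMMY2/p226/p226_001.wav|1|Please call Stella.
```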

athenasaurav avatar Feb 26 '23 11:02 athenasaurav

@athenasaurav For inference, I use the code from the Colab notebook, but with my weights. [screenshot of inference code; roughly reproduced below]

I'm using the original filelists and I've checked that all the wavs are in the correct places and match the information in the filelists.
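Roughly, the cell I run looks like this (adapted from the repo's inference notebook; the checkpoint path and speaker ID are placeholders for my own values):

```python
import torch
import commons
import utils
from models import SynthesizerTrn
from text.symbols import symbols
from text import text_to_sequence

def get_text(text, hps):
    # Convert raw text to symbol IDs, interspersing blanks if configured
    text_norm = text_to_sequence(text, hps.data.text_cleaners)
    if hps.data.add_blank:
        text_norm = commons.intersperse(text_norm, 0)
    return torch.LongTensor(text_norm)

hps = utils.get_hparams_from_file("./configs/vctk_base.json")
net_g = SynthesizerTrn(
    len(symbols),
    hps.data.filter_length // 2 + 1,
    hps.train.segment_size // hps.data.hop_length,
    n_speakers=hps.data.n_speakers,
    **hps.model).cuda()
_ = net_g.eval()
_ = utils.load_checkpoint("logs/vctk_base/G_130000.pth", net_g, None)  # my weights

stn_tst = get_text("Please call Stella.", hps)
with torch.no_grad():
    x_tst = stn_tst.cuda().unsqueeze(0)
    x_tst_lengths = torch.LongTensor([stn_tst.size(0)]).cuda()
    sid = torch.LongTensor([4]).cuda()  # speaker ID from the filelist
    audio = net_g.infer(x_tst, x_tst_lengths, sid=sid, noise_scale=0.667,
                        noise_scale_w=0.8, length_scale=1)[0][0, 0].data.cpu().float().numpy()
```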

LanglyAdrian avatar Feb 27 '23 13:02 LanglyAdrian

@nikich340 You said the authors used 300k steps with batch size 64. My batch_size is 8; do I need to train 2400k steps to get the same results? I can only train about 3k steps per hour, because an epoch takes about 5 minutes, so that would take me about 800 hours. Is that reasonable? I look forward to your reply. Thank you very much! (The arithmetic I'm assuming is spelled out below.)
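A sketch of the naive scaling behind my numbers (assuming the model must see the same total number of samples, which may not be how it actually works):

```python
# Naive "same total samples seen" scaling; my assumption, not a rule
steps_ref, batch_ref = 300_000, 64   # the authors' schedule
batch_mine = 8
steps_mine = steps_ref * batch_ref // batch_mine  # 2,400,000 steps
hours = steps_mine / 3_000                        # at 3k steps/hour on my GPU
print(steps_mine, hours)  # 2400000 800.0
```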

Linghuxc avatar May 04 '23 12:05 Linghuxc

@Linghuxc

It doesn't work like that. I trained for around 350k steps with a batch size of 16 and got good quality; you can do the same with batch size 8. Batch size mainly depends on your GPU memory.

athenasaurav avatar May 04 '23 12:05 athenasaurav

@athenasaurav

Ok, I see what you mean. Thank you very much for your answer!

Linghuxc avatar May 04 '23 12:05 Linghuxc