Retrieval-based-Voice-Conversion-WebUI

Is it expected that v2 trained models underperform compared to v1 trained models on identical settings?

Open kalomaze opened this issue 1 year ago • 6 comments

v1_vs_v2_Examples.zip

5-minute dataset. I also used the Mangio fork, which adds 'crepe' as a training option, for both of these models. Maybe v2 outperforms on a model with a bigger dataset, like 40 minutes?

kalomaze avatar May 16 '23 16:05 kalomaze

I have been informed it is much easier to 'overtrain' v2 models. Both are 1000 epochs; I'll look into that first. But I'm also running into an issue where TensorBoard cuts out at 6k steps, when the full model should in theory have 20+ thousand steps, given a batch size of 4 and a dataset of ~5 minutes (roughly ~100 post-processed wavs).
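(For reference, the back-of-the-envelope math behind that expected step count, assuming ~100 slices after preprocessing and one optimizer step per batch:)

```python
import math

# Assumed values for this run; the slice count is an estimate
# from a ~5 minute dataset after preprocessing.
num_slices = 100
batch_size = 4
epochs = 1000

steps_per_epoch = math.ceil(num_slices / batch_size)  # 25 batches per pass
total_steps = steps_per_epoch * epochs                # 25 * 1000 = 25,000
print(total_steps)  # ~25k, far more than the 6k TensorBoard shows
```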

kalomaze avatar May 16 '23 20:05 kalomaze

Have you tried 20-epoch models of both? You can also try the latest v2 pretrained weights. The v2 model is currently experimental, so performance may decrease in some cases. We won't publish the release until it's been tested completely.

RVC-Boss avatar May 17 '23 02:05 RVC-Boss

The "v2" files here are using the experimental v2 weights. The "v1" example is the normally trained weights model.

According to TensorBoard, the model is at 6,000 steps. This may be inaccurate; perhaps it is reading the logs improperly? (1000 epochs, 5-minute dataset, batch size 4.) I am told that if /g/total starts rising again on the graph after reaching its minimum, it is a sign the model is overtraining. If that is correct, 300-400 epochs was the right stopping point, and 1000 epochs is far too many (for the v2 model only). I can see this being correct for such a small dataset. The same person also told me they saw this overtraining effect start at around 1000 epochs on a 40-minute dataset (again, only on v2).

[screenshot: TensorBoard graph of /g/total over training steps]
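A minimal sketch of how one could find that inflection point programmatically, assuming the metric is logged under a tag like `loss/g/total` (the exact tag name may differ between RVC versions; check `ea.Tags()["scalars"]`):

```python
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

LOG_DIR = "logs/my_experiment"  # hypothetical path to this run's event files
TAG = "loss/g/total"            # assumed tag name

ea = EventAccumulator(LOG_DIR, size_guidance={"scalars": 0})  # 0 = keep every point
ea.Reload()

events = ea.Scalars(TAG)
best = min(events, key=lambda e: e.value)
print(f"minimum {TAG} = {best.value:.4f} at step {best.step}")
# If the curve trends upward after this step, everything past it is a
# candidate for overtraining; pick the saved checkpoint nearest this step.
```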

With this in mind, the 350-epoch v2 model seems to beat the 1000-epoch v1 model based on my first attempt, but more testing is needed. It seems much easier to overtrain a v2 model.

v1_1000_EPOCHS_VS_v2_350_EPOCHS.zip

kalomaze avatar May 17 '23 04:05 kalomaze

The 350e_v2 version is significantly better than the 1000e_v2 version (from the first post).

RVC-Boss avatar May 17 '23 09:05 RVC-Boss

In the earliest version, the max total_epoch was 100; it was later raised to 200. Some people said more epochs might be better and wanted 1000, so now the max total_epoch is 1000. But I think needing fewer training epochs is an advantage of RVC.

RVC-Boss avatar May 17 '23 09:05 RVC-Boss

[screenshot: TensorBoard step count for the new model]

Something is definitely wrong with the step count on that model. I trained a new one today on a different (8-minute) dataset; at 600 epochs it shows 30k steps. 5k is absolutely not the real number for the other one. Do you know what might cause this? Also, this new '30k steps' model definitely gets beaten by the v1 equivalent, so I don't know if there's a better way to tell from TensorBoard statistics whether the model is overtraining.
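One way to cross-check what the TensorBoard UI is displaying would be to read the last step actually recorded in the event files (a sketch, assuming the same EventAccumulator setup and hypothetical log path as above):

```python
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

ea = EventAccumulator("logs/my_experiment", size_guidance={"scalars": 0})
ea.Reload()

# Highest global step recorded under any scalar tag. If this disagrees with
# epochs * steps_per_epoch, the logging (not the training) is the problem.
last_step = max(ea.Scalars(tag)[-1].step for tag in ea.Tags()["scalars"])
print(last_step)
```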

v1_vs_v2_Mord.zip

kalomaze avatar May 17 '23 13:05 kalomaze