vits2_pytorch
Training using SDP (and with DP by ratio?)
This is a follow-up to the previous discussion threads on the stochastic duration predictor (SDP) in https://github.com/p0p4k/vits2_pytorch/issues/11 and https://github.com/p0p4k/vits2_pytorch/issues/68#issuecomment-1839917607, with Bert-VITS2 as a reference.
Regarding training with SDP, I have some feedback and a few questions:
- A few months ago, my experiments with `use_sdp` at earlier steps (100K to 500K) gave below-average results compared to models trained without `use_sdp`: the audio did not sound natural and certain pronunciations were unclear. I now plan to continue training a better-trained checkpoint with SDP enabled (as mentioned in the thread above), and would be curious to hear from anyone who has run similar experiments.
- I am curious whether adding `sdp_ratio` and training both SDP and DP simultaneously would improve results. I am not sure how many code changes this would need, but I would love to open a PR if this sounds good to you!
- About training both SDP and DP together and comparing the results to save time (https://github.com/p0p4k/vits2_pytorch/issues/11#issuecomment-1953523368): my assumption is that training from scratch this way would not sound as good as two-stage training.
- `DurationPredictor` works very well in my experience, but are there any improvements that could be made to either DP model?
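To make the `sdp_ratio` idea concrete, here is a minimal, framework-free sketch of blending the two predictors' outputs. It assumes `sdp_ratio` is a weight in [0, 1] that linearly mixes the per-token log-durations from SDP and DP, which is how the parameter appears to be used at inference time in Bert-VITS2. The helper names `blend_log_durations` and `to_frame_counts` are hypothetical and are not part of vits2_pytorch; in the real model these values would be tensors rather than Python lists.

```python
import math


def blend_log_durations(logw_sdp, logw_dp, sdp_ratio):
    """Linearly mix per-token log-durations from the two predictors.

    sdp_ratio = 1.0 uses only the SDP output, 0.0 uses only the DP output.
    (Hypothetical helper; the mixing rule is an assumption, not repo code.)
    """
    if not 0.0 <= sdp_ratio <= 1.0:
        raise ValueError("sdp_ratio must be in [0, 1]")
    if len(logw_sdp) != len(logw_dp):
        raise ValueError("both predictors must cover the same tokens")
    return [
        sdp_ratio * s + (1.0 - sdp_ratio) * d
        for s, d in zip(logw_sdp, logw_dp)
    ]


def to_frame_counts(logw, length_scale=1.0):
    """Turn log-durations into integer frame counts: exp, scale, round up."""
    return [math.ceil(math.exp(lw) * length_scale) for lw in logw]


# Example: two tokens, equal weight on both predictors.
blended = blend_log_durations([0.0, 1.0], [0.0, 0.5], sdp_ratio=0.5)
frames = to_frame_counts(blended)
```

With `sdp_ratio=0.0` this reduces to the plain `DurationPredictor` path, so a checkpoint trained without SDP would behave as before; whether the same ratio should also weight the two duration losses during training is an open question here.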
==========================
A summary of my experience using `use_sdp` so far (I will update this later when I have more results):
- train using SDP from scratch: does not sound good at all.
- train without SDP from scratch: sounds natural, best-performing checkpoint to date
- train without SDP from scratch, then continue training using SDP: ?
- train with both SDP & DP by ratio: ?