vits2_pytorch

Training using SDP (and with DP by ratio?)

Open isdanni opened this issue 1 year ago • 6 comments

This is a follow-up to the previous discussion threads on the stochastic duration predictor (SDP) in https://github.com/p0p4k/vits2_pytorch/issues/11 and https://github.com/p0p4k/vits2_pytorch/issues/68#issuecomment-1839917607, with Bert-VITS2 as a further reference.

Regarding training with SDP, I have some feedback and a few questions:

  1. A few months ago, my experiments with use_sdp at earlier steps (100K to 500K) gave below-average results compared with models trained without use_sdp: the audio did not sound natural and certain pronunciations were unclear. I now plan to continue training a better-trained checkpoint with SDP enabled (as mentioned in the thread above), and would be curious to hear from anyone who has run similar experiments.

  2. I am curious whether adding sdp_ratio and training both SDP and DP simultaneously would improve the results. I am not sure how many code changes this would need, but I would love to open a PR if this sounds good to you!

  3. About training both SDP & DP together and comparing the results to save time (https://github.com/p0p4k/vits2_pytorch/issues/11#issuecomment-1953523368): if we train from scratch using this method, my assumption is that it will not sound as good as two-stage training.

  4. DurationPredictor works very well in my experience, but are there any improvements that could be made to either of the DP models?
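
To make the sdp_ratio idea in point 2 concrete, here is a minimal sketch of how the two predictors could be mixed at inference time. It assumes the ratio linearly interpolates the log-durations from SDP and DP, which is one plausible reading and is not confirmed anywhere in this thread. The function name `blend_durations` and the `length_scale` parameter are hypothetical, and plain Python lists stand in for the model tensors.

```python
import math


def blend_durations(logw_sdp, logw_dp, sdp_ratio=0.5, length_scale=1.0):
    """Blend per-token log-durations from two predictors (hypothetical helper).

    logw_sdp / logw_dp: per-token log-durations from the stochastic and the
    deterministic duration predictor. sdp_ratio=1.0 uses SDP only,
    sdp_ratio=0.0 uses DP only. The linear mix is an assumption.
    """
    if not 0.0 <= sdp_ratio <= 1.0:
        raise ValueError("sdp_ratio must be in [0, 1]")
    if len(logw_sdp) != len(logw_dp):
        raise ValueError("both predictors must cover the same tokens")

    durations = []
    for s, d in zip(logw_sdp, logw_dp):
        # Interpolate in the log domain, then convert to a frame count.
        logw = sdp_ratio * s + (1.0 - sdp_ratio) * d
        durations.append(math.ceil(math.exp(logw) * length_scale))
    return durations
```

With this shape, both predictors would need to be trained (together or in stages) so that either output is usable, and sdp_ratio becomes a knob that can be tuned by ear without retraining.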

==========================

A summary of my experience with use_sdp so far (I will update this when I have more results):

  • train using SDP from scratch: does not sound good at all.
  • train without SDP from scratch: sounds natural; best-performing checkpoint to date.
  • train without SDP from scratch, then continue training using SDP: ?
  • train with both SDP & DP by ratio: ?
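
One way to view the four setups above is that they differ only in which duration loss is active at a given training step. The sketch below is purely illustrative: the setup names, the helper `active_duration_losses`, and the `switch_step` default are all hypothetical and do not correspond to existing options in vits2_pytorch.

```python
def active_duration_losses(setup, step, switch_step=500_000):
    """Return (use_dp_loss, use_sdp_loss) for a hypothetical training setup.

    switch_step is an assumed step at which two-stage training moves
    from the deterministic predictor to the stochastic one.
    """
    if setup == "sdp_only":      # train using SDP from scratch
        return (False, True)
    if setup == "dp_only":       # train without SDP from scratch
        return (True, False)
    if setup == "dp_then_sdp":   # two-stage: DP first, then continue with SDP
        return (step < switch_step, step >= switch_step)
    if setup == "both":          # train SDP and DP simultaneously
        return (True, True)
    raise ValueError(f"unknown setup: {setup!r}")
```

Framed this way, the two open rows ("?") are the `dp_then_sdp` and `both` cases, and they can be compared against the same `dp_only` baseline at matching step counts.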

isdanni, Feb 27 '24 03:02