Retrieval-based-Voice-Conversion-WebUI
[Feature Request] 2 modes for optimized results / better quality: singing or speech
Is your feature request related to a problem? Please describe. I mainly use RVC to voice different characters. Most of the time it works well enough, but in some cases like screams, breaths, laughs, or vocal fry, the algorithm kind of bugs out and can't follow well, making it sound really weird.
Describe the solution you'd like I'm aware some settings under the hood could be tweaked to get better results; however, these settings aren't exposed to the user. It would be great if we could have some presets to select for inference and training, optimizing the quality of the results for speech or for singing. For example: male speech, female speech, children's speech, male singing, female singing, etc. That could cover the vocal range of each character more accurately.
Describe alternatives you've considered Right now, I've found that using checkpoint fusion can help a tiny bit to extend the vocal range; however, the voice isn't faithful to the original anymore.
Additional context If that's not possible, could we make a pre-trained or a separate breath/scream/laugh model that focuses only on that? Then we could blend the "voice noises" model (breaths, etc.) with the speech model of the same character.
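For what it's worth, the fusion/blending idea boils down to linear interpolation between two checkpoints. Here's a minimal sketch of what that looks like with plain PyTorch, assuming both checkpoints are plain state dicts sharing the same architecture and key names; the file names and the `alpha` value are made up for illustration, this is not RVC's actual merge code:

```python
# Hypothetical sketch: blend two checkpoints by interpolating weights.
# Assumes both files are plain PyTorch state dicts with identical keys
# and tensor shapes (real RVC .pth files wrap extra metadata).
import torch


def blend_checkpoints(path_a: str, path_b: str, alpha: float) -> dict:
    """Return a state dict equal to alpha*A + (1 - alpha)*B, key by key."""
    sd_a = torch.load(path_a, map_location="cpu")
    sd_b = torch.load(path_b, map_location="cpu")
    blended = {}
    for key, tensor_a in sd_a.items():
        tensor_b = sd_b[key]  # assumes identical keys and shapes
        if torch.is_floating_point(tensor_a):
            blended[key] = alpha * tensor_a + (1.0 - alpha) * tensor_b
        else:
            blended[key] = tensor_a  # keep integer buffers from A as-is
    return blended


if __name__ == "__main__":
    # e.g. lean 70% toward the speech model, 30% toward the scream model
    merged = blend_checkpoints("speech_model.pth", "scream_model.pth", alpha=0.7)
    torch.save(merged, "blended_model.pth")
```

An `alpha` closer to 1.0 keeps the result nearer the speech model, which matches your observation that heavy blending stops sounding faithful to the original voice.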
Well, if you use a larger training dataset that includes the voices you mentioned, the model may be able to recognize them. Theoretically, the model has the capability to learn any voice feature.
In this case, does it affect training quality? Many tutorials say the dataset should have a very coherent and stable voice. However, things like whispering, screams, and laughs are very different from the common spoken voice, even when they come from the same person. So, would it be beneficial to train 2 models? One only for screams/breaths and one only for spoken word?
> does it affect training quality?
I'm not quite sure about that, because I haven't tested it.
> the dataset should have a very coherent and stable voice.
Yes, because the dataset used for training in RVC is usually quite small. By "large" I mean a dataset on the same scale as the one used to train the pre-trained model.
> would it be beneficial to train 2 models?
Maybe, but it's hard to split those parts.
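If anyone wants to try the two-model route anyway, a rough first pass at the split could be automated before manual review. The sketch below is only a heuristic, assuming a folder of short WAV segments; it uses librosa's pYIN voicing confidence as a crude proxy (screams, breaths, and laughs tend to be less steadily voiced than normal speech). The folder names and the 0.6 threshold are invented for illustration and the results would still need listening through:

```python
# Hypothetical pre-sorting heuristic: route clips with low average
# voicing confidence to a "nonverbal" bucket for manual review.
import pathlib
import shutil

import librosa
import numpy as np


def voicing_score(path: pathlib.Path) -> float:
    """Mean pYIN voicing probability over the clip (0.0 to 1.0)."""
    y, sr = librosa.load(path, sr=16000)
    f0, voiced_flag, voiced_probs = librosa.pyin(
        y,
        fmin=librosa.note_to_hz("C2"),
        fmax=librosa.note_to_hz("C7"),
        sr=sr,
    )
    return float(np.nanmean(voiced_probs))


src = pathlib.Path("dataset/raw")  # made-up layout for illustration
for wav in sorted(src.glob("*.wav")):
    bucket = "speech" if voicing_score(wav) > 0.6 else "nonverbal_review"
    dest = pathlib.Path("dataset") / bucket
    dest.mkdir(parents=True, exist_ok=True)
    shutil.copy(wav, dest / wav.name)
```

It won't cleanly separate everything (whispering in particular sits in a gray zone), which is exactly why the split is hard, but it can cut down the amount of hand-sorting.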