Retrieval-based-Voice-Conversion-WebUI

[Feature Request] Generative-based AI isolations?

Open kalomaze opened this issue 1 year ago • 0 comments

I totally understand if this is outside the scope of RVC as a project, and I don't have the technical background to say whether this team could reasonably accomplish it. But I noticed the built-in UVR integration for vocal isolation in RVC and had an idea that could potentially work, based on how the RVC project already synthesizes data to match a trained model.

Currently, UVR isolation relies on masking-based models, which attempt to capture as much of the vocals as possible. But I was wondering what would happen if you took the public dataset for UVR (for example, the 2GB multi-song dataset) and ran UVR isolation on the full songs in it. Then, using both the true acapellas from the dataset and the equivalent UVR isolations, you could set up a generative machine learning framework that takes the 'noisy' AI isolations and attempts to generate a version that matches the studio-quality acapellas, to make up for the fact that extraction techniques are always lossy.
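The pairing step described above could be sketched roughly like this. It is only a minimal illustration, not anything from RVC or UVR: the folder layout (one folder of true acapellas, one folder of UVR isolations, matched by file name) and the names `Pair` and `build_pairs` are assumptions made up for the example.

```python
from pathlib import Path
from typing import List, NamedTuple


class Pair(NamedTuple):
    noisy: Path  # UVR isolation of the full mix (would be the model input)
    clean: Path  # true studio acapella (would be the training target)


def build_pairs(clean_dir: Path, uvr_dir: Path, suffix: str = ".wav") -> List[Pair]:
    """Match each true acapella with the UVR isolation that shares its file stem.

    Assumed (hypothetical) layout: clean_dir/<song>.wav and uvr_dir/<song>.wav.
    """
    uvr_by_stem = {p.stem: p for p in uvr_dir.glob(f"*{suffix}")}
    pairs = []
    for clean in sorted(clean_dir.glob(f"*{suffix}")):
        noisy = uvr_by_stem.get(clean.stem)
        if noisy is not None:  # skip songs that have no UVR isolation yet
            pairs.append(Pair(noisy=noisy, clean=clean))
    return pairs
```

Each resulting pair would then serve as one (input, target) example for whatever generative model ends up being trained; the model itself is left open here.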

If these outputs came out decent at all, you could set up a workflow where you feed the UVR outputs to the tool to create 'studio quality' equivalents (of course, manual cleanup and data curation would still be required), and then train a model on that dataset.

It would be similar to how Adobe Audition is already capable of enhancing speech, except in this instance it would be trained specifically to combat UVR-style artifacts instead of low-quality mic artifacts.

So to reiterate, the idea is to take the music stems dataset, run vocal isolation on the 'full' songs from it, and compare those isolations to the official, high-quality acapellas in the same dataset, so that the model can learn what a UVR isolation sounds like and then generate output that matches the style of an 'official' vocal.

I would suggest the Kim vocal 1 or Kim vocal 2 models, or Inst HQ 1 if slight instrumental bleed could also be reduced through a technique like this. (It might get a lot worse if the model isn't trained to clean up vocals only, or it might lead to good results; who knows.) Maybe sticking to one specific isolation model could help ensure the generations are as high quality as possible?

kalomaze avatar May 15 '23 14:05 kalomaze