Retrieval-based-Voice-Conversion-WebUI

10% of audio data is duplicated during preprocessing

Open kalomaze opened this issue 1 year ago • 4 comments

https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI/assets/66376113/ccc991bc-f614-4753-a533-ab0afdd017cf

If you check the preprocessed wavs folder before running feature extraction, you can see that consecutive segments don't line up when played back to back: the last ~0.3s of each segment is repeated at the start of the next. That duplicates roughly 10% of the dataset, which may slightly worsen model quality. Unless there is some reason for this, it seems like a bug that it happens on every single split; duplicate data in theory doesn't help the model learn different pitches or tones at all.

Also, to prevent cutting off mid-sample, you could find the quietest point within the last part of each 3.7s split (maybe the last second?) and cut there; everything from that cut point onward would be treated as the start of the next segment.
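Something like this, for example (a rough sketch of the idea, not RVC's actual code; the helper name and window sizes are made up):

```python
import numpy as np

def quietest_cut_point(audio: np.ndarray, sr: int,
                       search_window: float = 1.0,
                       hop: float = 0.05) -> int:
    """Return a sample index within the last `search_window` seconds
    of `audio` where the signal is quietest (lowest RMS), so the cut
    lands in a natural pause instead of mid-word.

    Hypothetical helper -- not part of RVC's preprocessing code.
    """
    win = int(hop * sr)                                # 0.05 s analysis windows
    search_start = max(0, len(audio) - int(search_window * sr))
    tail = audio[search_start:]

    best_idx, best_rms = search_start, np.inf
    for off in range(0, len(tail) - win, win):
        rms = np.sqrt(np.mean(tail[off:off + win] ** 2))
        if rms < best_rms:
            best_rms = rms
            best_idx = search_start + off + win // 2   # center of quietest window
    return best_idx

# Usage: end this segment at the cut point, start the next one there.
# cut = quietest_cut_point(segment, sr=40000)
# first, rest = segment[:cut], segment[cut:]
```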

kalomaze · May 31 '23 21:05

On top of this, SOVITS training used to require you to do this splitting by hand, and the standard at the time was ~10s portions instead of the ~4s portions RVC currently uses. I would recommend boosting this to 10s, since that may have potential gains for training speed? Or maybe a low value is better, I wouldn't know. Maybe add an option for what interval it splits by?

kalomaze · May 31 '23 21:05

Using duplication is intentional. With the 40k config, each training audio clip is 0.32s long. Without the duplicated (overlapping) part, the training set would shrink, because data that spans the boundary between two segments could never be learned (e.g. a 0.32s clip starting 0.1s before a cut point lies entirely inside neither segment unless the segments overlap).

A longer segment duration won't necessarily achieve better results, but it will definitely increase graphics memory requirements, which makes it difficult for the many people on low-end graphics cards to train. The sovits configuration is not necessarily optimal.
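
For illustration, a minimal sketch of fixed-length slicing with overlap as described here (the 3.7s / 0.3s values come from this thread; the function itself is hypothetical, not the actual preprocessing code):

```python
import numpy as np

def force_slice(audio: np.ndarray, sr: int,
                per: float = 3.7, overlap: float = 0.3):
    """Slice `audio` into `per`-second segments whose start points
    advance by (per - overlap) seconds, so each segment repeats the
    last `overlap` seconds of the previous one.  Audio near a cut
    point then reappears at the start of the next segment, and short
    training clips (~0.32 s in the 40k config) that would straddle a
    hard cut can still be sampled.

    Sketch based on the numbers discussed in this thread, not the
    actual RVC preprocessing code.
    """
    seg_len = int(per * sr)
    hop = int((per - overlap) * sr)
    segments = []
    for start in range(0, len(audio), hop):
        chunk = audio[start:start + seg_len]
        segments.append(chunk)
        if len(chunk) < seg_len:   # the (shorter) tail is kept too
            break
    return segments
```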

RVC-Boss · Jun 01 '23 02:06

Wouldn't splitting the portions based on detected silent gaps between words be a more ideal way to do that? E.g., check the average dB every 0.05s within the last 1s of the segment, then cut at the center of the quietest point and continue the next segment from there. If the point is to preserve continuity, why not split where speech naturally ends, rather than a solution that approximates features more roughly (and likely introduces more duplicate-data bias)? Or is this some form of advanced data augmentation I'm not familiar with?

kalomaze · Jun 01 '23 03:06

At present, RVC's data normalization strategy is to force-slice again after the audio has already been sliced by the "auto slicer", which ultimately cuts the dataset into segments of about 3 seconds. Increasing the forced-slicing threshold would significantly increase graphics memory usage. The "auto slicer" component already splits the audio based on detected silent gaps between words.
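
For reference, a minimal sketch of silence-gap splitting in the spirit of the "auto slicer" (not its actual implementation; the threshold and durations are assumptions):

```python
import numpy as np

def auto_slice(audio: np.ndarray, sr: int,
               threshold_db: float = -40.0,
               min_silence: float = 0.3,
               hop: float = 0.01):
    """Split on silent gaps: stretches where the RMS level stays
    below `threshold_db` for at least `min_silence` seconds become
    cut points (taken at the center of each silent run).

    A sketch in the spirit of the 'auto slicer' component, not its
    actual implementation; all parameter values are assumptions.
    """
    win = int(hop * sr)
    n_frames = len(audio) // win
    rms = np.array([np.sqrt(np.mean(audio[i * win:(i + 1) * win] ** 2))
                    for i in range(n_frames)])
    db = 20 * np.log10(np.maximum(rms, 1e-10))
    silent = db < threshold_db

    cuts, run_start = [], None
    for i, s in enumerate(silent):
        if s and run_start is None:
            run_start = i
        elif not s and run_start is not None:
            if (i - run_start) * hop >= min_silence:
                cuts.append(((run_start + i) // 2) * win)
            run_start = None

    bounds = [0] + cuts + [len(audio)]
    return [audio[a:b] for a, b in zip(bounds, bounds[1:])]
```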

ms903x1 · Jun 01 '23 19:06