More dynamic volume range in future generations?

Open kalomaze opened this issue 1 year ago • 2 comments

When the WAVs are pre-processed, they are all normalized to keep the dataset even and consistent. That makes sense to me as a way to ensure the processed data is relatively uniform, so the model learns to recreate any pitch at the same volume. As a consequence, however, the dynamic range of the inference output is completely gone after conversion: image
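To illustrate the concern, here is a minimal sketch of per-chunk peak normalization, assuming the pre-processing behaves as described above. The helper names (`peak_normalize_chunks`, `rms`) and the one-second chunk size are hypothetical and are not taken from the RVC code base. It only shows that a quiet passage and a loud passage end up at the same level once each chunk is normalized on its own.

```python
import math

def peak_normalize_chunks(samples, chunk_size, target_peak=1.0):
    """Scale every fixed-size chunk so that its peak equals target_peak.

    Hypothetical helper for illustration; not the RVC pre-processing code.
    """
    out = []
    for start in range(0, len(samples), chunk_size):
        chunk = samples[start:start + chunk_size]
        peak = max(abs(s) for s in chunk)
        # Leave silent chunks untouched to avoid dividing by zero.
        gain = target_peak / peak if peak > 0 else 1.0
        out.extend(s * gain for s in chunk)
    return out

def rms(samples):
    """Root-mean-square level of a list of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

# One second of quiet tone followed by one second of loud tone.
sr = 1000
quiet = [0.1 * math.sin(2 * math.pi * 5 * n / sr) for n in range(sr)]
loud = [0.9 * math.sin(2 * math.pi * 5 * n / sr) for n in range(sr)]
signal = quiet + loud

normalized = peak_normalize_chunks(signal, chunk_size=sr)

ratio_before = rms(loud) / rms(quiet)
ratio_after = rms(normalized[sr:]) / rms(normalized[:sr])
print(f"loud/quiet level ratio before: {ratio_before:.2f}, after: {ratio_after:.2f}")
```

In this toy signal the loud passage starts out about nine times louder than the quiet one, and after per-chunk normalization the two are at the same level, which is the loss of dynamics described above.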

I'm wondering whether it would be possible to preserve this dynamic range programmatically, even when the dataset is fully normalized to 0 dB every couple of seconds. You can definitely work around this by manually tweaking and mixing the vocal take in an audio editor, but part of me wonders whether the raspy 'breath issues' mentioned on the page could be caused, at least in part, by this uniform data normalization.

I'm not sure whether this is in the v2 RVC planning, but I did see that 'inference normalization' and 'Inferential post-processing volume envelope fusion input audio volume envelope' were listed there. I'm a bit confused about what these mean and whether they are related to achieving a proper dynamic range. Either way, I'm looking forward to the future efforts from your team.
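One possible reading of 'volume envelope fusion' is that the loudness contour of the input audio gets blended back into the converted output. The sketch below is only a guess at that idea, not the RVC implementation: `rms_envelope`, `fuse_envelope`, the frame size, and the `mix` parameter are all hypothetical names chosen for illustration.

```python
import math

def rms_envelope(samples, frame):
    """RMS level of each non-overlapping frame (hypothetical helper)."""
    env = []
    for start in range(0, len(samples), frame):
        block = samples[start:start + frame]
        env.append(math.sqrt(sum(s * s for s in block) / len(block)))
    return env

def fuse_envelope(source, converted, frame, mix=1.0, eps=1e-6):
    """Rescale `converted` so its per-frame loudness moves toward `source`.

    mix=0.0 leaves the converted audio unchanged;
    mix=1.0 fully adopts the loudness envelope of the source.
    Assumed behaviour for illustration only.
    """
    n = min(len(source), len(converted))
    src_env = rms_envelope(source[:n], frame)
    out_env = rms_envelope(converted[:n], frame)
    fused = []
    for i in range(n):
        k = i // frame
        # Gain that maps the converted frame level onto the source frame level.
        gain = (max(src_env[k], eps) / max(out_env[k], eps)) ** mix
        fused.append(converted[i] * gain)
    return fused

# Input take with a quiet and a loud passage; converted output at a flat level.
sr = 1000
tone = [math.sin(2 * math.pi * 5 * n / sr) for n in range(sr)]
source = [0.1 * s for s in tone] + [0.9 * s for s in tone]
converted = [0.5 * s for s in tone] + [0.5 * s for s in tone]

restored = fuse_envelope(source, converted, frame=sr, mix=1.0)
unchanged = fuse_envelope(source, converted, frame=sr, mix=0.0)
```

With `mix=1.0` the quiet and loud passages of the input reappear in the output, while `mix=0.0` keeps the flat, normalized level. A real implementation would likely use short overlapping frames and interpolate the gain between them, since the stepwise gain used here would click at frame boundaries.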

kalomaze avatar May 13 '23 19:05 kalomaze

@kalomaze If we exposed a parameter that lets users choose the normalization rate applied when processing the training data, would that help you?

RVC-Boss avatar May 14 '23 03:05 RVC-Boss

Yes, that would be helpful for seeing whether it makes a difference.

kalomaze avatar May 14 '23 08:05 kalomaze