Retrieval-based-Voice-Conversion-WebUI More dynamic volume range in future generations?

Open kalomaze opened this issue 1 year ago • 2 comments

During pre-processing, the wavs are all normalized to keep the dataset even and consistent. This makes sense to me as a way to keep the processed data relatively uniform, so the model learns to recreate any pitch at the same volume. However, as a consequence, the dynamic range of the inferred audio is completely gone after conversion.

I'm wondering if it would be possible to programmatically preserve this dynamic range even when the dataset is fully normalized to 0 dB every couple of seconds. You can definitely work around this by manually tweaking and mixing the vocal take in an audio editor, but part of me wonders if the raspy 'breath issues' that are mentioned on the page could be caused, at least in part, by this uniform data normalization.

I'm not sure if this is in the v2 RVC planning, but I did see that 'inference normalization' and 'Inferential post-processing volume envelope fusion input audio volume envelope' were listed there. I'm a bit confused about what these mean and whether they are related to achieving a proper dynamic range. Either way, I'm looking forward to the future efforts from your team.
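For anyone else wondering what 'volume envelope fusion' could look like in practice, here is a minimal sketch of one possible reading: measure the frame-wise RMS loudness of the input audio and of the converted audio, then scale the converted audio so its loudness follows the input. This is not taken from the RVC code. The function names, the `mix` parameter, and the frame sizes are all assumptions made for illustration.

```python
import numpy as np

def rms_envelope(x, frame=1024, hop=256):
    """Frame-wise RMS of a mono signal, one value every `hop` samples."""
    padded = np.pad(x, frame // 2, mode="reflect")
    n_frames = 1 + (len(padded) - frame) // hop
    env = np.empty(n_frames)
    for i in range(n_frames):
        seg = padded[i * hop : i * hop + frame]
        env[i] = np.sqrt(np.mean(seg ** 2))
    return env

def fuse_envelope(source, converted, mix=1.0, frame=1024, hop=256, eps=1e-6):
    """Hypothetical helper: make `converted` follow the loudness of `source`.

    mix=0.0 leaves the converted audio untouched,
    mix=1.0 fully imposes the source loudness envelope.
    """
    n = min(len(source), len(converted))
    source, converted = source[:n], converted[:n]

    env_src = rms_envelope(source, frame, hop)
    env_out = rms_envelope(converted, frame, hop)

    # Per-frame gain; the exponent blends between "no change" and "full match".
    gain = (np.maximum(env_src, eps) / np.maximum(env_out, eps)) ** mix

    # Stretch the per-frame gain to one value per sample.
    frame_pos = np.arange(len(gain)) * hop
    gain_per_sample = np.interp(np.arange(n), frame_pos, gain)
    return converted * gain_per_sample
```

With `mix` somewhere between 0 and 1, the output would keep part of the model's flat loudness and part of the original performance's dynamics, which is roughly the manual mixing step described above done automatically.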

May 13 '23 19:05 kalomaze

@kalomaze If we expose a parameter that lets users choose how strongly the training data is normalized during processing, would that help you?
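As a rough illustration of what such a parameter might do, the sketch below blends each clip between its original level and a fully peak-normalized version. It is only an assumption about how the option could behave, not the project's actual preprocessing code. The names `normalize_peak`, `target_peak`, and `strength` are hypothetical.

```python
import numpy as np

def normalize_peak(audio, target_peak=0.9, strength=1.0, eps=1e-9):
    """Blend between the original clip and a fully peak-normalized copy.

    strength=1.0 -> the peak lands exactly at target_peak (full normalization)
    strength=0.0 -> the clip is returned unchanged (original dynamics kept)
    """
    peak = np.max(np.abs(audio))
    if peak < eps:
        # Silent clip: nothing to scale, avoid dividing by zero.
        return audio.copy()
    normalized = audio / peak * target_peak
    return strength * normalized + (1.0 - strength) * audio
```

A lower `strength` would leave quiet clips quieter than loud ones in the training set, so the model would see some of the original level differences instead of uniformly loud data.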

May 14 '23 03:05 RVC-Boss

Yes, it would be helpful to see whether that makes a difference.

May 14 '23 08:05 kalomaze
