[Feature]: synthesis quality
Description
Hello! A quick question about synthesis quality.
I'm noticing that, even with well-trained models and high index rates, certain fine-grained features like breath sounds and fricatives (especially "s" and "sh") are often smoothed out or not reproduced clearly. Sometimes, breaths are skipped entirely, or "sss" turns into a mushy noise or gets truncated.
Is this:

- due to limitations of the encoder (e.g. HuBERT or ContentVec)?
- a consequence of vocoder smoothing (e.g. HiFi-GAN artifacts)?
- something that could be improved by data preprocessing (retaining more high frequencies)?
- or maybe RMVPE not tracking breathy textures accurately?

(A quick measurement that may help narrow this down is sketched below.)
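One way to localize where the sibilance is lost is to compare energy in the ~4–10 kHz band (where "s" and "sh" live) between a source clip and its converted output. This is a minimal diagnostic sketch, not part of RVC or Applio; the file names and band limits are placeholders:

```python
import librosa
import numpy as np

# Diagnostic sketch, not part of RVC: measure energy in the sibilance band
# (~4-10 kHz) so a source clip and its converted output can be compared.
# File names and band limits are placeholders.
def band_energy_db(path, lo=4000.0, hi=10000.0, n_fft=2048):
    y, sr = librosa.load(path, sr=None)
    power = np.abs(librosa.stft(y, n_fft=n_fft)) ** 2
    freqs = librosa.fft_frequencies(sr=sr, n_fft=n_fft)
    band = power[(freqs >= lo) & (freqs <= hi)].mean()
    return 10 * np.log10(band + 1e-12)

src = band_energy_db("input.wav")
out = band_energy_db("converted.wav")
print(f"source: {src:.1f} dB | converted: {out:.1f} dB | loss: {src - out:.1f} dB")
```

If the converted output has much less band energy than the source, the detail is being lost somewhere in the pipeline (encoder, vocoder, or pitch extraction); if the source itself is weak up there, it's a recording/preprocessing problem.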
Thanks in advance! 🙏
Problem
Any tips for improving these details during inference or training (e.g. adding breath-heavy samples, using a different pitch extractor, or training at higher sampling rates)?
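On the "higher sampling rates" point, it's worth checking first that the dataset is genuinely wideband: a clip upsampled from 16 kHz reports a high sample rate but carries almost no energy above ~8 kHz, so training at 48 kHz gains nothing. A rough sketch (the directory path and the 8 kHz threshold are assumptions):

```python
import glob
import librosa
import numpy as np

# Sketch: check that dataset files are genuinely wideband. A clip upsampled
# from 16 kHz reports a high sample rate but has almost no energy above
# ~8 kHz. Directory path and the 8 kHz threshold are assumptions.
for path in sorted(glob.glob("dataset/*.wav")):
    y, sr = librosa.load(path, sr=None)
    power = np.abs(librosa.stft(y, n_fft=2048)) ** 2
    freqs = librosa.fft_frequencies(sr=sr, n_fft=2048)
    hf_ratio = power[freqs > 8000].sum() / (power.sum() + 1e-12)
    print(f"{path}: {sr} Hz, energy above 8 kHz: {100 * hf_ratio:.3f}%")
```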
Proposed Solution
Include breath-heavy phrases: Add clips where the speaker breathes audibly (inhaling/exhaling), especially between phrases.
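To verify a dataset actually contains enough of these, one rough heuristic is to count frames that are audible but unvoiced, since breaths carry energy but no pitch. A sketch with made-up thresholds that would need tuning per dataset:

```python
import librosa
import numpy as np

# Heuristic sketch: breaths are audible but unvoiced, so count frames with
# moderate RMS energy and no pitch detected. The thresholds below are
# guesses and need tuning per dataset; the file name is a placeholder.
y, sr = librosa.load("clip.wav", sr=None)
hop = 512
rms = librosa.feature.rms(y=y, hop_length=hop)[0]
f0, voiced_flag, _ = librosa.pyin(y, fmin=65, fmax=1000, sr=sr, hop_length=hop)

n = min(len(rms), len(voiced_flag))
audible = (rms[:n] > 0.005) & (rms[:n] < 0.05)  # well below speech level
breath_like = audible & ~voiced_flag[:n]
print(f"{100 * breath_like.mean():.1f}% of frames look breath-like")
```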
Alternatives Considered
Maybe a different vocoder algorithm?
I have the same questions.
I think this might be accurate for you: https://deepwiki.com/search/description-hello-quick-questi_e4904694-54e3-42cd-b050-0d8a904197c9
Having completed a couple of 1000-epoch training runs this week, my conclusion is that data quality (fidelity) matters most. It definitely helps to slightly sharpen the audio in your training data; after inference, it's easy to soften it back to normal. The rule of thumb is that your training audio should be as dry as possible (no reverb, no echo, no chorus or voice doubling) and as sharp as possible. If you don't have access to a high-quality recording studio for your dataset, VoiceFixer can do wonders, even for mediocre-quality audio.
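The commenter doesn't say which tool they used for sharpening; one simple, exactly reversible way to do it is a pre-emphasis filter, which librosa pairs with a matching de-emphasis. A minimal sketch (file names and the 0.5 coefficient are placeholders; librosa's default of 0.97 is stronger):

```python
import librosa
import soundfile as sf

# One way to implement "sharpen for training, soften after inference":
# a pre-emphasis filter and its exact inverse. This is only an illustration;
# file names and the 0.5 coefficient are placeholders.
y, sr = librosa.load("train_clip.wav", sr=None)
sf.write("train_clip_sharp.wav", librosa.effects.preemphasis(y, coef=0.5), sr)

# After inference, undo the boost with the matching de-emphasis:
y_out, sr_out = librosa.load("inference_out.wav", sr=None)
sf.write("inference_out_soft.wav", librosa.effects.deemphasis(y_out, coef=0.5), sr_out)
```

Since pre-emphasis and de-emphasis form an exact inverse pair, the post-inference correction loses nothing, which fits the "easy to reduce back to normal" step.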
Also, fun fact: blending two models (a smooth one and a sharp one) eventually gave me the best model, one that still manages to blow me away. An untrained ear couldn't even suspect that the voice isn't real or that the recording didn't come from some fancy studio. The inference result will always inherit the qualities and characteristics of the training dataset, so it's not just the voice that you're training into your model.
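For anyone curious what "blending" means concretely: RVC-style tools expose a model-fusion option that linearly averages the weights of two checkpoints. A hedged sketch of the idea (the "weight" key, file names, and blend ratio are assumptions about the checkpoint layout, not a confirmed Applio API):

```python
import torch

# Hedged sketch of "blending" two trained models by linearly averaging their
# weights. The "weight" key and file names are assumptions about the
# checkpoint layout, not a confirmed Applio API.
ALPHA = 0.5  # 0.0 = all "smooth" model, 1.0 = all "sharp" model

smooth = torch.load("smooth.pth", map_location="cpu")
sharp = torch.load("sharp.pth", map_location="cpu")

blended = dict(smooth)
blended["weight"] = {
    k: (1 - ALPHA) * smooth["weight"][k].float() + ALPHA * sharp["weight"][k].float()
    for k in smooth["weight"]
}
torch.save(blended, "blended.pth")
```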
> Having completed a couple of 1000-epoch training runs this week
If you're using Applio version 3.4.0+, you're probably wasting 10x more time than needed for a model to sound good.