
[Feature]: synthesis quality

Open LaScienceMusicale opened this issue 5 months ago • 2 comments

Description

Hello! Quick question about synthesis quality.

I'm noticing that, even with well-trained models and high index rates, certain fine-grained features like breath sounds and fricatives (especially "s" and "sh") are often smoothed out or not reproduced clearly. Sometimes, breaths are skipped entirely, or "sss" turns into a mushy noise or gets truncated.

Is this:

- due to the limitations of the encoder (e.g. HuBERT or ContentVec)?
- a consequence of vocoder smoothing (e.g. HiFi-GAN artifacts)?
- something that could be improved by data preprocessing (retaining more high frequencies)?
- or maybe something like RMVPE not tracking breathy textures accurately?

Thanks in advance! 🙏

Problem

Any tips to improve these details during inference or training (e.g. adding breath-heavy samples, using different pitch extractors, or training with higher sampling rates)?

Proposed Solution

Include breath-heavy phrases: Add clips where the speaker breathes audibly (inhaling/exhaling), especially between phrases.

Alternatives Considered

Maybe a different vocoder algorithm?

LaScienceMusicale avatar Aug 05 '25 12:08 LaScienceMusicale

I have the same doubts.

Anilams avatar Sep 07 '25 06:09 Anilams

I think this might be helpful for you: https://deepwiki.com/search/description-hello-quick-questi_e4904694-54e3-42cd-b050-0d8a904197c9

blaisewf avatar Sep 12 '25 22:09 blaisewf

Having completed a couple of 1000-epoch training runs this week, my conclusion is that data quality (fidelity) matters the most. It definitely helps to slightly sharpen the audio in your training data, and after inference it's easy to reduce it back to normal. But the rule of thumb is that your training audio should be as dry as possible (no reverb, no echo, no chorus or voice doubling) and as sharp as possible. If you don't have access to a high-quality recording studio for your dataset, Voicefixer can do wonders, even for mediocre-quality audio.
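The exact EQ used above isn't specified, but a classic, easily reversible way to sharpen speech before training and then "reduce back to normal" after inference is a pre-emphasis/de-emphasis filter pair. This is a minimal NumPy sketch (the coefficient `a=0.97` is a common speech-processing default, not a value taken from this thread):

```python
import numpy as np

def pre_emphasis(x, a=0.97):
    """Boost high frequencies (sharpen): y[n] = x[n] - a * x[n-1]."""
    y = np.empty_like(x)
    y[0] = x[0]
    y[1:] = x[1:] - a * x[:-1]
    return y

def de_emphasis(y, a=0.97):
    """Exact inverse of pre_emphasis: x[n] = y[n] + a * x[n-1]."""
    x = np.empty_like(y)
    x[0] = y[0]
    for n in range(1, len(y)):
        x[n] = y[n] + a * x[n - 1]
    return x
```

Because the two filters are exact inverses, a round trip recovers the original signal, so the sharpening applied to the dataset can be undone on the inferred output without loss.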

Also, fun fact... blending two models (a smooth one and a sharp one) eventually gave me the best model, one that still manages to blow me away. An untrained ear couldn't even suspect that the voice isn't real, or that your audio recording isn't coming from some fancy studio. The inference result will always inherit the qualities and characteristics of the training dataset, so it's not just the voice that you're training into your model.
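The thread doesn't spell out how the blending was done, but model blending in the RVC/Applio world typically means linear interpolation of matching checkpoint weights. This toy sketch uses plain NumPy arrays as stand-ins for real state-dict tensors (the names `blend_checkpoints` and `layer.weight` are illustrative, not Applio's API):

```python
import numpy as np

def blend_checkpoints(weights_a, weights_b, alpha=0.5):
    """Linearly interpolate two checkpoints with identical layer names/shapes.

    alpha=0.0 returns model A unchanged, alpha=1.0 returns model B.
    """
    assert weights_a.keys() == weights_b.keys(), "checkpoints must match"
    return {
        name: (1.0 - alpha) * weights_a[name] + alpha * weights_b[name]
        for name in weights_a
    }

# Toy "checkpoints" standing in for a smooth and a sharp model:
smooth = {"layer.weight": np.zeros(4)}
sharp = {"layer.weight": np.ones(4)}
blended = blend_checkpoints(smooth, sharp, alpha=0.5)
```

Sweeping `alpha` between 0 and 1 lets you audition intermediate blends and pick the ratio that balances smoothness against detail.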

harlyh avatar Dec 09 '25 09:12 harlyh

> Having completed a couple of 1000-epoch training runs this week

If you're using Applio version 3.4.0+, you're probably spending 10x more time than needed for a model to sound good.

AznamirWoW avatar Dec 09 '25 09:12 AznamirWoW