The clones sound Indian.
I've noticed that when I clone non-native voices, I get Indian accents, really heavy ones. How can I prevent that?
import torchaudio as ta
from chatterbox.tts import ChatterboxTTS
model = ChatterboxTTS.from_pretrained(device="cuda")
text = "Hello!! My name is Jack, I'm 23 and I study Arts. My number is 1000000812"
# If you want to synthesize with a different voice, specify the audio prompt
AUDIO_PROMPT_PATH = "myvoice.wav"
wav = model.generate(text, audio_prompt_path=AUDIO_PROMPT_PATH, cfg_weight=0.3)
ta.save("output1.wav", wav, model.sr)
Yes, I can confirm that a lot of the generated TTS audio comes out with an Indian accent for some reason. Some other outputs get a strong British accent. It seems there is a setting we are missing, or something else is going on.
I'm getting an American accent (using an Australian voice).
Hey folks, would you be open to sharing the reference clips of the voices you're trying to clone? This will help us improve the model.
What I can tell you is that with fast-paced Egyptian Arabic voices, you're very likely to get Indian-accented output.
I'm getting an Australian or South African accent no matter which reference audio I use. This is quite unusable.
Can you share the reference audio file, @mbroonk? That would actually help us look into it. Thanks!
Here's one I have access to now; I can add more later: https://filebin.net/hg46jnpq39e9jw7o
I should also say that it gets everything except the accent spot on: the voice is exceptionally similar, especially with exaggeration=0.6.
@TediPapajorgji Here's the audio reference (sasa.wav) and output (twtwtw.wav) https://limewire.com/d/ySVOF#70pY5tKa1K
@TediPapajorgji https://freesound.org/people/su1c1d0/sounds/531689/ The cloned voice doesn't even sound like the original. Also, the voice would sound American in one sentence and British in another.
Demos look awesome. However, I tried zero-shot cloning of an Irish female voice three times: the first time, with 0.5 exaggeration, I got a heavy British accent; the second time, with 1.0 exaggeration, a heavy Indian accent; the third time, with 0.25 exaggeration, the output most resembled the voice but had an American accent. None of them really match the reference voice, let alone the accent. Any suggestions for how we can fine-tune this with LoRA?
Thanks @Saran33 - improving accent capture for zero-shot cloning is on the immediate roadmap for us! Stay tuned.
I can confirm that when I used Turkish voices as the source, the output had an extreme Indian accent. This is very unfortunate, but I believe it can be handled with more data. I also wish processing were faster: on a 3090, it takes too long to generate the result for what counts as a short text. Still, as an open source project, this deserves the most love from me. Great work, needs improvement in some areas... Respect!
I've run into the same issue, and here's what worked for me: First, generate a WAV file with the cloned voice already speaking in the accent you want. That becomes your reference audio for future TTS generations.
To get that initial file, use a short greeting or monologue written in the style and tone of someone with the desired accent. Here's what I mean:
For an Indian accent:
Hi, I’m Aarav Sharma! I’m from Mumbai, India – born and raised in a city that never sleeps...
For a British accent:
Hello there, I'm James Whitmore. Born and bred in the heart of Oxfordshire, I’ve been told my accent could narrate documentaries...
For an American accent:
Hey there, I’m Jake Miller. I grew up just outside of Chicago, so yeah—you might catch a bit of that Midwestern twang...
Once you generate the right-sounding WAV file using one of these intros, feed that file back into your system as the reference voice for future TTS generations. This helps “lock in” the correct accent early on and prevents the model from defaulting to a heavy Indian accent when cloning non-native voices.
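The two steps above can be sketched roughly as follows. This is only a minimal sketch, not an official recipe: it reuses the `generate` / `ta.save` calls from the snippet at the top of the thread, while the helper names (`make_accent_anchor`, `synthesize_with_anchor`, `main`), the file names, and the shortened intro text are all made up for illustration. The model and the save function are passed in, so the helpers themselves don't depend on a GPU.

```python
from pathlib import Path


def make_accent_anchor(model, save, reference_wav, intro_text, anchor_wav, **gen_kwargs):
    """Step 1: speak an accent-flavoured intro in the cloned voice and save it."""
    wav = model.generate(intro_text, audio_prompt_path=str(reference_wav), **gen_kwargs)
    save(str(anchor_wav), wav, model.sr)
    return Path(anchor_wav)


def synthesize_with_anchor(model, save, anchor_wav, text, out_wav, **gen_kwargs):
    """Step 2: reuse the saved anchor clip as the reference for later generations."""
    wav = model.generate(text, audio_prompt_path=str(anchor_wav), **gen_kwargs)
    save(str(out_wav), wav, model.sr)
    return Path(out_wav)


def main():
    # Heavy imports live here so the helpers above can be imported without a GPU.
    import torchaudio as ta
    from chatterbox.tts import ChatterboxTTS

    model = ChatterboxTTS.from_pretrained(device="cuda")

    # Hypothetical intro, shortened from the American example above;
    # write out a full greeting of your own in practice.
    intro = "Hey there, I'm Jake Miller. I grew up just outside of Chicago..."

    # Hypothetical file names. Listen to the anchor and regenerate it
    # until the accent sounds right before relying on it.
    anchor = make_accent_anchor(
        model, ta.save, "myvoice.wav", intro, "anchor_american.wav", cfg_weight=0.3
    )
    synthesize_with_anchor(
        model, ta.save, anchor, "Hello!! My name is Jack.", "output_american.wav", cfg_weight=0.3
    )
```

Call `main()` on a machine with chatterbox installed. Extra keyword arguments such as `cfg_weight` or `exaggeration` are forwarded to `generate` unchanged, so you can keep whatever settings worked best for the voice itself.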
Hope that helps!