
The clones sound Indian.

Open Maxdha opened this issue 6 months ago • 13 comments

I've noticed that when I clone non-native voices, I get Indian accents, and really heavy ones at that. How can I prevent that?

import torchaudio as ta
from chatterbox.tts import ChatterboxTTS

model = ChatterboxTTS.from_pretrained(device="cuda")

text = "Hello!! My name is Jack, I'm 23 and I study Arts. My number is 1000000812"

# If you want to synthesize with a different voice, specify the audio prompt
AUDIO_PROMPT_PATH = "myvoice.wav"
wav = model.generate(text, audio_prompt_path=AUDIO_PROMPT_PATH, cfg_weight=0.3)
ta.save("output1.wav", wav, model.sr)

Maxdha avatar May 29 '25 10:05 Maxdha

Yes, I can confirm that a lot of the generated audio comes out with an Indian accent for some reason. Some other outputs get a strong British accent. It seems there is a setting we are missing, or something else is going on.

ksomml avatar May 29 '25 12:05 ksomml

I'm getting an American accent (using an Australian voice).

lochstar avatar May 29 '25 23:05 lochstar

Hey folks, would you be open to sharing the reference clips of the voices you're trying to clone? This will help us improve the model.

TediPapajorgji avatar May 30 '25 09:05 TediPapajorgji

What I can tell you is that with fast-paced Egyptian Arabic voices, you're very likely to get Indian-accented output.

Maxdha avatar May 30 '25 09:05 Maxdha

I'm getting an Australian or South African accent no matter which reference audio I use, which makes this quite unusable.

mbroonk avatar May 30 '25 17:05 mbroonk

@mbroonk Can you share the reference audio file? That would actually help us look into it. Thanks!

TediPapajorgji avatar May 30 '25 18:05 TediPapajorgji

Here's one I have access to now; I can add more later: https://filebin.net/hg46jnpq39e9jw7o

I should also say that it gets everything except the accent spot on. The voice is exceptionally similar, especially with exaggeration=0.6.

mbroonk avatar May 30 '25 18:05 mbroonk

@TediPapajorgji Here's the audio reference (sasa.wav) and output (twtwtw.wav) https://limewire.com/d/ySVOF#70pY5tKa1K

Maxdha avatar May 30 '25 18:05 Maxdha

@TediPapajorgji https://freesound.org/people/su1c1d0/sounds/531689/ The cloned voice doesn't even sound like the original. Also, the voice would sound American in one sentence and British in another.

peterhoang avatar May 30 '25 20:05 peterhoang

The demos look awesome. However, I tried zero-shot cloning of an Irish female voice three times: with exaggeration 0.5 I got a heavy British accent; with 1.0, a heavy Indian accent; with 0.25, the result most resembled the voice but had an American accent. None of them really match the reference voice, let alone the accent. Any suggestions for how we can fine-tune this with LoRA?
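
Trial-and-error like the three attempts above is easier to compare with a small sweep that writes one file per setting. This is only a sketch: `sweep_settings`, the `save_audio` argument, and the output file names are made up for illustration, and the grid values are just the ones mentioned in this thread. It assumes `model.generate` accepts `exaggeration` and `cfg_weight`, as used earlier in the thread.

```python
from itertools import product


def sweep_settings(model, text, voice_path, save_audio,
                   exaggerations=(0.25, 0.5, 1.0), cfg_weights=(0.3, 0.5)):
    """Generate one clip per (exaggeration, cfg_weight) pair for listening tests.

    `model` is expected to behave like ChatterboxTTS (a `generate` method and
    an `sr` attribute); `save_audio` is a callable such as torchaudio.save.
    Both are passed in so the helper itself has no hard dependencies.
    """
    outputs = []
    for exaggeration, cfg_weight in product(exaggerations, cfg_weights):
        wav = model.generate(
            text,
            audio_prompt_path=voice_path,
            exaggeration=exaggeration,
            cfg_weight=cfg_weight,
        )
        # Hypothetical naming scheme so each clip is traceable to its settings.
        out_path = f"sweep_exag{exaggeration}_cfg{cfg_weight}.wav"
        save_audio(out_path, wav, model.sr)
        outputs.append((exaggeration, cfg_weight, out_path))
    return outputs
```

With the real library this would be called as `sweep_settings(model, text, "myvoice.wav", ta.save)`, where `model` and `ta` are set up as in the first post. It only helps you hear which setting drifts least; it does not fix the accent by itself.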

Saran33 avatar May 30 '25 20:05 Saran33

Thanks @Saran33 - improving accent capture for zero-shot cloning is on the immediate roadmap for us! Stay tuned.

TediPapajorgji avatar May 30 '25 21:05 TediPapajorgji

I can confirm that when I used Turkish voices as the source, the output had an extreme Indian accent. This is unfortunate, but I believe it can be handled with more data. I also wish processing were faster: on a 3090 it takes too long to generate what counts as a small text. Still, as an open-source project this deserves the most love from me. Great work that needs improvement in some areas. Respect!

FlowDownTheRiver avatar Jun 19 '25 01:06 FlowDownTheRiver

I've run into the same issue, and here's what worked for me: First, generate a WAV file with the cloned voice already speaking in the accent you want. That becomes your reference audio for future TTS generations.

To get that initial file, use a short greeting or monologue written in the style and tone of someone with the desired accent. Here's what I mean:

For an Indian accent:

Hi, I’m Aarav Sharma! I’m from Mumbai, India – born and raised in a city that never sleeps...

For a British accent:

Hello there, I'm James Whitmore. Born and bred in the heart of Oxfordshire, I’ve been told my accent could narrate documentaries...

For an American accent:

Hey there, I’m Jake Miller. I grew up just outside of Chicago, so yeah—you might catch a bit of that Midwestern twang...

Once you generate the right-sounding WAV file using one of these intros, feed that file back into your system as the reference voice for future TTS generations. This helps “lock in” the correct accent early on and prevents the model from defaulting to a heavy Indian accent when cloning non-native voices.
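
In code, the two steps look roughly like this. It's a sketch, not anything official: the helper names, the `save_audio` argument, and the file names are my own inventions, and the only library calls are the ones from the first post (`ChatterboxTTS.from_pretrained`, `model.generate`, `ta.save`).

```python
def bootstrap_accent_reference(model, original_voice_path, intro_text,
                               save_audio, reference_out="accent_reference.wav"):
    """Step 1: clone the voice while it speaks an accent-priming intro.

    The saved clip is meant to replace the original recording as the
    reference audio. You still have to listen to it and regenerate
    until the accent sounds right.
    """
    wav = model.generate(intro_text, audio_prompt_path=original_voice_path)
    save_audio(reference_out, wav, model.sr)
    return reference_out


def generate_with_reference(model, text, reference_path, save_audio, out_path):
    """Step 2: synthesize the real text using the bootstrapped reference."""
    wav = model.generate(text, audio_prompt_path=reference_path)
    save_audio(out_path, wav, model.sr)
    return out_path


def run_with_chatterbox():
    """Example wiring with the real library (needs chatterbox and a GPU)."""
    import torchaudio as ta
    from chatterbox.tts import ChatterboxTTS

    model = ChatterboxTTS.from_pretrained(device="cuda")
    intro = "Hello there, I'm James Whitmore. Born and bred in the heart of Oxfordshire."
    reference = bootstrap_accent_reference(model, "myvoice.wav", intro, ta.save)
    generate_with_reference(model, "Hello!! My name is Jack.", reference,
                            ta.save, "output_british.wav")
```

The model and the save function are passed in so you can try the flow with a stub before spending GPU time. `run_with_chatterbox()` is just one way to wire it up.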

Hope that helps!

OleStauning avatar Jul 10 '25 14:07 OleStauning