RealtimeTTS

StyleTTSEngine language in phonemizer is hardcoded to English (en-us)

Open kohai-channel opened this issue 7 months ago • 3 comments

Hi, I noticed that in StyleTTSEngine, the language used in the phonemizer is hardcoded to English (en-us), which prevents using the engine with models trained for other languages.

Specifically, this line in the code:

# Initialize phonemizer
self.global_phonemizer = phonemizer.backend.EspeakBackend(language='en-us', 
                                                          preserve_punctuation=True,
                                                          with_stress=True)

I have a fine-tuned model that speaks Spanish, but I can't use it properly with StyleTTSEngine because of this limitation. It would be great if the phonemizer language could be configurable or inferred from the model's settings.
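One way the request above could look in code is sketched below. It is only an illustration, not the engine's actual API: the helper names `espeak_backend_kwargs` and `make_phonemizer` are hypothetical, and the sketch assumes the `phonemizer` package and an espeak installation are available when the backend is actually built.

```python
from typing import Any, Dict


def espeak_backend_kwargs(language: str = "en-us") -> Dict[str, Any]:
    """Build the keyword arguments for EspeakBackend (hypothetical helper).

    The language is a parameter instead of a hardcoded literal; the other
    two settings mirror the values quoted in the issue.
    """
    return {
        "language": language,
        "preserve_punctuation": True,
        "with_stress": True,
    }


def make_phonemizer(language: str = "en-us"):
    """Create an espeak backend for the given language (hypothetical helper)."""
    # Imported lazily so espeak_backend_kwargs() stays usable on machines
    # where phonemizer or espeak is not installed.
    from phonemizer.backend import EspeakBackend

    return EspeakBackend(**espeak_backend_kwargs(language))
```

With a helper like this, a Spanish model would call `make_phonemizer("es")`, while existing English setups keep working through the `en-us` default.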

Thanks!

kohai-channel • May 25 '25 18:05

I'll take care of this in the next release. Do you by any chance know of good resources on how to fine-tune for another language? I tried German but have failed so far...

KoljaB • May 25 '25 18:05

@KoljaB I am using StyleTTS2 with the RealtimeTTS engine for my native language, Hindi, with a model I fine-tuned on my own data. I am currently facing an issue with the language-switch flags in the phonemizer output:

( hi ) nəmˈʌsteː dˈʊnɪjˌaː ( en-us ) , ( hi ) sʈˌaːɪlʈiʈiˈeːs ˈɔɖɪjˌoː kaː pəɾˈiːkʃəɳ kˈɪjaː ɟˈaː ɾˌəhaː hɛː ( en-us )

I tried to remove them this way:

self.global_phonemizer = phonemizer.backend.EspeakBackend(
    language='hi',
    preserve_punctuation=True,
    with_stress=False,
    language_switch='remove-flags',  # This should remove the flags
    words_mismatch='ignore',
    # Add these additional settings to suppress warnings
    punctuation_marks=';:,.!?¡¿—…"«»""',
    strip=True
)

Even with language_switch='remove-flags' set, it still synthesizes this way. What am I doing wrong? I have also created a StyleTTS2 TTS service class for pipecat, backed by the RealtimeTTS engine. Full logs:

2025-05-29 11:36:16.996 | INFO     | __main__:start:159 - StyleTTSService#0: TextToAudioStream created successfully
2025-05-29 11:36:16.996 | INFO     | __main__:start:166 - StyleTTSService#0: StyleTTS initialization completed successfully in 8.82s
2025-05-29 11:36:16.996 | INFO     | __main__:run_tts:226 - StyleTTSService#0: Starting TTS generation for: [नमस्ते दुनिया, स्टाइलटीटीएस ऑडियो का परीक्षण किया ...]
2025-05-29 11:36:16.996 | DEBUG    | __main__:run_tts:241 - StyleTTSService#0: Processing text: [नमस्ते दुनिया, स्टाइलटीटीएस ऑडियो का परीक्षण किया जा रहा है]
2025-05-29 11:36:16.996 | DEBUG    | __main__:run_tts:249 - StyleTTSService#0: Starting audio streaming...
2025-05-29 11:36:16.996 | DEBUG    | __main__:_stream_audio_realtime:269 - StyleTTSService#0: Setting up 200ms buffered audio streaming for text: [नमस्ते दुनिया, स्टाइलटीटीएस ऑड...]
2025-05-29 11:36:16.996 | INFO     | __main__:run_synthesis:308 - StyleTTSService#0: Starting synthesis thread for text: [नमस्ते दुनिया, स्टाइलटीटीएस ऑड...]
⚡ synthesizing → 'नमस्ते दुनिया, स्टाइलटीटीएस ऑडियो का परीक्षण किया जा रहा है'
WARNING:phonemizer:2 utterances containing language switches on lines 1, 2
WARNING:phonemizer:extra phones may appear in the "en-us" phoneset
WARNING:phonemizer:language switch flags have been kept (applying "keep-flags" policy)
( hi ) nəmˈʌsteː dˈʊnɪjˌaː ( en-us ) , ( hi ) sʈˌaːɪlʈiʈiˈeːs ˈɔɖɪjˌoː kaː pəɾˈiːkʃəɳ kˈɪjaː ɟˈaː ɾˌəhaː hɛː ( en-us )

New Padding length bert_dur_2: 109
SYNTHESIS FINISHED
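The warning in the log reports the "keep-flags" policy, which suggests that the backend that actually ran was not the one constructed with language_switch='remove-flags'. Independently of that, the flags can be stripped from the phonemized string afterwards. The following is a minimal sketch of that workaround: `strip_language_flags` is a hypothetical helper, not part of phonemizer or RealtimeTTS, and it assumes flags always look like a short lowercase language code in parentheses, as in the output above.

```python
import re

# Matches flags such as "( hi )", "(en-us)" or "( en-us )".
# Assumption: a flag is a 2-3 letter lowercase code with optional
# "-suffix" parts, wrapped in parentheses.
_LANG_FLAG = re.compile(r"\(\s*[a-z]{2,3}(?:-[a-z0-9]+)*\s*\)")


def strip_language_flags(phonemes: str) -> str:
    """Remove language-switch flags from a phonemized string."""
    without_flags = _LANG_FLAG.sub(" ", phonemes)
    # Collapse the whitespace left behind by the removed flags.
    collapsed = re.sub(r"\s+", " ", without_flags).strip()
    # Re-attach punctuation that is now preceded by a stray space.
    return re.sub(r"\s+([,.;:!?])", r"\1", collapsed)


if __name__ == "__main__":
    raw = "( hi ) nəmˈʌsteː dˈʊnɪjˌaː ( en-us ) , ( hi ) kaː hɛː ( en-us )"
    print(strip_language_flags(raw))
```

This only cleans the text. It does not change which phonemes espeak produced, so it is no substitute for making sure the intended backend settings are the ones in use.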

sachin7695 • May 29 '25 11:05

I'm sorry, I have no time for anything at the moment. It may take a while until I can look into this.

KoljaB • May 29 '25 11:05