
Sound Glitches During Realtime Synthesis – Audio Stops and Starts Abruptly

Open · Panther465 opened this issue 9 months ago · 13 comments

Hello Community,

When using the CoquiEngine in RealtimeTTS for realtime text-to-speech, the synthesized audio is choppy. The sound intermittently stops between words or segments, resulting in a disruptive playback experience.

Steps to Reproduce:

  1. Use a realtime TTS script (see below) that feeds text either in small chunks or via a generator.
  2. Run the script on a system with GPU support (e.g., RTX 4060 with CUDA enabled).
  3. Observe that during playback, the audio “comes and goes” with noticeable gaps between words or sentences.

Expected Behavior: The synthesized speech should be continuous and smooth, without abrupt pauses or intermittent glitches.

Actual Behavior: The playback is discontinuous—audio frequently stops and then resumes, causing a choppy, glitchy experience.

This is the code I am using:


import os
import time
import torch
from RealtimeTTS import TextToAudioStream, CoquiEngine

def realtime_text_generator():
    texts = [
        "Hello, this is real-time TTS speaking. ",
        "Every sentence is synthesized as soon as it is ready. ",
        "The voice is generated using a local, neural cloned model. "
    ]
    for text in texts:
        yield text
        time.sleep(0.1)  # simulate continuous input with a short delay

if __name__ == "__main__":
    device = "cuda" if torch.cuda.is_available() else "cpu"
    print(f"Using device: {device}")

    # Optionally, specify custom model parameters via environment variables.
    custom_model_path = os.getenv("CUSTOM_COQUI_MODEL_PATH", None)
    custom_model_name = os.getenv("CUSTOM_COQUI_MODEL_NAME", None)

    if custom_model_path:
        print(f"Using custom model from: {custom_model_path}")
        engine = CoquiEngine(
            local_models_path=custom_model_path,
            specific_model=custom_model_name,
            full_sentences=True
        )
    else:
        print("Using default model settings.")
        engine = CoquiEngine()

    stream = TextToAudioStream(engine)
    print("Starting realtime TTS streaming...")
    stream.feed(realtime_text_generator()).play(log_synthesized_text=True)

    while stream.is_playing():
        time.sleep(0.05)

    print("Playback finished.")
    engine.shutdown()

Could someone please help me solve this issue?

Panther465 · Mar 04 '25

Code looks good. This is probably the GPU not being fast enough to synthesize without stuttering. Please try installing DeepSpeed.

Windows:

   pip install torch==2.1.2+cu121 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu121
   pip install https://github.com/daswer123/deepspeed-windows-wheels/releases/download/11.2/deepspeed-0.11.2+cuda121-cp310-cp310-win_amd64.whl

Linux:

   pip install deepspeed

and then use CoquiEngine with the parameter use_deepspeed set to True. This will speed up synthesis around 2x and will probably solve the problem.
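
For reference, a minimal sketch of what that looks like; only the engine construction changes compared to the script above, and use_deepspeed is the parameter listed in CoquiEngine's signature:

from RealtimeTTS import TextToAudioStream, CoquiEngine

# use_deepspeed=True switches XTTS inference to DeepSpeed, roughly doubling
# synthesis speed so the audio buffer does not run dry between chunks.
engine = CoquiEngine(use_deepspeed=True)
stream = TextToAudioStream(engine)
stream.feed("Testing DeepSpeed-accelerated synthesis.").play(log_synthesized_text=True)
engine.shutdown()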

KoljaB · Mar 04 '25

Hey, thanks, it worked for me. I am making an offline AI assistant using Ollama, faster_whisper and RealtimeTTS. My hardware is: CPU: i7-14700HX, GPU: RTX 4060, RAM: 32 GB.

So is my hardware not enough to build a realtime AI assistant?

If you have any suggestions for achieving my goal, please let me know. I am new in this field. 😊

Panther465 · Mar 04 '25

The 4060 is a bit tight, mostly due to its 8 GB of VRAM, which has to carry the STT, TTS and LLM models. I'd suggest using KokoroEngine if possible with your language, because it needs less VRAM, and faster_whisper with a smaller model (like medium instead of large-v2).
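
For what it's worth, a minimal sketch of that lower-VRAM combination; it assumes KokoroEngine works with its default constructor and that faster-whisper is installed separately, so treat the exact arguments as a starting point rather than a recipe:

from faster_whisper import WhisperModel
from RealtimeTTS import TextToAudioStream, KokoroEngine

# Smaller STT model ("medium" instead of "large-v2") leaves more of the 8 GB
# of VRAM for the LLM; compute_type="float16" keeps the memory footprint down.
stt_model = WhisperModel("medium", device="cuda", compute_type="float16")

# KokoroEngine needs far less VRAM than CoquiEngine (but has no voice cloning).
engine = KokoroEngine()  # assumption: defaults pick a built-in English voice
stream = TextToAudioStream(engine)
stream.feed("Hello from a lighter TTS engine.").play()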

KoljaB · Mar 04 '25

Ok, let me try. Can I use voice cloning in KokoroEngine?

Panther465 · Mar 04 '25

No, KokoroEngine does not support voice cloning. That's only possible with CoquiEngine and StyleTTSEngine. StyleTTSEngine is a quite good low-VRAM alternative to CoquiEngine too, though it does not save that much VRAM compared to CoquiEngine and is harder to install. Both have their own pros and cons.

KoljaB · Mar 04 '25

Hey, your RealtimeTTS also supports StyleTTS, so I should try it. Can you suggest proper StyleTTS installation steps so I can use it, and also a good model to use with it? I saw your example video of StyleTTS where you are using the Nicole model. Where can I download that? Is it available? Because I am a newcomer, I don't know how to train my own voice model.

It will be very helpful 😊

Panther465 · Mar 04 '25

I suggest just following the original StyleTTS2 README to install it. The eSpeak NG installation is a bit tricky; you can download the espeak-ng.msi installer here. It should create a folder at "C:\Program Files\eSpeak NG" with the exe and libespeak-ng.dll files in it. That folder should be in the %PATH% environment variable; you might need to add it there manually. If you run into problems, just post them here and I'll try to help.
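
If editing %PATH% by hand is a hassle, here is a minimal sketch of doing it from the script instead, assuming the default install location mentioned above:

import os

# Default eSpeak NG install directory (adjust if you installed it elsewhere).
espeak_dir = r"C:\Program Files\eSpeak NG"
if espeak_dir not in os.environ.get("PATH", ""):
    # Prepend so espeak-ng.exe and libespeak-ng.dll are found; this must run
    # before importing the packages that load eSpeak NG.
    os.environ["PATH"] = espeak_dir + os.pathsep + os.environ.get("PATH", "")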

For voice cloning, the LibriTTS model mentioned in the StyleTTS2 repo is probably better suited than any model that was trained on a specific voice. I just uploaded the Nicole model here so you can try that one too.

KoljaB · Mar 04 '25

Hey, I was trying to use StyleTTS for my project but it was giving an error, so I tried Coqui TTS first. This is the code I am using:

import os
import time
import torch
from RealtimeTTS import TextToAudioStream, CoquiEngine

def combined_realtime_text_generator():
    """
    Instead of yielding very short segments, this generator accumulates
    text for a short duration (e.g., 0.3 seconds) and then yields the combined
    text. This helps maintain continuous audio without abrupt gaps.
    """
    texts = [
        "Hello, this is real-time TTS speaking. ",
        "Every sentence is synthesized as soon as it is ready. ",
        "The voice is generated using a local, neural cloned model. "
    ]
    combined = ""
    for text in texts:
        combined += text
        time.sleep(0.1)  # accumulate text segments (adjust delay as needed)
    yield combined

if __name__ == "__main__":
    # Check for CUDA support
    device = "cuda" if torch.cuda.is_available() else "cpu"
    print(f"Using device: {device}")

    # Optionally, allow the use of a custom model via environment variables:
    #   CUSTOM_COQUI_MODEL_PATH and CUSTOM_COQUI_MODEL_NAME.
    custom_model_path = os.getenv("CUSTOM_COQUI_MODEL_PATH")
    custom_model_name = os.getenv("CUSTOM_COQUI_MODEL_NAME")

    if custom_model_path:
        print(f"Using custom model from: {custom_model_path}")
        engine = CoquiEngine(
            local_models_path=custom_model_path,
            specific_model=custom_model_name,  # Set to None if only one model is present.
            full_sentences=True  # Helps reduce stuttering.
        )
    else:
        # Use the specified custom model: tts_models/multilingual/multi-dataset/xtts_v2
        default_model_path = os.path.expanduser(r"C:\Users\Stevenom\AppData\Local\tts\tts_models--multilingual--multi-dataset--xtts_v2")
        print("Using default model: tts_models/multilingual/multi-dataset/xtts_v2")
        engine = CoquiEngine(
            local_models_path=default_model_path,
            specific_model=None,  # Only one model in the directory.
            full_sentences=True
        )

    # Create the realtime text-to-audio stream.
    stream = TextToAudioStream(engine)

    print("Starting realtime TTS streaming...")
    # Feed the combined text from our generator to produce continuous speech.
    stream.feed(combined_realtime_text_generator()).play(log_synthesized_text=True)

    # Wait until playback completes.
    while stream.is_playing():
        time.sleep(0.05)

    print("Playback finished.")
    engine.shutdown()

but I also want to use voice cloning here. When I edit this code and add voice cloning, it gives this error:

instance = super().__call__(*args, **kwargs)
TypeError: CoquiEngine.__init__() got an unexpected keyword argument 'voice_clone_reference'

Can you tell me how I can use voice cloning in my code?

Panther465 · Mar 05 '25

I tested your script like that. I added voice cloning and some info, and deleted the custom path because I had no idea why you added it. And it works.

import os
import time
import torch
import RealtimeTTS

def combined_realtime_text_generator():
    """
    Instead of yielding very short segments, this generator accumulates
    text for a short duration (e.g., 0.3 seconds) and then yields the combined
    text. This helps maintain continuous audio without abrupt gaps.
    """
    texts = [
        "Hello, this is real-time TTS speaking. ",
        "Every sentence is synthesized as soon as it is ready. ",
        "The voice is generated using a local, neural cloned model. "
    ]
    combined = ""
    for text in texts:
        combined += text
        time.sleep(0.1)  # accumulate text segments (adjust delay as needed)
    yield combined

if __name__ == "__main__":
    # Check for CUDA support
    device = "cuda" if torch.cuda.is_available() else "cpu"
    print(f"Using device: {device}")



    # Create a "voices" folder in the same folder as this script and put a .wav audio file of the voice you want to clone in it.
    # You need a 10 to 30 second sample; a 44100 Hz or 22050 Hz mono 32-bit float WAV file gives the best results.
    # The first time you use a new voice sample, Coqui will generate a new file named YOUR_VOICE_SAMPLE_NAME.json in the voices folder.
    stream = RealtimeTTS.TextToAudioStream(RealtimeTTS.CoquiEngine(language="en", voice="./voices/[YOUR_VOICE_SAMPLE_NAME.wav]"))

    print("Starting realtime TTS streaming...")
    # Feed the combined text from our generator to produce continuous speech.
    stream.feed(combined_realtime_text_generator()).play(log_synthesized_text=True)

    # Wait until playback completes.
    while stream.is_playing():
        time.sleep(0.05)

    print("Playback finished.")

Nenesh · Mar 06 '25

Thanks for replying. I tried your code and this error is coming:

Error: Weights only load failed. This file can still be loaded, to do so you have two options, do those steps only if you trust the source of the checkpoint. (1) In PyTorch 2.6, we changed the default value of the weights_only argument in torch.load from False to True. Re-running torch.load with weights_only set to False will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source. (2) Alternatively, to load with weights_only=True please check the recommended steps in the following error message. WeightsUnpickler error: Unsupported global: GLOBAL TTS.tts.models.xtts.XttsAudioConfig was not an allowed global by default. Please use torch.serialization.add_safe_globals([XttsAudioConfig]) or the torch.serialization.safe_globals([XttsAudioConfig]) context manager to allowlist this global if you trust this class/function.

I tried to change the version of PyTorch but it didn't work. Please help me solve this. I am new to these things so I don't know much about it. 🫠

And you were asking why I added the custom model path: it's because I want to use "xtts_v2".

Panther465 · Mar 06 '25

What version of torch are you using? I'm using 2.1.2+cu118 and don't have this problem.

Edit: https://github.com/suno-ai/bark/pull/619 => os.environ['TORCH_FORCE_NO_WEIGHTS_ONLY_LOAD'] = '1'
Adding this should fix the problem (you will get a warning message instead of an error).
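
For completeness, a minimal sketch of both workarounds; either one is enough on its own, the allowlist variant is the one the PyTorch error message itself recommends, and the import path for XttsAudioConfig is taken from that message:

import os

# Option 1: opt back out of PyTorch 2.6's weights_only default. Set this before
# the engine loads the model; you get a warning instead of an error.
os.environ['TORCH_FORCE_NO_WEIGHTS_ONLY_LOAD'] = '1'

# Option 2: allowlist the Coqui config class the unpickler complains about.
import torch
from TTS.tts.models.xtts import XttsAudioConfig  # path taken from the error message
torch.serialization.add_safe_globals([XttsAudioConfig])

from RealtimeTTS import TextToAudioStream, CoquiEngine  # then build the engine as usual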

CoquiEngine already uses xtts_v2 as its default model. These are all the default parameters of CoquiEngine (you can find them in coqui_engine.py: right-click on CoquiEngine and choose "Go to Definition"):

class CoquiEngine(BaseEngine):
    def __init__(
        self,
        model_name="tts_models/multilingual/multi-dataset/xtts_v2",
        specific_model="v2.0.2",
        local_models_path=None,
        voices_path=None,
        voice: Union[str, List[str]] = "",
        language="en",
        speed=1.0,
        thread_count=6,
        stream_chunk_size=20,
        overlap_wav_len=1024,
        temperature=0.85,
        length_penalty=1.0,
        repetition_penalty=7.0,
        top_k=50,
        top_p=0.85,
        enable_text_splitting=True,
        full_sentences=False,
        level=logging.WARNING,
        use_deepspeed=False,
        device: str = None,
        prepare_text_for_synthesis_callback=None,
        add_sentence_filter=False,
        pretrained=False,
        comma_silence_duration=0.3,
        sentence_silence_duration=0.6,
        default_silence_duration=0.3,
        print_realtime_factor=False,
        load_balancing=False,
        load_balancing_buffer_length=0,
        load_balancing_cut_off=0,
    )
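
So the cloning sample just goes through the voice parameter shown above. A hedged example combining it with the speed-related options (the sample path is a placeholder):

from RealtimeTTS import TextToAudioStream, CoquiEngine

# Placeholder path: point voice at your own reference .wav for cloning.
engine = CoquiEngine(
    voice="./voices/my_sample.wav",
    language="en",
    full_sentences=True,       # synthesize whole sentences to reduce stutter
    use_deepspeed=True,        # only if deepspeed is installed
    stream_chunk_size=20,
)
stream = TextToAudioStream(engine)
stream.feed("Cloned voice test.").play()
engine.shutdown()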

Nenesh · Mar 06 '25

Thanks bro, it works for me now ❤️

Hey, I also want to add the Hindi language: when the LLM generates a response in Hindi, it should give speech output in Hindi. How can I do that?

Panther465 · Mar 07 '25

You want it to speak two languages? English when the response is generated in English and Hindi when it is generated in Hindi? Problem: if you give an English or French voice as the sample, the voice can't speak Hindi (French, German and English work great with, for example, just a French or English sample, but not Hindi).

So this is what I think:

  • Use "langdetect" to detect the language of the LLM response.
  • Pass the language parameter to CoquiEngine (so CoquiEngine "knows" what language to "speak").
  • Use the language code (fr, en, hi, etc.) to pass a different voice sample path to CoquiEngine.

If you want the same voice for different languages, you need samples of the same person/voice speaking in each language.

I just put this code together really quickly so you can see what I'm talking about:

import os
import time
import torch
import RealtimeTTS
from langdetect import detect

llm_language = ""
os.environ['TORCH_FORCE_NO_WEIGHTS_ONLY_LOAD'] = '1'
def combined_realtime_text_generator():
    """
    Instead of yielding very short segments, this generator accumulates
    text for a short duration (e.g., 0.3 seconds) and then yields the combined
    text. This helps maintain continuous audio without abrupt gaps.
    """
    global llm_language
    texts = [
        "Hello, this is real-time TTS speaking. ",
        "Every sentence is synthesized as soon as it is ready. ",
        "The voice is generated using a local, neural cloned model. "

    ]
    combined = ""
    for text in texts:
        combined += text
        time.sleep(0.1)
    

    try:
        llm_language = detect(combined)
        print(f"Detected Language: {llm_language}")
    except:
        print("Impossible to detect language, en by default")
        llm_language = "en"
    
    yield combined

def get_voice_path(language_code):
    """
    Return the path of the appropriate voice/language based on language code
    """
    # Put a voice sample for each language in the voices folder. These languages are just for testing; delete the ones you don't want.
    voice_mapping = {
        "en": "./voices/en_sample.wav",
        "fr": "./voices/fr_sample.wav",
        "de": "./voices/de_sample.wav",
        "hi": "./voices/hi_sample.wav"
    }
    
    # Return the path of the appropriate voice/language, "en" by default
    return voice_mapping.get(language_code, "./voices/en_sample.wav")

if __name__ == "__main__":
    # Check for CUDA support
    device = "cuda" if torch.cuda.is_available() else "cpu"
    print(f"Using device: {device}")

    # Create a "voices" folder in the same folder as this script and put a .wav audio file of the voice you want to clone in it.
    # You need a 10 to 30 second sample; a 44100 Hz or 22050 Hz mono 32-bit float WAV file gives the best results.
    # The first time you use a new voice sample, Coqui will generate a new file named YOUR_VOICE_SAMPLE_NAME.json in the voices folder.
    generator = combined_realtime_text_generator()
    first_text = next(generator) 
    
    # Select voice based on language
    voice_path = get_voice_path(llm_language)
    print(f"Using voice file: {voice_path}")
    
    stream = RealtimeTTS.TextToAudioStream(RealtimeTTS.CoquiEngine(language=llm_language, voice=voice_path))
    
    print("Starting realtime TTS streaming...")
    stream.feed([first_text]).play(log_synthesized_text=True)
    
    while stream.is_playing():
        time.sleep(0.05)
        
    print("Playback finished.")

Nenesh · Mar 07 '25