
coqui engine is unusable

Open FuatW opened this issue 1 year ago • 4 comments

I've got a problem I've been trying to figure out for 2 weeks now and cannot get it to work.

The Coqui Engine is basically unusable. Time for synthesis takes more than 30 seconds per sentence. I've got all the dependencies installed, as well as Cuda.

engine = CoquiEngine(
    device="cuda",
    language="de",
    level=logging.INFO,
    local_models_path=r"C:\Users\Fuat\Desktop\Realtime SST\cacheCustom"
)
engine.set_voice("Damien Black")

Also, I've tried switching to a different model, but the engine throws an error as soon as you set model_name or specific_model to anything other than xtts2...

My PC specs are:

CPU: AMD Ryzen 5 5600X 6-Core Processor, 3.70 GHz
GPU: RTX 3060 12GB
RAM: 32 GB

I'm pretty lost on this, so any help would be appreciated!

FuatW avatar Dec 24 '24 09:12 FuatW

"Coqui engine is unusable" sounds a bit harsh. Your hardware should be more than enough to synthesize a sentence in a few seconds. My guess? You've installed CUDA but didn’t configure PyTorch to actually use it. Check the instructions here:
https://github.com/KoljaB/RealtimeTTS?tab=readme-ov-file#cuda-installation

Run this and let me know what it says:

import torch
print("CUDA is available!" if torch.cuda.is_available() else "CUDA is not available.")
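If that prints "CUDA is not available.", the installed wheel is almost certainly a CPU-only build. A small helper like this (illustrative only, not part of RealtimeTTS) makes the difference visible, since CPU-only wheels report torch.version.cuda as None:

```python
def cuda_build_info() -> str:
    # Illustrative diagnostic helper, not part of RealtimeTTS.
    try:
        import torch
    except ImportError:
        return "torch not installed"
    # torch.version.cuda is None on CPU-only wheels.
    return (f"torch {torch.__version__}, "
            f"cuda build: {torch.version.cuda}, "
            f"available: {torch.cuda.is_available()}")

print(cuda_build_info())
```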

If CUDA is installed properly, try enabling DeepSpeed for a speed boost (almost 2x faster):

pip install torch==2.1.2+cu121 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu121
pip install https://github.com/daswer123/deepspeed-windows-wheels/releases/download/11.2/deepspeed-0.11.2+cuda121-cp310-cp310-win_amd64.whl
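Before passing use_deepspeed=True, it's worth confirming the wheel actually imports. A minimal check (again just a sketch, not a RealtimeTTS API):

```python
def deepspeed_available() -> bool:
    # Broken or mismatched wheels can fail with errors other than
    # ImportError, so catch broadly for a simple yes/no answer.
    try:
        import deepspeed  # noqa: F401
        return True
    except Exception:
        return False

print(deepspeed_available())
```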

Here’s a quick test script with extended logging:

if __name__ == "__main__":
    from RealtimeTTS import TextToAudioStream, CoquiEngine
    import time

    def dummy_generator():
        yield "Hey guys! These here are realtime spoken sentences based on local text synthesis. "
        yield "With a local, neuronal, cloned voice. So every spoken sentence sounds unique."

    import logging
    logging.basicConfig(level=logging.INFO)
    engine = CoquiEngine(level=logging.INFO, use_deepspeed=True)

    stream = TextToAudioStream(engine, muted=True)

    print("Starting to play stream")

    start_time = time.time()
    stream.feed(dummy_generator()).play(
        log_synthesized_text=True,
        muted=True,
        output_wavfile=stream.engine.engine_name + "_output.wav",
    )
    end_time = time.time()

    print(f"Time taken for play command: {end_time - start_time:.2f} seconds")

    engine.shutdown()

You should see something like this in the output:

[INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)

For comparison, on my 4090, I get:

Time taken for play command: 3.62 seconds

That’s for a 16-second generated audio file, translating to a real-time factor of 0.22625. Your RTX 3060 should easily manage a real-time factor below 1.
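The real-time factor is just synthesis time divided by the duration of the generated audio; anything below 1 means the engine stays ahead of playback:

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    # RTF < 1: audio is produced faster than it plays back,
    # which is what smooth realtime streaming requires.
    return synthesis_seconds / audio_seconds

# The 4090 numbers quoted above: 3.62 s to synthesize 16 s of audio.
print(real_time_factor(3.62, 16.0))  # 0.22625
```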

So yeah, the engine is definitely not "unusable." A project like OpenInterpreter 01, which has 5,000+ GitHub stars, wouldn’t rely on it if that were the case.

Let’s figure this out. 😊

KoljaB avatar Dec 24 '24 11:12 KoljaB

I'm gonna try this too. The output currently speaks only a few words, then pauses, plays a few more, then pauses again, even with your coqui example script. I installed everything per the instructions in a new folder and virtual env. I'll get back with results.

MercyfulKing avatar Jan 21 '25 18:01 MercyfulKing

It's working now. Apparently my torch install didn't have CUDA. I was able to install torch with CUDA 12.1 from https://pytorch.org/get-started/locally/, then DeepSpeed 0.13.1 for Python 3.11 from the daswer123 repo. I ran the coqui example again and it's buttery smooth.

MercyfulKing avatar Jan 21 '25 19:01 MercyfulKing

I had similar issues with Coqui pausing mid-sentence, and sometimes just failing to complete, but on ROCm/HIP. This was accompanied by MIOpen warnings like:

MIOpen(HIP): Warning [IsEnoughWorkspace] [GetSolutionsFallback WTI] Solver <GemmFwdRest>...
MIOpen(HIP): Warning [IsEnoughWorkspace] [EvaluateInvokers] Solver <GemmFwdRest>...

I spent a lot of time trying out other engines because of it, but gave it another shot today after seeing a workaround to change the MIOpen find mode. Setting this in the environment before running coqui_test.py seems to work really well:

export MIOPEN_FIND_MODE=FAST
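If you launch everything from a Python script rather than a shell, the same workaround can be applied in-process, as long as it runs before torch initializes the HIP backend:

```python
import os

# Equivalent to `export MIOPEN_FIND_MODE=FAST`; must run before
# any torch import so MIOpen sees it at initialization.
os.environ["MIOPEN_FIND_MODE"] = "FAST"
```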

I'm running this on Arch Linux with ROCm 6.4.1 and a Radeon RX 6650 XT in a Python 3.12 venv, and before installing realtimetts, I installed the ROCm PyTorch via:

pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.4/

I hope this helps anyone facing similar issues!

Benzolio avatar Jul 04 '25 05:07 Benzolio