
[Improvement] TTS and STT endpoints

Open hardWorker254 opened this issue 4 months ago • 6 comments

It would be nice to add /v1/audio/transcriptions, /v1/audio/translations and /v1/audio/speech endpoints. I'm currently working on /v1/audio/transcriptions, but my Wi-Fi network is dead and I can't continue working on it yet. For STT I'm planning to use whisper.cpp, and maybe F5-TTS for speech synthesis.
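For reference, an OpenAI-compatible /v1/audio/transcriptions call is just a multipart file upload plus a model name. A minimal client sketch is below; the base URL, port, and model name are placeholders rather than anything defined by this project.

```python
# Minimal sketch of a client hitting an OpenAI-compatible
# /v1/audio/transcriptions endpoint. Base URL and model name are placeholders.
import requests

def transcribe(audio_path: str, base_url: str = "http://localhost:8080") -> str:
    with open(audio_path, "rb") as f:
        resp = requests.post(
            f"{base_url}/v1/audio/transcriptions",
            files={"file": (audio_path, f, "application/octet-stream")},
            data={"model": "whisper-small"},
            timeout=300,
        )
    resp.raise_for_status()
    # OpenAI-style responses return the transcript in a "text" field
    return resp.json()["text"]

if __name__ == "__main__":
    print(transcribe("example.wav"))
```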

hardWorker254 avatar Aug 12 '25 19:08 hardWorker254

STT looks reasonable, since whisper.cpp ships whisper-server, so we can run the Whisper model through that. llama.cpp also has support for some TTS models, though not through the server endpoint.

TTS is harder. Apart from the very old OpenTTS, I haven't been able to find good servers for TTS (and believe me, I've tried). If you have any proposals, I'm open (but integrating with Python-based TTS frameworks would be very ugly due to the environment setup issues they usually have).

pwilkin avatar Aug 12 '25 19:08 pwilkin

Finished writing /v1/audio/transcriptions. The approximate logic of the code:

1. A request comes in for /v1/audio/transcriptions.
2. The audio is converted to 16-bit WAV format using ffmpeg (I couldn't find another way besides calling subprocess.run, so the code requires ffmpeg to be installed).
3. whisper-server is started.
4. A request is sent to http://host:port/inference.
5. The response is received.
6. whisper-server is stopped.
7. The original request is answered.
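A rough sketch of steps 2–5, assuming ffmpeg is on PATH and whisper-server is already listening. The /inference form-field names and the 16 kHz mono requirement follow whisper.cpp's server example, but treat them as assumptions to verify against your build.

```python
# Sketch of the conversion + /inference call described above.
# Assumes ffmpeg on PATH and a running whisper-server; field names for
# /inference are assumptions based on whisper.cpp's server example.
import subprocess
import requests

def to_wav(src: str, dst: str = "temp.wav") -> str:
    # whisper.cpp wants 16-bit PCM WAV, 16 kHz, mono
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-ar", "16000", "-ac", "1",
         "-c:a", "pcm_s16le", dst],
        check=True, capture_output=True,
    )
    return dst

def transcribe_via_whisper_server(src: str, host: str = "localhost",
                                  port: int = 4600) -> str:
    wav = to_wav(src)
    with open(wav, "rb") as f:
        resp = requests.post(
            f"http://{host}:{port}/inference",
            files={"file": (wav, f, "audio/wav")},
            data={"response_format": "json"},
            timeout=600,
        )
    resp.raise_for_status()
    return resp.json().get("text", "")
```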

All that's left is to somehow connect this with PySide6 for the GUI. I've never worked with PySide6, so this will take a very, very long time.

In ~/.llama-runner there will be two files: temp_input_audio and temp.wav. The first is the raw upload used for the WAV conversion, and the second is the converted file sent in the request to whisper-server.

The configuration file now looks like this:

{
  "llama-runtimes": { ... },
  "models": { ... },
  "proxies": { ... },
  "audio": {
    "runtimes": {
      "default": {
        "runtime": "/home/prof/whisper.cpp/build/bin/whisper-server"
      }
    },
    "models": {
      "small": {
        "model_path": "/home/prof/whisper.cpp/models/ggml-small.bin",
        "runtime": "default",
        "parameters": {
          "host": "localhost",
          "port": "4600",
          "language": "auto"
        }
      }
    }
  }
}
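As an illustration of how that "audio" section could be consumed, here is a hedged sketch that builds a whisper-server command line from it. The flag names (--host, --port, -m, -l) are what whisper.cpp's server example usually accepts; check whisper-server --help before relying on them, and the config path in the usage comment is hypothetical.

```python
# Sketch: turn the "audio" config section into a whisper-server command line.
# Flag names are assumptions based on whisper.cpp's server example.
import json

def build_whisper_cmd(config_path: str, model_name: str) -> list:
    with open(config_path) as f:
        cfg = json.load(f)
    audio = cfg["audio"]
    model = audio["models"][model_name]
    runtime = audio["runtimes"][model["runtime"]]["runtime"]
    params = model.get("parameters", {})
    return [
        runtime,
        "-m", model["model_path"],
        "--host", params.get("host", "localhost"),
        "--port", str(params.get("port", "4600")),
        "-l", params.get("language", "auto"),
    ]

# Hypothetical usage:
# cmd = build_whisper_cmd("config.json", "small")
```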

hardWorker254 avatar Aug 14 '25 19:08 hardWorker254

Any reason why you want to immediately stop the server? I'd see it more like another instance - start a whisper server on demand if needed, stop it if any other model needs serving.
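To illustrate the on-demand idea (this is not the project's actual manager code), a minimal lifecycle wrapper might look like this:

```python
# Illustrative sketch of "start on demand, stop when another model needs the
# slot"; not taken from llama-runner's actual manager code.
import subprocess
from typing import List, Optional

class OnDemandServer:
    def __init__(self, cmd: List[str]):
        self.cmd = cmd
        self.proc: Optional[subprocess.Popen] = None

    def ensure_running(self) -> None:
        # Launch the server only when a request actually needs it.
        if self.proc is None or self.proc.poll() is not None:
            self.proc = subprocess.Popen(self.cmd)

    def stop(self) -> None:
        # Called when a different model needs to be served instead.
        if self.proc is not None and self.proc.poll() is None:
            self.proc.terminate()
            try:
                self.proc.wait(timeout=30)
            except subprocess.TimeoutExpired:
                self.proc.kill()
        self.proc = None
```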

You don't need to tie this into PySide6. I made an abstraction some time ago, so all server launching now runs through llama_runner_manager.py and llama-runner-thread.py; only the whisper-server-specific parsing of the port number / server readiness message would have to be added.
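The whisper-server-specific piece would roughly be a readiness check like the sketch below; the exact wording of whisper-server's startup message is an assumption here, so the regex would need adjusting to the real output.

```python
# Sketch of parsing whisper-server's startup output for readiness / port.
# The "listening" wording is an assumption; adjust to the real log line.
import re
import subprocess

READY_RE = re.compile(r"listening.*?:(\d+)", re.IGNORECASE)

def wait_until_ready(proc: subprocess.Popen) -> int:
    """Read process output until a 'listening on port N' style line appears."""
    assert proc.stdout is not None
    for raw in proc.stdout:
        line = raw.decode("utf-8", errors="replace")
        match = READY_RE.search(line)
        if match:
            return int(match.group(1))
    raise RuntimeError("whisper-server exited before reporting readiness")

# Hypothetical usage:
# proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
# port = wait_until_ready(proc)
```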

pwilkin avatar Aug 14 '25 20:08 pwilkin

Tomorrow I'll try to make whisper-server not stop immediately after handling a request. I'll also look at the llama_runner_manager.py and llama-runner-thread.py files later.

hardWorker254 avatar Aug 14 '25 20:08 hardWorker254

I wrote a rough version and pushed it to my fork: https://github.com/hardWorker254/llama-runner-tools-fix. It still needs testing. I will add /v1/audio/translations in the near future, and I'll also double-check the code and translate the comments into English.

hardWorker254 avatar Aug 15 '25 14:08 hardWorker254

Cool! I'm doing a refactor now to fix the thread-launching logic; I'll try to merge once that's done.

pwilkin avatar Aug 17 '25 11:08 pwilkin