
feat(realtime): Add audio conversations

Open richiejp opened this pull request 2 months ago • 6 comments

Description

Add enough realtime API features to allow talking with an LLM using only audio.

Presently the realtime API only supports transcription, which is a minor use case for it. This PR should allow it to be used with a basic voice assistant.

This PR ignores many of the options and edge cases. Instead it will, for example, rely on server-side VAD to commit conversation items (a rough client sketch follows the checklist below).

Notes for Reviewers

  • [ ] Configure a model pipeline or use a multi-modal model.
  • [ ] Commit client audio to the conversation
  • [ ] Generate a text response (optional)
  • [ ] Generate an audio response
  • [ ] Interrupt generation on voice detection?
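
For reference, a minimal client sketch of the intended flow, assuming the endpoint and event names follow OpenAI's realtime API conventions; the URL, model name, and the exact subset of events this PR implements are assumptions:

```python
import asyncio
import base64
import json

import websockets  # pip install websockets

# Hypothetical endpoint and model name.
URL = "ws://localhost:8080/v1/realtime?model=voice-assistant"


async def talk(pcm_chunks):
    """Stream PCM16 audio chunks and collect the generated audio reply."""
    async with websockets.connect(URL) as ws:
        # Rely on server-side VAD to commit the audio buffer and trigger a
        # response, matching the simplification described above.
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {"turn_detection": {"type": "server_vad"}},
        }))

        # Append raw audio chunks, base64 encoded.
        for chunk in pcm_chunks:
            await ws.send(json.dumps({
                "type": "input_audio_buffer.append",
                "audio": base64.b64encode(chunk).decode("ascii"),
            }))

        # Collect the audio deltas of the generated response.
        reply = bytearray()
        async for raw in ws:
            event = json.loads(raw)
            if event.get("type") == "response.audio.delta":
                reply.extend(base64.b64decode(event["delta"]))
            elif event.get("type") == "response.done":
                break
        return bytes(reply)

# asyncio.run(talk(chunks)), where `chunks` is an iterable of raw PCM16 byte strings.
```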

Fixes: #3714 (but we'll need follow-up issues)

Signed commits

  • [x] Yes, I signed my commits.

richiejp avatar Sep 10 '25 14:09 richiejp

Deploy Preview for localai ready!

| Name | Link |
| --- | --- |
| Latest commit | c1b9f23214552c494f6fac5c3c0b730e9ab09e8b |
| Latest deploy log | https://app.netlify.com/projects/localai/deploys/68d03d019818140008d2dae5 |
| Deploy Preview | https://deploy-preview-6245--localai.netlify.app |

To edit notification comments on pull requests, go to your Netlify project configuration.

netlify[bot] avatar Sep 10 '25 14:09 netlify[bot]

It's not clear to me if we have audio support in llama.cpp: https://github.com/ggml-org/llama.cpp/discussions/15194

richiejp avatar Sep 13 '25 10:09 richiejp

https://github.com/ggml-org/llama.cpp/discussions/13759

richiejp avatar Sep 13 '25 10:09 richiejp

https://github.com/ggml-org/llama.cpp/pull/13784

richiejp avatar Sep 13 '25 10:09 richiejp

My initial thought on this was to use the whisper backend to transcribe the audio from VAD and give the text to a text-to-text backend, so we can always fall back to this. There was also an interface created exactly for this, so a pipeline can be seen as a kind of "drag and drop" until omni models are really capable.
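
Concretely, that pipeline could be expressed over LocalAI's OpenAI-compatible HTTP endpoints roughly like this (a sketch only: the endpoint paths and model names are placeholders, and the realtime server would do the equivalent internally):

```python
import requests  # pip install requests

BASE = "http://localhost:8080/v1"  # assumed LocalAI base URL


def pipeline_reply(wav_bytes: bytes) -> bytes:
    # 1. Speech to text via the whisper backend.
    text = requests.post(
        f"{BASE}/audio/transcriptions",
        files={"file": ("turn.wav", wav_bytes, "audio/wav")},
        data={"model": "whisper-base"},
    ).json()["text"]

    # 2. Text to text with any chat model.
    answer = requests.post(
        f"{BASE}/chat/completions",
        json={"model": "some-chat-model",
              "messages": [{"role": "user", "content": text}]},
    ).json()["choices"][0]["message"]["content"]

    # 3. Text back to speech.
    return requests.post(
        f"{BASE}/audio/speech",
        json={"model": "some-tts-model", "input": answer},
    ).content
```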

However, yes, audio input is actually supported by llama.cpp and our backends. Try qwen2-omni: you will be able to give it audio as input, but it isn't super accurate (transcription is better for now).

mudler avatar Sep 21 '25 16:09 mudler

OK, I tried Qwen 2 omni and had issues with accuracy and context length, which aren't a problem for a pipeline.

richiejp avatar Sep 21 '25 17:09 richiejp