LocalAI
feat(realtime): Add audio conversations
Description
Add enough realtime API features to allow talking with an LLM using only audio.
Presently the realtime API only supports transcription which is a minor use-case for it. This PR should allow it to be used with a basic voice assistant.
This PR will ignore many of the options and edge cases; instead it will, for example, rely on server-side VAD to commit conversation items.
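Server-side VAD is selected through the standard realtime `session.update` event. A minimal sketch of that payload (field names follow the OpenAI realtime API shape that this endpoint mirrors; the tuning values are illustrative, not defaults):

```python
import json

# Sketch of a session.update event enabling server-side VAD, so the
# server commits audio buffers to the conversation automatically.
# Field names follow the OpenAI realtime API shape; values are illustrative.
session_update = {
    "type": "session.update",
    "session": {
        "modalities": ["audio", "text"],
        "turn_detection": {
            "type": "server_vad",        # let the server detect end of speech
            "threshold": 0.5,            # activation sensitivity (illustrative)
            "silence_duration_ms": 500,  # silence before the turn is committed
        },
    },
}

print(json.dumps(session_update, indent=2))
```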
Notes for Reviewers
- [ ] Configure a model pipeline or use a multi-modal model.
- [ ] Commit client audio to the conversation
- [ ] Generate a text response (optional)
- [ ] Generate an audio response
- [ ] Interrupt generation on voice detection?
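The checklist above corresponds roughly to the following client-side flow. A hedged sketch: the event name follows the OpenAI realtime API convention, and `pcm16_chunk` stands in for real microphone audio.

```python
import base64
import json

def audio_append_event(pcm16_chunk: bytes) -> str:
    """Wrap a raw PCM16 audio chunk in an input_audio_buffer.append event."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm16_chunk).decode("ascii"),
    })

# With server-side VAD enabled, the client only streams audio; the server
# commits the buffer, creates the conversation item, and starts a response
# (streaming audio deltas back) once it detects the end of speech.
event = audio_append_event(b"\x00\x01" * 160)  # 160 fake PCM16 samples
print(event[:60])
```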
Fixes: #3714 (but we'll need follow-up issues)
Signed commits
- [x] Yes, I signed my commits.
Deploy Preview for localai ready!
| Name | Link |
|---|---|
| Latest commit | c1b9f23214552c494f6fac5c3c0b730e9ab09e8b |
| Latest deploy log | https://app.netlify.com/projects/localai/deploys/68d03d019818140008d2dae5 |
| Deploy Preview | https://deploy-preview-6245--localai.netlify.app |
It's not clear to me if we have audio support in llama.cpp: https://github.com/ggml-org/llama.cpp/discussions/15194
https://github.com/ggml-org/llama.cpp/discussions/13759
https://github.com/ggml-org/llama.cpp/pull/13784
My initial thought on this was to use the whisper backend for transcribing after VAD and feed the text to a text-to-text backend; that way we can always fall back on this approach. There was also an interface created exactly for this, so a pipeline can be seen as a "drag and drop" solution until omni models are really capable.
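That pipeline idea (VAD → transcription → text-to-text, with TTS on the way out) can be sketched as a simple composition. The backend callables here are hypothetical stand-ins for illustration, not LocalAI's actual interface:

```python
from typing import Callable

def make_audio_pipeline(
    transcribe: Callable[[bytes], str],   # e.g. a whisper backend
    chat: Callable[[str], str],           # any text-to-text backend
    synthesize: Callable[[str], bytes],   # a TTS backend
) -> Callable[[bytes], bytes]:
    """Chain speech-to-text, an LLM, and text-to-speech into one
    audio-in/audio-out function; each stage stays swappable."""
    def pipeline(audio_in: bytes) -> bytes:
        text_in = transcribe(audio_in)
        text_out = chat(text_in)
        return synthesize(text_out)
    return pipeline

# Stub backends just to show the plumbing:
pipeline = make_audio_pipeline(
    transcribe=lambda audio: "hello",
    chat=lambda text: f"you said: {text}",
    synthesize=lambda text: text.encode(),
)
print(pipeline(b"\x00\x00"))  # → b'you said: hello'
```

Because each stage is just a function, an omni model could later replace the whole chain behind the same audio-in/audio-out signature.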
However, yes, audio input is actually supported by llama.cpp and our backends. Try qwen2-omni: you will be able to give it audio as input, but it isn't super accurate (it's better at transcribing, for now).
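Giving an omni model audio directly would then look like an ordinary chat completion with an audio content part. A sketch only: the content-part shape follows the OpenAI `input_audio` convention, the model name assumes a locally configured qwen2-omni, and the audio bytes are placeholders:

```python
import base64
import json

# Hypothetical request body: a chat completion carrying an audio content part
# alongside the text prompt. "...wav bytes..." stands in for a real clip.
with_audio = {
    "model": "qwen2-omni",  # assumes this model is configured locally
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is said in this clip?"},
            {"type": "input_audio", "input_audio": {
                "data": base64.b64encode(b"...wav bytes...").decode("ascii"),
                "format": "wav",
            }},
        ],
    }],
}
print(json.dumps(with_audio)[:80])
```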
OK, I tried Qwen 2 omni and had issues with accuracy and context length, which aren't a problem for a pipeline.