LocalAI
feat: Realtime API support
Description
This PR fixes https://github.com/mudler/LocalAI/issues/3714
And also covers #191
Notes for Reviewers
Signed commits
- [ ] Yes, I signed my commits.
Deploy Preview for localai ready!
| Name | Link |
|---|---|
| Latest commit | f272605b950d35e4360d638a9b30fa7e343749e4 |
| Latest deploy log | https://app.netlify.com/sites/localai/deploys/67868d4d9141c90008d963f5 |
| Deploy Preview | https://deploy-preview-3722--localai.netlify.app |
yamllint Failed
::group::gallery/arch-function.yaml
::error file=gallery/arch-function.yaml,line=66,col=22::66:22 [new-line-at-end-of-file] no new line character at the end of file
::endgroup::
Workflow: Yamllint GitHub Actions, Action: __karancode_yamllint-github-action, Lint: gallery
Just for reference, openai-realtime-console seems quite nice for testing things out especially at this stage, I've opened up a PR upstream to include a Dockerfile and instructions on how to use it with a local server: https://github.com/openai/openai-realtime-console/pull/59
What's the best option here if we want to contribute? Just make forks of the branch and open PRs against this?
Yes, that would work just fine!
What is done:
- [x] API spec
- [x] Updating session, starting VAD server
- [x] Hooking server API specs to placeholder functions
- [x] Register ws server, and test client-side functionality
- [x] Created a wrapped model definition for emulating Audio-to-Audio models when the backend does not support it (via an STT->LLM->TTS pipeline)
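The wrapped-model idea above can be sketched roughly as follows. This is an illustrative Go sketch only, not the PR's actual types or interfaces: a wrapper presents an audio-in/audio-out interface while internally chaining three single-purpose backends (STT, LLM, TTS). The interface and struct names here are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// Hypothetical single-purpose backend interfaces (names are illustrative,
// not the PR's actual model interface).
type Transcriber interface {
	Transcribe(audio []byte) (string, error)
}
type ChatModel interface {
	Complete(prompt string) (string, error)
}
type Synthesizer interface {
	Synthesize(text string) ([]byte, error)
}

// WrappedAudioModel emulates an Audio-to-Audio model by running
// STT -> LLM -> TTS as a pipeline.
type WrappedAudioModel struct {
	stt Transcriber
	llm ChatModel
	tts Synthesizer
}

// Respond takes input audio and produces reply audio.
func (w *WrappedAudioModel) Respond(audio []byte) ([]byte, error) {
	text, err := w.stt.Transcribe(audio)
	if err != nil {
		return nil, fmt.Errorf("stt: %w", err)
	}
	reply, err := w.llm.Complete(text)
	if err != nil {
		return nil, fmt.Errorf("llm: %w", err)
	}
	return w.tts.Synthesize(reply)
}

// Toy stand-ins so the pipeline can be exercised without real backends.
type echoSTT struct{}

func (echoSTT) Transcribe(a []byte) (string, error) { return string(a), nil }

type upperLLM struct{}

func (upperLLM) Complete(p string) (string, error) { return strings.ToUpper(p), nil }

type byteTTS struct{}

func (byteTTS) Synthesize(t string) ([]byte, error) { return []byte(t), nil }

func main() {
	m := &WrappedAudioModel{stt: echoSTT{}, llm: upperLLM{}, tts: byteTTS{}}
	out, _ := m.Respond([]byte("hello"))
	fmt.Println(string(out)) // prints "HELLO"
}
```

The point of the wrapper is that callers see one audio interface regardless of whether the backend is natively Audio-to-Audio or emulated.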
Things left:
- [ ] handle conversation templating (like we do in chat.go; this is a good opportunity to do some code extraction)
- [ ] add a VAD backend, or embed VAD directly in the current Golang code. Having a backend would make it modular and re-use part of the existing code base
- [ ] add an Audio-to-Audio backend and define the gRPC APIs for it. Implement usage here
- [ ] hook the model interface to the various backend functions, and update the wrapped model so it works both when emulating an Audio-to-Audio model (by running things in a pipeline: STT -> LLM -> TTS) and with native Audio-to-Audio
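A modular VAD backend along the lines described above could have a shape like the following. This is a hedged sketch: the interface, the segment type, and the toy energy-based detector are all hypothetical stand-ins, not the PR's actual gRPC API; a real backend (e.g. silero) would implement the same contract behind gRPC.

```go
package main

import "fmt"

// VADSegment marks one detected speech span (hypothetical type).
type VADSegment struct {
	Start, End float64 // seconds from stream start
}

// VADBackend is an illustrative contract a pluggable VAD backend
// could satisfy.
type VADBackend interface {
	// DetectSegments returns speech segments found in mono PCM audio.
	DetectSegments(pcm []float32, sampleRate int) ([]VADSegment, error)
}

// energyVAD is a toy stand-in: it marks a segment wherever average
// frame energy exceeds a threshold, merging adjacent speech frames.
type energyVAD struct {
	frameSize int
	threshold float32
}

func (v energyVAD) DetectSegments(pcm []float32, sampleRate int) ([]VADSegment, error) {
	var segs []VADSegment
	var cur *VADSegment
	for off := 0; off+v.frameSize <= len(pcm); off += v.frameSize {
		var e float32
		for _, s := range pcm[off : off+v.frameSize] {
			e += s * s
		}
		e /= float32(v.frameSize)
		if e >= v.threshold {
			if cur == nil {
				cur = &VADSegment{Start: float64(off) / float64(sampleRate)}
			}
			cur.End = float64(off+v.frameSize) / float64(sampleRate)
		} else if cur != nil {
			// Silence frame closes the current speech segment.
			segs = append(segs, *cur)
			cur = nil
		}
	}
	if cur != nil {
		segs = append(segs, *cur)
	}
	return segs, nil
}

func main() {
	v := energyVAD{frameSize: 4, threshold: 0.01}
	pcm := []float32{0, 0, 0, 0, 0.5, 0.5, 0.5, 0.5, 0, 0, 0, 0}
	segs, _ := v.DetectSegments(pcm, 8)
	fmt.Printf("%+v\n", segs) // one segment covering the loud frames
}
```

Keeping the contract this small is what makes the backend swappable: the server only needs segments back, not the model internals.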
Currently creating the VAD backend with silero, attaching it to the compilation process and to the binary releases.
Mh, things are moving in the right direction, but the VAD still isn't right: it detects the start of the conversation, but can't detect the end of the segment yet.
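The missing end-of-segment detection usually comes down to a "hang-over" rule: once speech has started, only declare the utterance finished after several consecutive silent frames, so brief pauses don't cut it off. A minimal sketch of that logic, with illustrative names and thresholds (not the silero integration's actual code):

```go
package main

import "fmt"

// EndpointDetector is a hypothetical helper that turns per-frame
// speech/silence decisions into an end-of-utterance signal.
type EndpointDetector struct {
	threshold        float32 // per-frame average energy threshold
	minSilenceFrames int     // silent frames required to close a segment
	inSpeech         bool
	silentRun        int
}

func frameEnergy(frame []float32) float32 {
	var sum float32
	for _, s := range frame {
		sum += s * s
	}
	return sum / float32(len(frame))
}

// Push feeds one audio frame; it returns true exactly once, when the
// current utterance is considered finished.
func (d *EndpointDetector) Push(frame []float32) bool {
	if frameEnergy(frame) >= d.threshold {
		d.inSpeech = true
		d.silentRun = 0 // speech resets the silence counter
		return false
	}
	if !d.inSpeech {
		return false // silence before any speech is ignored
	}
	d.silentRun++
	if d.silentRun >= d.minSilenceFrames {
		d.inSpeech = false
		d.silentRun = 0
		return true // enough trailing silence: segment ended
	}
	return false
}

func main() {
	d := &EndpointDetector{threshold: 0.01, minSilenceFrames: 3}
	loud := []float32{0.5, 0.5, 0.5, 0.5}
	quiet := []float32{0, 0, 0, 0}
	fmt.Println(d.Push(loud))  // false: speech starts
	fmt.Println(d.Push(quiet)) // false: 1 silent frame
	fmt.Println(d.Push(quiet)) // false: 2 silent frames
	fmt.Println(d.Push(quiet)) // true: 3 silent frames, utterance ends
}
```

Tuning `minSilenceFrames` trades responsiveness against cutting the speaker off mid-sentence.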
Extracted the silero-vad bits over here: https://github.com/mudler/LocalAI/pull/4204 so it can be tackled separately.
Closing, as we merged it in https://github.com/mudler/LocalAI/pull/5392