Feature Request: Ultravox example
Hi,
Ultravox is a suite of open-weight models designed to get the time to first token as low as possible with audio input. Basically, they trained a good, fast projector that maps the Whisper large-v3 encoder into Llama 3.1 LLMs, in both 8B and 70B sizes.
I think it would be a great fit for LiveKit's agents, so it would be nice to add an example and demo for it!
Thanks! Happy to work with folks at LiveKit to make this happen.
Any update on this?
Hello everyone, any update on this?
I'm still very interested, especially as nothing as simple as Ollama exists for chat (audio + text), but I lack the skills to implement it. I'm still surprised no one has created it, since it seems to be in the best interest of all parties involved: LiveKit, Ultravox, Kyutai (they made Moshi), etc. And everyone seems to advertise their solution as easy to implement.
Hi, it will be very easy to implement with our next big release.
Sounds great @jayeshp19, any approximate idea on dates for that?
In the next week or two. You can track progress here: https://github.com/livekit/agents/pull/1364
@jayeshp19 Any current updates on this? I am not seeing anything related to this in the link provided. Thanks!
I'm looking to use Ultravox with LiveKit and would love to know where this project stands.
Is there no working example for the community yet?
I am working on a PR for this: https://github.com/livekit/agents/pull/2409
[!WARNING] This is only for Ultravox's paid API service, not the model. If you want to use the model, you will have to host and manage it somewhere yourself. There are many similar model-related issues here (#2262, #1724, #962, #1687), but unfortunately I think that's out of the current scope.
Any update on this? Would really like Ultravox to work with LiveKit.
Is there any update on LiveKit support for Ultravox?
+1
Hey @jayeshp19, just following up on this.
It would be great if it supported fine-tuned versions of the model too (fine-tuned on custom data to support new languages such as Cantonese). I'm still investigating how this would work on a technical level.
Any update?
Any update?