Make VoiceAssistant support multiple participants
Feature request: currently the voice assistant only starts for a single participant, so to forward transcriptions and use the other voice assistant features for each participant we have to start multiple VoiceAssistant instances or add custom transcription.
Hey, we plan to support multiple participants for the same Voice Assistant in the future, but this isn't a priority at the moment.
We greatly appreciate contributions!
How's the work on multiple-participant support going? That would help greatly!
One of the use cases would be live translation via an agent. Most of the logic is already there in the voice assistant.
Hey @theomonnom, don't mean to bump an old thread - does this remain unsupported at present?
Switching between multiple participants will be supported in agents 1.0: there can be multiple participants in the room, there will be an API to select which one talks to the agent, and all the participants can hear the agent. We should have a beta release very soon.
Great! Thanks, appreciate the update - v cool :)
Hey @longcw, any updates or corresponding examples?
@nitishymtpl Yes, we just released the agents 1.0 RC; there is an example of how to switch participants: https://github.com/livekit/agents/blob/dev-1.0/examples/voice_agents/toggle_io.py#L14
Check here https://github.com/livekit/agents?tab=readme-ov-file#-10-release- to see how to install the RC version.
Hi @longcw, thanks a lot! However, it seems the AI agent can only hear one participant, not the other participants.
You may create multiple RoomIO instances: only one of them has output enabled, and each one connects its audio input to a different participant. But you then need to handle the multiple audio inputs yourself, e.g. by mixing them or sending them to separate STT instances.
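Not tested, but a minimal sketch of that layout, assuming the 1.0 RC RoomIO / RoomInputOptions / RoomOutputOptions API (the constructor arguments and import paths are my best guess and should be checked against your installed version; handling or mixing the resulting audio inputs is still up to you):

```python
# Sketch only: the RoomIO constructor arguments below are assumptions based on the agents 1.0 RC.
from livekit import rtc
from livekit.agents import AgentSession, RoomInputOptions, RoomOutputOptions
from livekit.agents.voice.room_io import RoomIO


async def attach_room_ios(session: AgentSession, room: rtc.Room, identities: list[str]):
    """One RoomIO per participant; only the first instance publishes the agent's audio."""
    room_ios = []
    for i, identity in enumerate(identities):
        room_io = RoomIO(
            session,
            room=room,
            participant=identity,  # bind this instance's audio input to one participant
            input_options=RoomInputOptions(audio_enabled=True),
            # publish the agent's audio only once, otherwise every instance would
            # re-publish the same response into the room
            output_options=RoomOutputOptions(audio_enabled=(i == 0)),
        )
        await room_io.start()
        room_ios.append(room_io)
    return room_ios
```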
Hi @longcw, the only problem is that the AI agent can listen to only one participant. Since the AI agent sends its output into the room, both participants can hear its response.
This could be an interesting use case: having multiple users in the room.
Reading audio streams from multiple participants and mixing them is feasible, but the problem is how the AI agent hears both: does a realtime API or an STT model take the audio? You would need the model to support distinguishing between the speakers.
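For the mixing route specifically, here is a rough sketch of reading frames from several participants with the rtc SDK and summing them into one stream. It assumes rtc.AudioStream accepts sample_rate/num_channels so all inputs share a format, and it does naive lock-step, clipped int16 addition:

```python
import asyncio

import numpy as np
from livekit import rtc

SAMPLE_RATE = 16000
NUM_CHANNELS = 1


async def mix_tracks(tracks: list[rtc.RemoteAudioTrack]):
    """Async generator yielding mixed int16 frames built from all participants' audio."""
    queues = [asyncio.Queue() for _ in tracks]

    async def pump(track: rtc.RemoteAudioTrack, q: asyncio.Queue):
        # one reader task per participant; AudioStream resamples to the common format
        stream = rtc.AudioStream(track, sample_rate=SAMPLE_RATE, num_channels=NUM_CHANNELS)
        async for event in stream:
            q.put_nowait(np.frombuffer(event.frame.data, dtype=np.int16))

    readers = [asyncio.create_task(pump(t, q)) for t, q in zip(tracks, queues)]
    try:
        while True:
            # naive lock-step mixing: wait for one frame from every participant,
            # then sum and clip back to int16
            frames = await asyncio.gather(*(q.get() for q in queues))
            n = min(len(f) for f in frames)
            mixed = np.sum([f[:n].astype(np.int32) for f in frames], axis=0)
            yield np.clip(mixed, -32768, 32767).astype(np.int16)
    finally:
        for task in readers:
            task.cancel()
```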
We could use either one: a realtime API or an STT pipeline.
I don’t think any publicly available LLM API currently supports multi-participant audio natively, and I doubt we’ll see that change soon. However, here are a few possible approaches:
- The LLM or speech-to-text model supports speaker diarization (best case). This allows it to distinguish between different speakers automatically.
- The model accepts metadata along with the audio input. In this case, you could manually label each speaker and prepend their identity before merging the audio. This would require some preprocessing (see the sketch after this list for a text-level variant).
- A complex and hacky workaround (not recommended for anything beyond experimental stages): you could prepend each participant’s audio with a short TTS-generated snippet like “Username: {username}” and cache these in memory. Then, when a participant speaks, you detect the activity, insert their identifier snippet before their actual audio, and send the whole thing to the LLM. For example, the final audio sent might sound like: “Username: Alice. Hey, can you find my data?” This gives the LLM a chance to infer who is speaking. But again, this approach is messy and error-prone, and should really only be used in prototypes.
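For the second option, a text-level variant is often enough when using an STT pipeline: run one STT stream per participant and prepend the speaker's identity to every final transcript before it reaches the shared LLM context. A minimal sketch; the stt.stream() / SpeechEventType.FINAL_TRANSCRIPT / alternatives[0].text names follow my reading of the livekit-agents STT interface and should be verified against your version:

```python
import asyncio

from livekit import rtc
from livekit.agents import stt as agent_stt


async def transcribe_with_label(
    participant: rtc.RemoteParticipant,
    track: rtc.RemoteAudioTrack,
    stt: agent_stt.STT,
    transcripts: asyncio.Queue,  # shared queue consumed by whatever feeds the LLM
):
    """Push one participant's audio into its own STT stream and label the output."""
    stream = stt.stream()

    async def push_audio():
        async for event in rtc.AudioStream(track):
            stream.push_frame(event.frame)

    pusher = asyncio.create_task(push_audio())
    try:
        async for event in stream:
            if event.type == agent_stt.SpeechEventType.FINAL_TRANSCRIPT:
                text = event.alternatives[0].text
                # the prepended identity is what lets a single LLM tell speakers apart
                await transcripts.put(f"{participant.identity}: {text}")
    finally:
        pusher.cancel()
```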
Hey @OrkhanGG, @theomonnom, any updates or corresponding examples?
Hello guys, do we have any update regarding this? It would be extremely useful if you could provide an example of how we can have one agent that can support conversations with multiple participants!
I initially achieved this using some very hacky methods by building my own agent implementation from scratch with LiveKit. However, with the new Gemini Live API model, the process is now much easier and more robust. I'm still in the process of refactoring the code, but if you try the "Gemini 2.5 Flash Preview Native Audio Dialog", things get even simpler (you can provide model name as 'gemini-2.5-flash-preview-native-audio-dialog'). It supports proactive dialogue and several additional parameters that can be useful for multi-participant audio rooms.
You can check out this example to get started: https://github.com/livekit/agents/blob/main/examples/voice_agents/push_to_talk.py
There are also some advanced possibilities you might explore, like detecting the active speaker and having the agent listen only to them. Supporting simultaneous audio and video input from all participants would be a huge win, though I don't think current LLMs can easily handle that out of the box. Any alternative solution would likely require enterprise-level effort to bring into production.
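For reference, wiring that model into an AgentSession looks roughly like this. The google.beta.realtime.RealtimeModel entry point is from the livekit-plugins-google package as I recall it, and I've left out the proactive-dialogue options because their exact parameter names depend on the plugin version:

```python
from livekit import agents
from livekit.agents import Agent, AgentSession
from livekit.plugins import google


async def entrypoint(ctx: agents.JobContext):
    session = AgentSession(
        # Gemini Live model with native audio in and out; the extra proactive-dialogue
        # parameters mentioned above exist but are omitted here on purpose
        llm=google.beta.realtime.RealtimeModel(
            model="gemini-2.5-flash-preview-native-audio-dialog",
        ),
    )
    await session.start(
        agent=Agent(instructions="You are assisting several participants in this room."),
        room=ctx.room,
    )
```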
Good job. I'll test it.
Is the code in the example somehow tied to the model you are using? I mean, if it works on Gemini, can I make it work with OpenAI with some fine-tuning or not? I am asking because currently our whole infra is on OpenAI!
Yes, you can definitely do that. The example I shared above demonstrates how to set up an agent that focuses on a specific participant. The trick is to create the illusion that the agent is listening to all participants.
If you don’t want the agent to focus on a specific participant based on client-side events, you can use something like active speaker detection instead. This might require a few extra steps to implement properly, but in my opinion, it's definitely worth considering and is very likely to work well with OpenAI's real-time voice API.
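A sketch of that active-speaker variant: listen to the room's active_speakers_changed event and repoint the agent's audio input at whoever is currently speaking. room_io.set_participant is the participant-switching call from the 1.0 examples; treat it (and the event payload) as the pieces to double-check:

```python
from livekit import rtc


def follow_active_speaker(room: rtc.Room, room_io) -> None:
    """Keep the agent's single audio input linked to the current loudest speaker."""

    @room.on("active_speakers_changed")
    def _on_active_speakers(speakers: list[rtc.Participant]):
        # ignore empty updates and the agent's own (local) participant
        remote = [p for p in speakers if isinstance(p, rtc.RemoteParticipant)]
        if remote:
            # switch which participant feeds the agent; output to the room is unchanged,
            # so everyone still hears the responses
            room_io.set_participant(remote[0].identity)
```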
@OrkhanGG How is this going to work with the Gemini Live API, given that it includes built-in VAD-based turn detection, which is currently the only supported method? Is there any way to make it manual? Can you provide an example?
The project I’m working on is private, but I’ll put together a public demo soon and share it here unless the LiveKit team beats me to it! Hopefully, it’ll clear up your main concern about the “two drivers, one steering wheel” (two VADs and one LLM) setup.
It would be amazing to have something like this. @OrkhanGG, let us know if that works for you.