
Make VoiceAssistant support multiple participants

Open Nicj228 opened this issue 1 year ago • 14 comments

Feature request: currently the voice assistant only starts for one participant. So to forward transcriptions and use the other voice assistant features for each participant, we have to start multiple VoiceAssistant instances or add custom transcription.

Nicj228 avatar Jun 25 '24 09:06 Nicj228

Hey, we plan to support multiple participants for the same Voice Assistant in the future, but this isn't a priority at the moment.

theomonnom avatar Jul 01 '24 22:07 theomonnom

We greatly appreciate contributions!

theomonnom avatar Jul 01 '24 22:07 theomonnom

How's the work on multiple-participant support going? That would help greatly!

One of the use cases would be live translation via an agent. Most of the logic is already there in the voice assistant.

Shandelier avatar Sep 18 '24 13:09 Shandelier

Hey @theomonnom, don't mean to bump an old thread - does this remain unsupported at present?

RobMaye avatar Mar 20 '25 21:03 RobMaye

Switching between multiple participants will be supported in Agents 1.0: there can be multiple participants in the room, and there will be an API to select which one talks to the agent; all participants can hear the agent. We should have a beta release very soon.

longcw avatar Mar 21 '25 00:03 longcw

Great! Thanks, appreciate the update - v cool :)

RobMaye avatar Mar 21 '25 14:03 RobMaye

Hey @longcw, any updates or corresponding examples?

nitishymtpl avatar Mar 27 '25 13:03 nitishymtpl

@nitishymtpl Yes, we just released the Agents 1.0 RC. There is an example of how to switch participants: https://github.com/livekit/agents/blob/dev-1.0/examples/voice_agents/toggle_io.py#L14

Check here https://github.com/livekit/agents?tab=readme-ov-file#-10-release- to see how to install the RC version.

longcw avatar Mar 27 '25 13:03 longcw

Hi @longcw, thanks a lot! However, it seems the AI agent can only hear one participant, not the other participants.

nitishymtpl avatar Mar 27 '25 15:03 nitishymtpl

You may create multiple RoomIO instances, with only one of them having output enabled and each connecting its audio input to a different participant. But you then need to handle the multiple audio inputs, e.g. mixing them or sending them to different STT instances.
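
If you go the mixing route, the core of it is just summing the participants' PCM frames before handing one stream to the STT. A rough, framework-agnostic sketch (assuming 16-bit mono frames of equal length at the same sample rate; the LiveKit stream plumbing is left out):

```python
import numpy as np


def mix_pcm16(frames: list[bytes]) -> bytes:
    """Mix equal-length 16-bit mono PCM buffers from several participants into one."""
    if not frames:
        return b""
    mixed = np.zeros(len(frames[0]) // 2, dtype=np.int32)
    for frame in frames:
        # Widen to int32 so the sum doesn't wrap around before clipping.
        mixed += np.frombuffer(frame, dtype=np.int16).astype(np.int32)
    # Clamp back into the int16 range to avoid overflow distortion.
    return np.clip(mixed, -32768, 32767).astype(np.int16).tobytes()
```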

longcw avatar Mar 28 '25 01:03 longcw

Hi @longcw, the only problem is that the AI agent can listen to only one participant. Since the AI agent sends its output into the room, both participants can already hear the AI agent's responses.

nitishymtpl avatar Mar 28 '25 02:03 nitishymtpl

Having multiple users in the room could be an interesting use case.

Reading and mixing audio streams from multiple participants is feasible, but the problem is how the AI agent hears from both: does a realtime API or an STT model take the audio? You would need the model to support distinguishing the speakers.

longcw avatar Mar 28 '25 02:03 longcw

We could use either one: a realtime API or an STT pipeline.

nitishymtpl avatar Mar 28 '25 02:03 nitishymtpl

Having multiple users in the room could be an interesting use case.

Reading and mixing audio streams from multiple participants is feasible, but the problem is how the AI agent hears from both: does a realtime API or an STT model take the audio? You would need the model to support distinguishing the speakers.

I don’t think any publicly available LLM API currently supports multi-participant audio natively, and I doubt we’ll see that change soon. However, here are a few possible approaches:

  1. The LLM or speech-to-text model supports speaker diarization (best case). This allows it to distinguish between different speakers automatically.

  2. The model accepts metadata along with the audio input. In this case, you could manually label each speaker and prepend their identity before merging the audio. This would require some preprocessing (see the sketch after this list).

  3. A complex and hacky workaround (not recommended for anything beyond experimental stages): you could prepend each participant’s audio with a short TTS-generated snippet like “Username: {username}” and cache these in memory. Then, when a participant speaks, you detect the activity, insert their identifier snippet before their actual audio, and send the whole thing to the LLM. For example, the final audio sent might sound like: “Username: Alice. Hey, can you find my data?” This gives the LLM a chance to infer who is speaking. But again, this approach is messy and error-prone, and should really only be used in prototypes.
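
For option 2, the text-level version of the idea is simple: label each final STT transcript with the speaker's identity before it goes into the LLM's chat history. A minimal, framework-agnostic sketch (the identity string and the chat-history shape are placeholders here, not a LiveKit API):

```python
from dataclasses import dataclass, field


@dataclass
class SpeakerLabeledHistory:
    """Chat history where every user turn is prefixed with who said it."""

    messages: list[dict] = field(default_factory=list)

    def add_user_turn(self, participant_identity: str, transcript: str) -> None:
        # Prefix the speaker so the LLM can tell participants apart.
        self.messages.append(
            {"role": "user", "content": f"{participant_identity}: {transcript}"}
        )


history = SpeakerLabeledHistory()
history.add_user_turn("alice", "Hey, can you find my data?")
history.add_user_turn("bob", "And check mine too, please.")
# The LLM now sees "alice: Hey, can you find my data?" etc. in its context.
```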

OrkhanGG avatar May 19 '25 17:05 OrkhanGG

Hey @OrkhanGG, @theomonnom, any updates or corresponding examples?

Khalid1G avatar Jun 22 '25 13:06 Khalid1G

Hello guys, do we have any update regarding this? It would be extremely useful if you could provide an example of how we can have one agent that supports conversations with multiple participants!

panosmoschos avatar Jul 02 '25 19:07 panosmoschos

Hey @OrkhanGG, @theomonnom, any updates or corresponding examples?

I initially achieved this using some very hacky methods by building my own agent implementation from scratch with LiveKit. However, with the new Gemini Live API model, the process is now much easier and more robust. I'm still in the process of refactoring the code, but if you try the "Gemini 2.5 Flash Preview Native Audio Dialog", things get even simpler (you can provide model name as 'gemini-2.5-flash-preview-native-audio-dialog'). It supports proactive dialogue and several additional parameters that can be useful for multi-participant audio rooms.

You can check out this example to get started: https://github.com/livekit/agents/blob/main/examples/voice_agents/push_to_talk.py

There are also some advanced possibilities you might explore, like detecting the active speaker and having the agent listen only to them. Supporting simultaneous audio and video input from all participants would be a huge win, though I don't think current LLMs can easily handle that out of the box. Any alternative solution would likely require enterprise-level effort to bring into production.
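
Roughly, the session setup I'm describing looks like this (written from memory, so double-check the plugin import path and constructor arguments against the current livekit-plugins-google docs):

```python
from livekit import agents
from livekit.agents import Agent, AgentSession
from livekit.plugins import google


async def entrypoint(ctx: agents.JobContext):
    await ctx.connect()

    session = AgentSession(
        # Model name as mentioned above; the RealtimeModel path is from memory.
        llm=google.beta.realtime.RealtimeModel(
            model="gemini-2.5-flash-preview-native-audio-dialog",
        ),
    )
    await session.start(
        room=ctx.room,
        agent=Agent(instructions="You are assisting everyone in this room."),
    )


if __name__ == "__main__":
    agents.cli.run_app(agents.WorkerOptions(entrypoint_fnc=entrypoint))
```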

OrkhanGG avatar Jul 02 '25 22:07 OrkhanGG

Good job. I'll test it

Nicj228 avatar Jul 02 '25 22:07 Nicj228

Is the code in the example tied to the model you are using? I mean, if it works on Gemini, can I make it work with OpenAI with some fine-tuning or not? I'm asking because our whole infra is currently on OpenAI!

panosmoschos avatar Jul 02 '25 22:07 panosmoschos

Yes, you can definitely do that. The example I shared above demonstrates how to set up an agent that focuses on a specific participant. The trick is to create the illusion that the agent is listening to all participants.

If you don’t want the agent to focus on a specific participant based on client-side events, you can use something like active speaker detection instead. This might require a few extra steps to implement properly, but in my opinion, it's definitely worth considering and is very likely to work well with OpenAI's real-time voice API.
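
The active-speaker piece can hang off the room's speaker events, something like this (the `active_speakers_changed` event name is what I remember from the Python SDK; `switch_agent_to` is a placeholder for whatever re-links the agent's input, e.g. the participant-switching API mentioned earlier in this thread):

```python
from typing import Callable

from livekit import rtc


def follow_active_speaker(room: rtc.Room, switch_agent_to: Callable[[str], None]) -> None:
    """Re-point the agent at whoever is currently speaking, ignoring the agent itself."""

    @room.on("active_speakers_changed")
    def _on_active_speakers(speakers: list[rtc.Participant]) -> None:
        humans = [
            p for p in speakers
            if p.identity != room.local_participant.identity  # skip the agent's own audio
        ]
        if humans:
            # Hand the conversation to the first human in the active-speaker list.
            switch_agent_to(humans[0].identity)
```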

OrkhanGG avatar Jul 02 '25 23:07 OrkhanGG

@OrkhanGG How is this going to work with the Gemini Live API, given that it includes built-in VAD-based turn detection, which is currently the only supported method? Is there any way to make it manual? Can you provide an example?

Khalid1G avatar Jul 02 '25 23:07 Khalid1G

The project I’m working on is private, but I’ll put together a public demo soon and share it here unless the LiveKit team beats me to it! Hopefully, it’ll clear up your main concern about the “two drivers, one steering wheel” (two VADs and one LLM) setup.

OrkhanGG avatar Jul 03 '25 03:07 OrkhanGG

It would be amazing to have something like this. @OrkhanGG, let us know if it works for you.

MatheusRDG avatar Jul 14 '25 20:07 MatheusRDG