Guidance on Hybrid Integration of Gemini Realtime API with VoicePipelineAgent
Hi LiveKit team,
I’m building an AI phone calling system for small and medium businesses (SMBs) using LiveKit Agents, with a focus on supporting a language with gender-specific grammar and requiring robust speech handling in noisy environments. I’m integrating Google’s Gemini Realtime Multimodal API (gemini-2.0-flash) for its combined STT and LLM capabilities, paired with Google TTS for voice output. My attempts to fit this into VoicePipelineAgent have hit significant hurdles, and I’d appreciate your advice on the best approach to achieve low latency, persistent session management, and interruption handling.
I’ve chosen Gemini’s Realtime API over separate STT/LLM solutions for these reasons:
- Robust Speech Understanding: It excels in harsh environments (e.g., background noise), outperforming other STT options I’ve tested, which is critical for business calls.
- Emotional Tone Detection: It can interpret the user’s emotional state, vital for tailoring responses in a business context.
- Gender Recognition: In my language, many words are written identically but pronounced differently based on gender. Gemini detects male/female speakers, enabling my system instructions to address users correctly (e.g., adjusting grammar).
- Cost-Effectiveness: Combining STT and LLM in one model is the most economical solution I’ve found, reducing API costs significantly.
From what I understand, LiveKit Agents supports two main integration paths:
- Realtime API (e.g., livekit.plugins.google.beta.realtime_api): Direct use of RealtimeModel and GeminiRealtimeSession for audio input and text output. Great for low-latency, unified STT/LLM processing, but lacks built-in TTS or interruption handling.
- VoicePipelineAgent: Structured STT → LLM → TTS pipeline with VAD-based interruptions (e.g., using Silero VAD). Ideal for TTS integration and conversation flow, but assumes separate STT and LLM components.
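To make the contrast concrete, here's roughly how I understand each path gets wired (sketched from memory against the plugin APIs, so class names and kwargs may be slightly off):

```python
from livekit.agents import JobContext
from livekit.agents.pipeline import VoicePipelineAgent
from livekit.plugins import google, silero


async def entrypoint(ctx: JobContext):
    await ctx.connect()
    participant = await ctx.wait_for_participant()

    # Path 1: Realtime API. One model covers STT + LLM over a streaming session,
    # but TTS and interruption handling are left to me.
    realtime_model = google.beta.realtime.RealtimeModel(
        model="gemini-2.0-flash",  # assumption: kwarg names/values from memory
        instructions="You are a phone agent for an SMB...",
    )
    # ...then drive the GeminiRealtimeSession from realtime_model directly...

    # Path 2: VoicePipelineAgent. Cascaded STT -> LLM -> TTS with VAD-based
    # interruptions, but it expects separate STT and LLM components.
    agent = VoicePipelineAgent(
        vad=silero.VAD.load(),
        stt=google.STT(),
        llm=google.LLM(model="gemini-2.0-flash"),
        tts=google.TTS(),
    )
    agent.start(ctx.room, participant)
```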
I’m trying to create a hybrid:
- Use Gemini’s Realtime API for audio input, generating both transcription (for logging) and response (for TTS) in one stream.
- Leverage VoicePipelineAgent’s TTS and VAD-based interruption handling (e.g., user says “wait” to stop TTS).
My attempts to combine these approaches have been problematic:
- Dummy STT + LLM-Driven: STT emits placeholders, LLM feeds audio to Gemini and streams responses.
- Combined Output Parsing: Prompt Gemini to output (transcription) Response: response, parsed by adapters (roughly sketched below).
- Separate STT/LLM Adapters: STT uses input_speech_transcription_completed, LLM uses response_content_added.
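For context, the parsing side of the Combined Output Parsing attempt boils down to a helper like this (purely illustrative; the prompt format and the helper are my own convention, not anything provided by the plugins):

```python
import re

# Assumed prompt contract (my own convention, not a Gemini or LiveKit feature):
# the model is instructed to reply as "(<transcription>) Response: <response>".
_COMBINED_RE = re.compile(
    r"^\((?P<transcription>.*?)\)\s*Response:\s*(?P<response>.*)$", re.DOTALL
)


def split_combined_output(text: str) -> tuple[str, str]:
    """Split Gemini's combined output into (transcription, response).

    Falls back to treating the whole text as the response when the model
    ignores the format, which happens often enough to make this fragile.
    """
    match = _COMBINED_RE.match(text.strip())
    if match is None:
        return "", text.strip()
    return match.group("transcription").strip(), match.group("response").strip()


# split_combined_output("(I need to book a table) Response: Sure, for how many people?")
# -> ("I need to book a table", "Sure, for how many people?")
```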
The core problem is that I lose VoicePipelineAgent's interruption handling whenever I lean on the Realtime API's unified output, and every attempt to marry the two has raised more issues than it solved.
Options I’m Considering
- Custom Agent:
  - Build a new agent directly using RealtimeModel, handling audio, Gemini events, TTS, and VAD interruptions.
  - Pros: Low latency, full control over Gemini’s stream.
  - Cons: High effort, loses pipeline features (e.g., function calls).
  - Question: Is this a practical solution, or overkill for my use case?
- Separate STT/LLM Plugins:
  - Modify livekit/plugins/google/ to add GoogleRealtimeSTT and GoogleRealtimeLLM, sharing one GeminiRealtimeSession tied to the room lifecycle.
  - Pros: Fits the pipeline, cleaner session management.
  - Cons: Potential session conflicts or event timing issues.
  - Question: Is sharing a session across plugins viable? Any known risks with GeminiRealtimeSession?
- Refined Hybrid Parsing:
  - Stick with adapters, improving buffering/parsing of Gemini’s combined output.
  - Pros: Simple, uses the pipeline.
  - Cons: Fragile parsing, slight latency hit.
  - Question: Can VoicePipelineAgent be adapted to handle unified STT/LLM output better?
I’d love your perspective on:
- Best Approach: Which option aligns best with LiveKit Agents for my hybrid needs? Any you’d recommend?
- Session Persistence: How should I keep a GeminiRealtimeSession open across a room’s lifecycle? Is a manager class (e.g., tied to JobContext) the right approach? (A rough sketch of what I have in mind follows this list.)
- Event Sync: Gemini lacks a “response complete” event—any tips for aligning its streaming output with VoicePipelineAgent’s flow?
- Pipeline Flexibility: Are there ways to tweak VoicePipelineAgent to support a combined STT/LLM model without breaking interruption handling?
- Alternatives: If VoicePipelineAgent isn’t ideal, what’s a lighter LiveKit abstraction for custom realtime AI flows with WebRTC?
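For the session-persistence question, this is the kind of manager class I have in mind: one GeminiRealtimeSession per room, created lazily and torn down when the job ends. It's only a sketch; the session() and aclose() calls are my assumptions about the plugin's API, not something I've verified:

```python
from livekit.agents import JobContext
from livekit.plugins import google


class RealtimeSessionManager:
    """One realtime session per room/job, shared by the STT- and LLM-side adapters."""

    def __init__(self, ctx: JobContext):
        self._ctx = ctx
        self._model = google.beta.realtime.RealtimeModel(
            model="gemini-2.0-flash",  # assumption: kwarg name/value from memory
        )
        self._session = None
        # Tear the Gemini session down when the job (room) ends.
        ctx.add_shutdown_callback(self._aclose)

    def get_session(self):
        if self._session is None:
            # Assumption: RealtimeModel exposes a session factory roughly like this;
            # check GeminiRealtimeSession in the plugin for the real entry point.
            self._session = self._model.session()
        return self._session

    async def _aclose(self):
        if self._session is not None:
            await self._session.aclose()  # assumption: async close method name
            self._session = None
```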
Bump. I think this is an important question to address.
The realtime API solution captures speech dimensions that plain STT can't yet, but it isn't capable enough right now to operate alone unless we chain it with TTS or an LLM. This is especially true when the task instructions are 'heavy'.
Hey, this is super important, and it's not just a question; it's a feature request, and the feature is:
Support the Gemini Realtime models, in Half Duplex, with Tool Calling
@longcw @JosephDahan @theomonnom
Gemini refers to this as "Half Cascade", but we might also call it "Half Duplex", and it is the recommended way to use Gemini realtime models when you need tool use to work... which is basically anything worth doing.
Check this out:
"Audio inputs and audio outputs negatively impact the model's ability to use function calling." (Google documentation, Limitations of the Live API and Gemini 2.0)
"Half-cascade audio with Gemini 2.0 Flash: This option, available with the gemini-2.0-flash-live-001 model, uses a cascaded model architecture (native audio input and text-to-speech output). It offers better performance and reliability in production environments, especially with TOOL USE." (Google docs on the Live API / realtime models)
Getting this right is not as simple as just setting AgentSession(llm=google.Realtime(...), tts=cartesia.TTS(...)). The reason is that AgentSession conceptually has either the cascade-style pipeline or the realtime full-duplex pipeline, but it will definitely not connect a provided tts to the output of any instance of RealtimeModel for any reason.
Further, if you do something like the snippet below, which is kind of a reasonable way to "connect the LLM output to the TTS node", all kinds of things go haywire:
```python
from livekit.agents import AgentSession, JobContext
from livekit.plugins import cartesia, google


async def entrypoint(ctx: JobContext):
    realtime_model = google.beta.realtime.RealtimeModel()  # realtime LLM (audio in)

    session = AgentSession(
        llm=realtime_model,
        tts=cartesia.TTS(),
    )

    @session.on("conversation_item_added")
    def on_conversation_item(ev):
        reasons = ...  # placeholder: whatever condition decides when to re-speak the output
        if reasons:
            # Use the session TTS to say the output of the realtime_model
            session.say(ev.item.text_content, add_to_chat_ctx=False)
```
I would definitely not consider this work complete until and unless there is a clean example in livekit/examples/agents
@amfleming realtime model (audio in, text out) with a separate TTS model is on our roadmap.
@longcw Will you also add text in, audio out to use the gemini realtime api as a TTS model along with some other STT model?
@tesla1900 we have a Gemini TTS in beta (#2834), but it was slow when I tested it.
@longcw Indeed, I am aware of the Gemini TTS and its slow output speed. I was talking about allowing the realtime model (Gemini 2.5 Flash native audio dialog) to take text in and produce audio out, basically acting like a TTS model but with low latency, since native audio dialog is pretty quick at speech generation.
Hi there @amfleming @longcw, May I ask if there's any update on LiveKit supporting Gemini Live in half-duplex mode? This is the only feature keeping us from migrating our agent to LiveKit 🤞 Thanks! Guido
@coccoinomane do you mean audio in -> llm output -> tts synthesis? If so, we already support it. Here's an example: https://github.com/livekit/agents/blob/main/examples/voice_agents/realtime_with_tts.py
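For anyone landing on this thread later, that example boils down to roughly the following (from memory; the exact option for selecting text-only output may differ, so treat the linked file as the source of truth):

```python
from livekit.agents import Agent, AgentSession, JobContext, WorkerOptions, cli
from livekit.plugins import cartesia, google


async def entrypoint(ctx: JobContext):
    await ctx.connect()

    session = AgentSession(
        # Gemini realtime does audio in -> text out; assumption: text-only output
        # is selected via a modalities-style option (see the linked example).
        llm=google.beta.realtime.RealtimeModel(modalities=["TEXT"]),
        # A separate TTS voices the realtime model's text output.
        tts=cartesia.TTS(),
    )
    await session.start(agent=Agent(instructions="You are a helpful voice agent."), room=ctx.room)


if __name__ == "__main__":
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))
```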