The `gemini-2.5-flash-native-audio-preview-12-2025` model cannot be used with the text modality in a hybrid architecture with a separate TTS plugin
Bug Description
Error message
websockets.exceptions.ConnectionClosedError: received 1007 (invalid frame payload data) Cannot extract voices from a non-audio request.
Code to reproduce
from livekit.agents import AgentSession
from livekit.plugins import google, silero

session = AgentSession(
    llm=google.realtime.RealtimeModel(
        model="gemini-2.5-flash-native-audio-preview-12-2025",
        modalities=["text"],
    ),
    tts=<YOUR_CUSTOM_TTS>,  # e.g., elevenlabs.TTS(), deepgram.TTS()
    vad=silero.VAD.load(),
)
Expected Behavior
When setting modalities=[Modality.TEXT], the Gemini Live API should return text-only responses, allowing the agent to use a separate TTS plugin for speech synthesis (half-cascade architecture).
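For reference, the same wiring works when the Live model still supports text output. Below is a minimal sketch of the intended half-cascade setup; the model name `gemini-2.0-flash-live-001` and the ElevenLabs TTS plugin are illustrative choices, not part of this report:

# Sketch of the expected half-cascade setup; model and TTS choices are assumptions.
from livekit.agents import AgentSession
from livekit.plugins import elevenlabs, google, silero
from livekit.plugins.google.realtime import Modality

session = AgentSession(
    llm=google.realtime.RealtimeModel(
        model="gemini-2.0-flash-live-001",  # assumed: a Live model that still allows TEXT
        modalities=[Modality.TEXT],         # the LLM produces text only
    ),
    tts=elevenlabs.TTS(),   # a separate TTS plugin handles speech synthesis
    vad=silero.VAD.load(),
)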
Reproduction Steps
from livekit.agents import AgentSession
from livekit.plugins import google, silero
from livekit.plugins.google.realtime import Modality

session = AgentSession(
    llm=google.realtime.RealtimeModel(
        model="gemini-2.5-flash-native-audio-preview-12-2025",
        modalities=[Modality.TEXT],
    ),
    tts=<YOUR_CUSTOM_TTS>,  # e.g., elevenlabs.TTS(), deepgram.TTS()
    vad=silero.VAD.load(),
)
Operating System
Ubuntu 22.04
Models Used
Deepgram, Google, ElevenLabs
Package Versions
livekit-agents==1.3.10
Session/Room/Call IDs
No response
Proposed Solution
No response
Additional Context
No response
Screenshots and Recordings
No response
@tinalenguyen #4414
For the Gemini API / AI Studio, it throws the error `received 1007 (invalid frame payload data) Cannot extract voices from a non-audio request.`, followed by `sent 1007 (invalid frame payload data) Cannot extract voices from a non-audio request.`
import asyncio

from dotenv import load_dotenv
from google import genai

load_dotenv(".env")

# The API key is read from GEMINI_API_KEY / GOOGLE_API_KEY in .env
client = genai.Client(
    vertexai=False,
)
model = "gemini-2.5-flash-native-audio-preview-12-2025"
config = {"response_modalities": ["TEXT"]}

async def main():
    async with client.aio.live.connect(model=model, config=config) as session:
        message = "Hello, how are you?"
        await session.send_client_content(turns=message, turn_complete=True)
        # Reading the server's reply is where the ConnectionClosedError surfaces
        async for response in session.receive():
            print(response.text)

if __name__ == "__main__":
    asyncio.run(main())
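For contrast, the same AI Studio model is expected to connect when only audio output is requested, since audio is what the native-audio models are built for. A sketch under that assumption, differing from the reproducer above only in the response modality:

import asyncio

from dotenv import load_dotenv
from google import genai

load_dotenv(".env")
client = genai.Client(vertexai=False)
model = "gemini-2.5-flash-native-audio-preview-12-2025"
# Only this line differs from the failing reproducer above:
config = {"response_modalities": ["AUDIO"]}

async def main():
    async with client.aio.live.connect(model=model, config=config) as session:
        await session.send_client_content(turns="Hello, how are you?", turn_complete=True)
        async for response in session.receive():
            if response.data:  # raw PCM audio bytes from the model
                print(f"received {len(response.data)} bytes of audio")

if __name__ == "__main__":
    asyncio.run(main())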
For Vertex AI, text mode is no longer supported by native audio models; the request fails with `Text output is not supported for native audio output model.`
import asyncio

from dotenv import load_dotenv
from google import genai

load_dotenv(".env")

# The project is read from GOOGLE_CLOUD_PROJECT in .env
client = genai.Client(
    vertexai=True,
    location="us-central1",
)
model = "gemini-live-2.5-flash-native-audio"
config = {"response_modalities": ["TEXT"]}

async def main():
    async with client.aio.live.connect(model=model, config=config) as session:
        message = "Hello, how are you?"
        await session.send_client_content(turns=message, turn_complete=True)
        # The session is rejected with the "Text output is not supported" error
        async for response in session.receive():
            print(response.text)

if __name__ == "__main__":
    asyncio.run(main())
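Until this changes server-side, one mitigation is to probe a model at startup and fall back to one that still accepts text. A minimal sketch against the AI Studio endpoint, reusing only the error class from the traceback above; the fallback model name is an assumption:

import asyncio

from dotenv import load_dotenv
from google import genai
from websockets.exceptions import ConnectionClosedError

load_dotenv(".env")
client = genai.Client(vertexai=False)

async def supports_text(model: str) -> bool:
    """Probe whether a Live model accepts a TEXT-only session."""
    config = {"response_modalities": ["TEXT"]}
    try:
        async with client.aio.live.connect(model=model, config=config) as session:
            await session.send_client_content(turns="ping", turn_complete=True)
            async for _ in session.receive():
                break  # got a reply, so text output works
        return True
    except ConnectionClosedError:
        return False  # e.g., "Cannot extract voices from a non-audio request."

async def main():
    # The fallback model name is a placeholder; substitute a Live model known to support TEXT.
    for model in ("gemini-2.5-flash-native-audio-preview-12-2025", "gemini-2.0-flash-live-001"):
        if await supports_text(model):
            print(f"use {model} with modalities=[Modality.TEXT]")
            return
    print("no text-capable Live model available")

if __name__ == "__main__":
    asyncio.run(main())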
I have reported a similar issue to Google, but based on the latest error message it seems they aren't going to fix it: https://github.com/googleapis/python-genai/issues/1780
Hi @chenghao-mou, thanks for the response. Looking forward to the Google fix.