
The `gemini-2.5-flash-native-audio-preview-12-2025` model cannot be used with `modalities=["text"]` in a hybrid architecture with a separate TTS plugin

Open sagorbrur opened this issue 3 weeks ago • 3 comments

Bug Description

Error message

websockets.exceptions.ConnectionClosedError: received 1007 (invalid frame payload data) Cannot extract voices from a non-audio request.

Code to reproduce

from livekit.agents import AgentSession
from livekit.plugins import google, silero

session = AgentSession(
    llm=google.realtime.RealtimeModel(
        model="gemini-2.5-flash-native-audio-preview-12-2025",
        modalities=["text"],
    ),
    tts=<YOUR_CUSTOM_TTS>,  # e.g., elevenlabs.TTS(), deepgram.TTS()
    vad=silero.VAD.load(),
)

Expected Behavior

When setting modalities=[Modality.TEXT], the Gemini Live API should return text-only responses, allowing the agent to use a separate TTS plugin for speech synthesis (half-cascade architecture).

Reproduction Steps

from livekit.agents import AgentSession
from livekit.plugins import google, silero
from livekit.plugins.google.realtime import Modality

session = AgentSession(
    llm=google.realtime.RealtimeModel(
        model="gemini-2.5-flash-native-audio-preview-12-2025",
        modalities=[Modality.TEXT],
    ),
    tts=<YOUR_CUSTOM_TTS>,  # e.g., elevenlabs.TTS(), deepgram.TTS()
    vad=silero.VAD.load(),
)
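Until this is resolved upstream, one way to avoid the hard connection failure is to check the model name before requesting text-only output. The helper below is a hypothetical sketch (not part of livekit-agents), built on the assumption from this thread that Gemini Live models with `native-audio` in their name reject TEXT-only response modalities:

```python
# Hypothetical guard (not a livekit-agents API): decide the response
# modalities before constructing the RealtimeModel, falling back to
# audio output (no separate TTS) for native-audio models.

def supports_text_modality(model: str) -> bool:
    """Heuristic based on this issue: Gemini Live models with
    'native-audio' in the name currently reject TEXT-only output."""
    return "native-audio" not in model


def choose_modalities(model: str) -> list[str]:
    """Return ["text"] for the half-cascade setup when supported,
    otherwise ["audio"] so the connection is not rejected."""
    return ["text"] if supports_text_modality(model) else ["audio"]


choose_modalities("gemini-2.5-flash-native-audio-preview-12-2025")  # ["audio"]
choose_modalities("gemini-live-2.5-flash")  # ["text"]
```

Falling back to `["audio"]` means the model's own voice is used instead of the custom TTS plugin, so this is a degraded-mode workaround, not a fix.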

Operating System

Ubuntu 22.04

Models Used

Deepgram, Google, ElevenLabs

Package Versions

livekit-agents==1.3.10

Session/Room/Call IDs

No response

Proposed Solution

No response
Additional Context

No response

Screenshots and Recordings

No response

sagorbrur avatar Dec 31 '25 03:12 sagorbrur

@tinalenguyen #4414

sagorbrur avatar Dec 31 '25 03:12 sagorbrur

For the Gemini API (AI Studio): the connection fails with `received 1007 (invalid frame payload data) Cannot extract voices from a non-audio request.`, followed by `sent 1007 (invalid frame payload data) Cannot extract voices from a non-audio request.`

import asyncio

from dotenv import load_dotenv
from google import genai

load_dotenv(".env")

client = genai.Client(
    vertexai=False,
)

model = "gemini-2.5-flash-native-audio-preview-12-2025"
config = {"response_modalities": ["TEXT"]}


async def main():
    async with client.aio.live.connect(model=model, config=config) as session:
        message = "Hello, how are you?"
        await session.send_client_content(turns=message, turn_complete=True)


if __name__ == "__main__":
    asyncio.run(main())

For Vertex AI: text mode is no longer supported by native audio models; the API rejects it explicitly with `Text output is not supported for native audio output model.`

import asyncio

from dotenv import load_dotenv
from google import genai

load_dotenv(".env")

client = genai.Client(
    vertexai=True,
    location="us-central1",
)

model = "gemini-live-2.5-flash-native-audio"
config = {"response_modalities": ["TEXT"]}


async def main():
    async with client.aio.live.connect(model=model, config=config) as session:
        message = "Hello, how are you?"
        await session.send_client_content(turns=message, turn_complete=True)


if __name__ == "__main__":
    asyncio.run(main())

I have reported a similar issue to Google, but based on the latest error message it seems they are not going to fix it: https://github.com/googleapis/python-genai/issues/1780

chenghao-mou avatar Dec 31 '25 08:12 chenghao-mou

Hi @chenghao-mou, thanks for the response. Looking forward to a fix from Google.

sagorbrur avatar Dec 31 '25 10:12 sagorbrur