
Response modalities don't work with the Gemini 2.5 Live API

Open kdzapp-botco opened this issue 6 months ago • 6 comments

Description of the bug:

Setting the response modalities to 'TEXT' does not work. In addition, the latest 2.5 Live API model does not appear to return any response that includes text.

Actual vs expected behavior:

Allow the developer to specify "TEXT", "AUDIO", or both for the response modalities.

Any other information you'd like to share?

No response

kdzapp-botco avatar May 27 '25 18:05 kdzapp-botco

I've encountered the same problem. May I ask if you've already solved it?

zish-rob-crur avatar Jun 03 '25 09:06 zish-rob-crur

Having the same issue! It's pretty important to be able to get just text back from this API for what I'm building.

michaellee1 avatar Jun 03 '25 23:06 michaellee1

The “native audio” models are designed for voice interactions. If you require only text output, consider using the gemini-2.0-flash-live-001 model.

Thanks

Gunand3043 avatar Jun 05 '25 08:06 Gunand3043
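For reference, a minimal sketch of a text-only Live session with the suggested model, following the `google-genai` Python SDK's documented `client.aio.live.connect` pattern. The helper name `run_text_turn` and the prompt are illustrative, not part of the SDK:

```python
import asyncio

# Model and config suggested above; "TEXT" restricts the model to text-only output.
MODEL = "gemini-2.0-flash-live-001"
CONFIG = {"response_modalities": ["TEXT"]}

async def run_text_turn(prompt: str) -> str:
    """Send one text turn over a Live session and collect the streamed reply."""
    from google import genai  # requires the google-genai package and an API key

    client = genai.Client()  # reads GOOGLE_API_KEY from the environment
    reply = []
    async with client.aio.live.connect(model=MODEL, config=CONFIG) as session:
        await session.send_client_content(
            turns={"role": "user", "parts": [{"text": prompt}]},
            turn_complete=True,
        )
        # Responses stream in as chunks; concatenate them into the full answer.
        async for response in session.receive():
            if response.text is not None:
                reply.append(response.text)
    return "".join(reply)
```

Calling it (e.g. `asyncio.run(run_text_turn("Hello!"))`) requires a valid API key and network access, so the session code is kept inside the function rather than run at import time.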

2.0 vs. 2.5 is a big difference in terms of performance; text output with the quality we get from 2.5 is a pretty essential feature. Ideally the model would support both.

kdzapp-botco avatar Jun 05 '25 17:06 kdzapp-botco

Agree! Right now the difference is big enough for me that I'm setting the output to audio, taking the transcriptions, and throwing away the audio, rather than using the 2.0 model.

michaellee1 avatar Jun 05 '25 17:06 michaellee1

Hey @Gunand3043

```python
config = {
    "response_modalities": ["AUDIO"],
    "output_audio_transcription": {},
    "input_audio_transcription": {},
}
```

The transcription is working, but we are receiving the transcript in chunks (partial segments) rather than as a complete, final transcript.

Is there a specific parameter or setting we can use to get the full transcript as a single, finalized output instead of incremental partial transcriptions?

Thanks in advance for your help!

vikramra-kore avatar Jun 12 '25 14:06 vikramra-kore

> Is there a specific parameter or setting we can use to get the full transcript as a single, finalized output instead of incremental partial transcriptions?

This is a live model: the goal is to give you answers as fast as possible, so it can start talking without waiting for the end of the message. It therefore has to use streaming, and thus chunks.

Giom-V avatar Aug 07 '25 12:08 Giom-V
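Since there is no server-side setting for a single final transcript, the usual workaround is to buffer the partial segments on the client and join them once the turn completes. A minimal sketch; the class name and chunk handling are my own, not part of the SDK:

```python
class TranscriptAccumulator:
    """Buffer streamed transcription fragments and emit one final string."""

    def __init__(self):
        self._parts = []

    def add(self, text):
        # Partial segments may be None or empty; keep only real text.
        if text:
            self._parts.append(text)

    def finalize(self):
        # Join everything seen so far and reset for the next turn.
        transcript = "".join(self._parts)
        self._parts.clear()
        return transcript

acc = TranscriptAccumulator()
for chunk in ["Hel", "lo, ", None, "world", "."]:
    acc.add(chunk)
print(acc.finalize())  # → Hello, world.
```

In a real session you would call `add` with each partial transcription text as it arrives, and `finalize` when the server signals the turn is complete.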

> Agree! Right now the difference is big enough for me that I'm setting to audio, taking the transcriptions and throwing away the audio, not using the 2.0 model.

Hey @michaellee1! When you do that, you still pay for audio output, right? That's quite a bit more expensive, from what I've seen on the pricing page: https://ai.google.dev/gemini-api/docs/pricing#gemini-2.5-flash-native-audio. I still don't understand why we can't just have text output like with OpenAI models 😕

2010b9 avatar Oct 17 '25 15:10 2010b9