
Response modalities don't work with the Gemini 2.5 Live API

Open kdzapp-botco opened this issue 6 months ago • 6 comments

Description of the bug:

Setting the response modalities to 'TEXT' does not work. In addition, the latest 2.5 Live API model does not appear to return any response that includes text.

Actual vs expected behavior:

Allow the developer to specify "TEXT", "AUDIO", or both for the response modalities.

Any other information you'd like to share?

No response

kdzapp-botco avatar May 27 '25 18:05 kdzapp-botco

I've encountered the same problem. May I ask if you've already solved it?

zish-rob-crur avatar Jun 03 '25 09:06 zish-rob-crur

Having the same issue! It's pretty important to be able to get just text back from this API for what I'm building.

michaellee1 avatar Jun 03 '25 23:06 michaellee1

The “native audio” models are designed for voice interactions. If you require only text output, consider using the gemini-2.0-flash-live-001 model.

Thanks

Gunand3043 avatar Jun 05 '25 08:06 Gunand3043
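For reference, a minimal sketch of a text-only Live session with the suggested model, following the `google-genai` Python SDK's documented `client.aio.live.connect` pattern. The helper name `run_text_turn` and the prompt are illustrative, not part of the SDK:

```python
import asyncio

# Model and config suggested above; "TEXT" restricts the model to text-only output.
MODEL = "gemini-2.0-flash-live-001"
CONFIG = {"response_modalities": ["TEXT"]}

async def run_text_turn(prompt: str) -> str:
    """Send one text turn over a Live session and collect the streamed reply."""
    from google import genai  # requires the google-genai package and an API key

    client = genai.Client()  # reads GOOGLE_API_KEY from the environment
    reply = []
    async with client.aio.live.connect(model=MODEL, config=CONFIG) as session:
        await session.send_client_content(
            turns={"role": "user", "parts": [{"text": prompt}]},
            turn_complete=True,
        )
        # Responses stream in as chunks; concatenate them into the full answer.
        async for response in session.receive():
            if response.text is not None:
                reply.append(response.text)
    return "".join(reply)
```

Calling it (e.g. `asyncio.run(run_text_turn("Hello!"))`) requires a valid API key and network access, so the session code is kept inside the function rather than run at import time.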

2.0 vs. 2.5 is a big difference in terms of performance; text output with the quality we get from 2.5 is a pretty essential feature. Ideally the model would support both.

kdzapp-botco avatar Jun 05 '25 17:06 kdzapp-botco

Agree! Right now the difference is big enough for me that I'm setting the output to audio, taking the transcriptions, and throwing away the audio, rather than using the 2.0 model.

michaellee1 avatar Jun 05 '25 17:06 michaellee1

Hey @Gunand3043

```python
config = {
    "response_modalities": ["AUDIO"],
    "output_audio_transcription": {},
    "input_audio_transcription": {},
}
```

The transcription is working, but we are receiving the transcript in chunks (partial segments) rather than as a complete, final transcript.

Is there a specific parameter or setting we can use to get the full transcript as a single, finalized output instead of incremental partial transcriptions?

Thanks in advance for your help!

vikramra-kore avatar Jun 12 '25 14:06 vikramra-kore

> Is there a specific parameter or setting we can use to get the full transcript as a single, finalized output instead of incremental partial transcriptions?

This is a live model: the goal is to give you answers as fast as possible, so it can start talking without waiting for the end of the message. It therefore has to use streaming, and thus chunks.

Giom-V avatar Aug 07 '25 12:08 Giom-V
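Since there is no server-side setting for a single final transcript, the usual workaround is to buffer the partial segments on the client and join them once the turn completes. A minimal sketch; the class name and chunk handling are my own, not part of the SDK:

```python
class TranscriptAccumulator:
    """Buffer streamed transcription fragments and emit one final string."""

    def __init__(self):
        self._parts = []

    def add(self, text):
        # Partial segments may be None or empty; keep only real text.
        if text:
            self._parts.append(text)

    def finalize(self):
        # Join everything seen so far and reset for the next turn.
        transcript = "".join(self._parts)
        self._parts.clear()
        return transcript

acc = TranscriptAccumulator()
for chunk in ["Hel", "lo, ", None, "world", "."]:
    acc.add(chunk)
print(acc.finalize())  # → Hello, world.
```

In a real session you would call `add` with each partial transcription text as it arrives, and `finalize` when the server signals the turn is complete.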

> Agree! Right now the difference is big enough for me that I'm setting to audio, taking the transcriptions and throwing away the audio, not using the 2.0 model.

Hey @michaellee1! When you do that, you still pay for audio output, right? That's quite a bit more expensive, from what I've seen on the pricing page: https://ai.google.dev/gemini-api/docs/pricing#gemini-2.5-flash-native-audio. I still don't understand why we can't just have text output like with OpenAI models 😕

2010b9 avatar Oct 17 '25 15:10 2010b9