Response modalities don't work with the Gemini 2.5 Live API
Description of the bug:
Setting the response modalities to 'TEXT' does not work; on top of that, the latest 2.5 Live API model does not seem to return any response that includes text.
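A minimal repro sketch (assuming the google-genai Python SDK and a 2.5 native-audio model name; exact method names may differ between SDK versions):

```python
import asyncio
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")

# Assumed 2.5 native-audio live model; requesting TEXT here is what fails.
MODEL = "gemini-2.5-flash-preview-native-audio-dialog"
CONFIG = {"response_modalities": ["TEXT"]}

async def main():
    async with client.aio.live.connect(model=MODEL, config=CONFIG) as session:
        await session.send_client_content(
            turns={"role": "user", "parts": [{"text": "Hello"}]},
            turn_complete=True,
        )
        async for response in session.receive():
            # Expected: text chunks arrive here. Actual: no text comes back.
            if response.text is not None:
                print(response.text, end="")

asyncio.run(main())
```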
Actual vs expected behavior:
Allow the developer to specify "TEXT", "AUDIO", or both for the response modalities.
Any other information you'd like to share?
No response
I've encountered the same problem. Have you managed to solve it?
Having the same issue! It's pretty important to be able to get just text back from this API for what I'm building.
The “native audio” models are designed for voice interactions. If you require only text output, consider using the gemini-2.0-flash-live-001 model.
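For example, a minimal text-only session against that model might look like this (a sketch using the google-genai Python SDK; verify method names against your SDK version):

```python
import asyncio
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")
config = {"response_modalities": ["TEXT"]}

async def main():
    # gemini-2.0-flash-live-001 supports text-only responses over the Live API.
    async with client.aio.live.connect(
        model="gemini-2.0-flash-live-001", config=config
    ) as session:
        await session.send_client_content(
            turns={"role": "user", "parts": [{"text": "Hello"}]},
            turn_complete=True,
        )
        async for response in session.receive():
            if response.text is not None:
                print(response.text, end="")

asyncio.run(main())
```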
Thanks
2.0 vs 2.5 is a big difference in terms of performance; having text out w/ the quality we get from 2.5 is a pretty essential feature. Ideally it does both.
Agree! Right now the difference is big enough for me that I'm setting the modality to audio, taking the transcriptions, and throwing away the audio rather than using the 2.0 model.
Hey @Gunand3043

```python
config = {
    "response_modalities": ["AUDIO"],
    "output_audio_transcription": {},
    "input_audio_transcription": {},
}
```
The transcription is working, but we are receiving the transcript in chunks (partial segments) rather than as a complete, final transcript.
Is there a specific parameter or setting we can use to get the full transcript as a single, finalized output instead of incremental partial transcriptions?
Thanks in advance for your help!
> Is there a specific parameter or setting we can use to get the full transcript as a single, finalized output instead of incremental partial transcriptions?
This is a live model; the goal is to give you answers as fast as possible so it can start talking without waiting for the end of the message. So it has to use streaming, and thus chunks.
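That said, you can assemble the full transcript yourself by buffering the partial segments until the model's turn completes. A sketch (field names assume the google-genai Python SDK's server messages; double-check against your version):

```python
async def collect_full_transcript(session):
    """Buffer partial output-transcription chunks until the turn completes."""
    chunks = []
    async for response in session.receive():
        content = response.server_content
        if content is None:
            continue
        # Each server message may carry a partial transcription segment.
        if content.output_transcription and content.output_transcription.text:
            chunks.append(content.output_transcription.text)
        # turn_complete marks the end of the model's turn.
        if content.turn_complete:
            break
    return "".join(chunks)
```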
> Agree! Right now the difference is big enough for me that I'm setting the modality to audio, taking the transcriptions, and throwing away the audio rather than using the 2.0 model.
Hey @michaellee1! When you do that, you do pay for audio output, right? That's quite a bit more expensive, from what I've seen on the pricing page: https://ai.google.dev/gemini-api/docs/pricing#gemini-2.5-flash-native-audio. I still don't understand why we can't just have text output like with OpenAI models 😕