Two outputs from gemini-2.0-flash
Description of the feature request:
Hello everyone!
I want gemini-2.0-flash to output both text and audio, but when I try to add TEXT to response_modalities I get this error: [ORIGINAL ERROR] generic::invalid_argument: Error in program Instantiation for language; then sent 1007 (invalid frame payload data) Request trace id: 4ad28f357e6c292e, [ORIGINAL ERROR] generic::invalid_argument: Error in program Instantiation for language
What problem are you trying to solve with this feature?
Two outputs for one response.
Any other information you'd like to share?
I could not find any information related to this
Audio output is currently available only to a select few early-access customers. At the moment you can only use the Live API, and it outputs audio only.
I second this feature request! I've been building a Unity game engine plugin for Gemini using the native audio feature. It would be incredibly useful to output both text and audio at the same time, and it would also spare hitting Gemini twice with identical requests.
I suspect the reason it currently outputs only one or the other is that there probably isn't an intermediate text output head on the multimodal LLM, and no intermediate text representation exists when it generates audio output.
If that's the case, and we can't expect matched text and audio output in the future, please let us know: one can nowadays easily wire up a Whisper-type speech-to-text model to get the text from the audio, but of course that adds extra overhead.
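As a stopgap along those lines, the local speech-to-text path might look like the sketch below. The 24 kHz 16-bit mono PCM format matches what the Live API streams back for audio; the Whisper calls at the end are shown only as comments since they need the model weights downloaded, and the helper name is my own.

```python
# Sketch: wrap the Live API's raw audio output so a local speech-to-text
# model (e.g. openai-whisper) can transcribe it. Assumes 16-bit mono PCM
# at 24 kHz, which is what the Live API streams for model audio.
import io
import wave

def pcm_to_wav(pcm_bytes: bytes, sample_rate: int = 24000) -> bytes:
    """Wrap raw 16-bit mono PCM in a WAV container for transcription."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(1)        # mono
        wf.setsampwidth(2)        # 16-bit samples
        wf.setframerate(sample_rate)
        wf.writeframes(pcm_bytes)
    return buf.getvalue()

# Then, with openai-whisper installed:
#   import whisper
#   model = whisper.load_model("base")
#   text = model.transcribe("reply.wav")["text"]
```

This is exactly the "additional overhead" mentioned above: a second model pass just to recover text the server already synthesized from.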
I'll route the feature request to the product team. I also agree that it would be great to get both, but I'm not sure about the feasibility either, considering the model natively outputs audio.
Interesting...
In December, I could only get audio or text from the Live API (Vertex AI mode), not both, though requesting both produced no error...
As of last week up until yesterday, I was getting both text and audio from the live API by specifying response_modalities=['AUDIO', 'TEXT'] in the config and also emphasizing that I want both audio and text in the system instruction.
But then today, I ran the same code and got error 1007 (invalid frame payload data) and "generic::invalid_argument: Only one of text or audio output is allowed."
It's fun to experiment with a rapidly changing tool! But here's one vote for allowing us to continue to be able to get both audio and text. Otherwise, why would response_modalities be a list? 😃 And it was working fine yesterday, after all... 😸
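For anyone comparing notes, the setup that intermittently worked for me was roughly the following (shown as the plain config payload rather than any particular SDK wrapper, and the system instruction wording is just my own):

```python
# Sketch of the Live API config that intermittently returned both
# modalities. Whether the server accepts TEXT alongside AUDIO has changed
# over time; when it doesn't, expect the 1007 close frame quoted above.
config = {
    "response_modalities": ["AUDIO", "TEXT"],  # both requested
    "system_instruction": (
        "Always answer with spoken audio and the matching text."
    ),
}
```

The same payload with `["AUDIO"]` alone has always been accepted, which is what makes the list-typed field so tantalizing.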
We're also looking to implement the Live API to return both the Audio and the Text in the response, and have not had luck specifying both in response_modalities=['AUDIO', 'TEXT'].
+1
Is this working? I'm struggling to get both and to save the transcription when the interview finishes. I got the text in the playground, but the user input keeps overwriting the first message.
Hi, we should have access to audio transcription now that the feature is available. Please take a look.
Thanks
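A minimal sketch of that approach, assuming the `output_audio_transcription` field from the Live API config (shown as the plain payload; check your SDK version for the exact wrapper type): keep AUDIO as the single response modality and let the server transcribe its own speech, rather than requesting TEXT as a second modality.

```python
# Sketch: request audio plus a server-side transcription of that audio,
# instead of a second TEXT modality. Field names follow the Live API
# config; treat them as assumptions if your SDK version differs.
config = {
    "response_modalities": ["AUDIO"],     # audio stays the only modality
    "output_audio_transcription": {},     # transcribe the model's audio
}
```

With this, the transcribed text arrives in the server messages alongside the audio chunks, which covers the original request without the rejected `['AUDIO', 'TEXT']` combination.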
Marking this issue as stale since it has been open for 14 days with no activity. This issue will be closed if no further activity occurs.
This issue was closed because it has been inactive for 27 days. Please post a new issue if you need further assistance. Thanks!