Two outputs from gemini-2.0-flash
Description of the feature request:
Hello everyone!
I want gemini-2.0-flash to output both text and audio, but when I try to add TEXT to response_modalities I get this error: [ORIGINAL ERROR] generic::invalid_argument: Error in program Instantiation for language; then sent 1007 (invalid frame payload data) Request trace id: 4ad28f357e6c292e, [ORIGINAL ERROR] generic::invalid_argument: Error in program Instantiation for language
What problem are you trying to solve with this feature?
Two outputs for one response.
Any other information you'd like to share?
I could not find any information related to this
Audio output is currently available only to a select few early-access customers. At the moment you can only use the Live API, and it outputs audio only.
I second this feature request! I've been building a Unity game engine plugin for Gemini using the native audio feature. It would be incredibly useful to output both text and audio at the same time, and it would also spare hitting Gemini twice with identical requests.
I suspect the reason it currently outputs only one or the other is that there probably isn't an intermediate text output head on the multimodal LLM, and no intermediate text representation exists when it generates audio output.
If that's the case, and we can't expect matched text and audio output in the future, please let us know: one can nowadays easily wire up a Whisper-type speech-to-text model to get the text from the audio, but of course that adds extra overhead.
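As a stopgap along those lines, the local speech-to-text path might look like the sketch below. The 24 kHz 16-bit mono PCM format matches what the Live API streams back for audio; the Whisper calls at the end are shown only as comments since they need the model weights downloaded, and the helper name is my own.

```python
# Sketch: wrap the Live API's raw audio output so a local speech-to-text
# model (e.g. openai-whisper) can transcribe it. Assumes 16-bit mono PCM
# at 24 kHz, which is what the Live API streams for model audio.
import io
import wave

def pcm_to_wav(pcm_bytes: bytes, sample_rate: int = 24000) -> bytes:
    """Wrap raw 16-bit mono PCM in a WAV container for transcription."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(1)        # mono
        wf.setsampwidth(2)        # 16-bit samples
        wf.setframerate(sample_rate)
        wf.writeframes(pcm_bytes)
    return buf.getvalue()

# Then, with openai-whisper installed:
#   import whisper
#   model = whisper.load_model("base")
#   text = model.transcribe("reply.wav")["text"]
```

This is exactly the "additional overhead" mentioned above: a second model pass just to recover text the server already synthesized from.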
I'll route the feature request to the product team. I also agree that it would be great to get both, but I'm not sure about the feasibility either, considering the model natively outputs audio.
Interesting...
In December, I could only get audio or text from the Live API (Vertex AI mode), not both, though requesting both produced no error...
As of last week up until yesterday, I was getting both text and audio from the live API by specifying response_modalities=['AUDIO', 'TEXT'] in the config and also emphasizing that I want both audio and text in the system instruction.
But then today, I ran the same code and got error 1007 (invalid frame payload data) and "generic::invalid_argument: Only one of text or audio output is allowed."
It's fun to experiment with a rapidly changing tool! But here's one vote for allowing us to continue to be able to get both audio and text. Otherwise, why would response_modalities be a list? 😃 And it was working fine yesterday, after all... 😸
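For anyone comparing notes, the setup that intermittently worked for me was roughly the following (shown as the plain config payload rather than any particular SDK wrapper, and the system instruction wording is just my own):

```python
# Sketch of the Live API config that intermittently returned both
# modalities. Whether the server accepts TEXT alongside AUDIO has changed
# over time; when it doesn't, expect the 1007 close frame quoted above.
config = {
    "response_modalities": ["AUDIO", "TEXT"],  # both requested
    "system_instruction": (
        "Always answer with spoken audio and the matching text."
    ),
}
```

The same payload with `["AUDIO"]` alone has always been accepted, which is what makes the list-typed field so tantalizing.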
We're also looking to implement the Live API to return both the Audio and the Text in the response, and have not had luck specifying both in response_modalities=['AUDIO', 'TEXT'].
+1
Is this working? I'm struggling to get both and to save the transcription when the interview finishes. I got the text in the playground, but the user input keeps overwriting the first message.
Hi, we should have access to audio transcription now that the feature is available. Please take a look.
Thanks
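A minimal sketch of that approach, assuming the `output_audio_transcription` field from the Live API config (shown as the plain payload; check your SDK version for the exact wrapper type): keep AUDIO as the single response modality and let the server transcribe its own speech, rather than requesting TEXT as a second modality.

```python
# Sketch: request audio plus a server-side transcription of that audio,
# instead of a second TEXT modality. Field names follow the Live API
# config; treat them as assumptions if your SDK version differs.
config = {
    "response_modalities": ["AUDIO"],     # audio stays the only modality
    "output_audio_transcription": {},     # transcribe the model's audio
}
```

With this, the transcribed text arrives in the server messages alongside the audio chunks, which covers the original request without the rejected `['AUDIO', 'TEXT']` combination.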
Marking this issue as stale since it has been open for 14 days with no activity. This issue will be closed if no further activity occurs.
This issue was closed because it has been inactive for 27 days. Please post a new issue if you need further assistance. Thanks!