I get an error with CONFIG = {"generation_config": {"response_modalities": ["AUDIO","TEXT"]}} in gemini-2/live_api_starter.py
Description of the bug:
From the documentation:
https://ai.google.dev/api/multimodal-live
I believe I can get responses in multiple modalities, but running the above CONFIG in my code I get the following error:
websockets.exceptions.ConnectionClosedError: received 1007 (invalid frame payload data) Request trace id: a0bb7a2dd8834b47, [ORIGINAL ERROR] generic::invalid_argument: Error in program Instantiation for language; then sent 1007 (invalid frame payload data) Request trace id: a0bb7a2dd8834b47, [ORIGINAL ERROR] generic::invalid_argument: Error in program Instantiation for language
Why am I getting this error?
Actual vs expected behavior:
Is it possible to get a response in audio and text simultaneously?
If yes, please help me sort it out.
If no, then for goodness' sake it must be stated clearly in the documentation!
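For reference, the single-modality CONFIG that ships with the starter script connects without problems, so the failure seems specific to requesting two modalities at once. A minimal illustration (assuming the same google-genai setup as live_api_starter.py):

# Works: a single response modality, as in the original starter script.
CONFIG = {"generation_config": {"response_modalities": ["AUDIO"]}}

# Fails with the 1007 error above: two modalities requested at once.
# CONFIG = {"generation_config": {"response_modalities": ["AUDIO", "TEXT"]}}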
Any other information you'd like to share?
I would appreciate it if Google and GCP had properly organized, clear documentation for their services, APIs, and SDKs. Integrating Google's services is such a horrible experience because the documentation is so scattered and vague.
I'm getting the same error. It would be nice to get both text and audio at the same time. This is particularly useful for generating dialogues for things like games...
Hello @abdul7235 and @LarsDu,
At the moment, multiple output modalities are not publicly available. You can only get audio using the live API, and text using the "classic" ones.
I'll see what can be done to make that clear in the documentation.
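Roughly, the two paths look like this. This is only a sketch with the google-genai SDK; the model name and exact call signatures are assumptions based on the SDK at the time of writing, so check your version:

import asyncio
from google import genai

client = genai.Client()  # assumes GOOGLE_API_KEY is set in the environment

# Text via the "classic" API:
response = client.models.generate_content(
    model="gemini-2.0-flash-exp",
    contents="Hello Gemini, tell me about the weather.",
)
print(response.text)

# Audio via the live API:
async def main():
    config = {"generation_config": {"response_modalities": ["AUDIO"]}}
    async with client.aio.live.connect(
        model="gemini-2.0-flash-exp", config=config
    ) as session:
        await session.send(input="Hello Gemini!", end_of_turn=True)
        async for msg in session.receive():
            if msg.data:  # raw PCM audio bytes from the model
                pass  # play or buffer the audio here

asyncio.run(main())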
@Giom-V
Will multimodal output be available in the coming months?
Also, could you guide me on how to gain access to the non-public multimodal API?
Is it possible to get audio output and function calling together? @Giom-V
This happens when you send unsafe prompts.
I'm looking for how to block that. I tried to use this and it doesn't work:
self.safety_settings = [ { "category": "HARM_CATEGORY_DANGEROUS", "threshold": "BLOCK_NONE", }, { "category": "HARM_CATEGORY_HARASSMENT", "threshold": "BLOCK_NONE", }, { "category": "HARM_CATEGORY_HATE_SPEECH", "threshold": "BLOCK_NONE", }, { "category": "HARM_CATEGORY_SEXUALLY_EXPLICIT", "threshold": "BLOCK_NONE", }, { "category": "HARM_CATEGORY_DANGEROUS_CONTENT", "threshold": "BLOCK_NONE", }, { "category": "HARM_CATEGORY_SEXUAL", "threshold": "BLOCK_NONE", } ]
Now this one works for me:
self.safety_settings = [ { "category": "HARM_CATEGORY_HARASSMENT", "threshold": "BLOCK_NONE" }, { "category": "HARM_CATEGORY_SEXUALLY_EXPLICIT", "threshold": "BLOCK_NONE" }, { "category": "HARM_CATEGORY_DANGEROUS_CONTENT", "threshold": "BLOCK_NONE" }, { "category": "HARM_CATEGORY_HATE_SPEECH", "threshold": "BLOCK_NONE" } ],
I am getting this error; any idea why?
Error in Gemini session: received 1007 (invalid frame payload data) Request trace id: 4bdc******bb063, No matching per-key-config for API key_salt: 7448*****67170175; then sent 1007 (invalid frame payload data) Request trace id: 4bdc6000403bb063, No matching per-key-config for API key_salt: 74484*****170175
Will multimodal output be available in the coming months?
Yes it will.
Also, could you guide me on how to gain access to the non-public multimodal API?
@abdul7235, sorry, but this is not a program you can request to join. The only way is to be very active (like the GDEs are), and we will reach out to you.
Is it possible to get audio output and function calling together? @Giom-V
@kshitij01042002 This notebook will show you how to use function calling with the live API: https://github.com/google-gemini/cookbook/blob/main/gemini-2/live_api_tool_use.ipynb
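In short, you declare your tools in the config you pass when connecting. A minimal sketch based on that notebook (the google-genai SDK is assumed, and turn_on_the_lights is just an illustrative declaration):

import asyncio
from google import genai

client = genai.Client()  # assumes GOOGLE_API_KEY is set

turn_on_the_lights = {"name": "turn_on_the_lights"}  # illustrative tool declaration

config = {
    "generation_config": {"response_modalities": ["AUDIO"]},
    "tools": [{"function_declarations": [turn_on_the_lights]}],
}

async def main():
    async with client.aio.live.connect(
        model="gemini-2.0-flash-exp", config=config
    ) as session:
        await session.send(input="Turn on the lights, please.", end_of_turn=True)
        async for msg in session.receive():
            if msg.tool_call:  # the model is asking us to run a function
                print(msg.tool_call.function_calls)

asyncio.run(main())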
I am getting this error; any idea why?
Error in Gemini session: received 1007 (invalid frame payload data) Request trace id: 4bdc******bb063, No matching per-key-config for API key_salt: 7448*****67170175; then sent 1007 (invalid frame payload data) Request trace id: 4bdc6000403bb063, No matching per-key-config for API key_salt: 74484*****170175
@kshitij01042002 I'm guessing you're not using the right API key. It should start with "AIza...", which doesn't seem to be the case for the one you're using.
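A quick sanity check you can drop into your script (purely illustrative):

import os

api_key = os.environ.get("GOOGLE_API_KEY", "")
# API keys generated in Google AI Studio start with "AIza".
if not api_key.startswith("AIza"):
    raise ValueError("This does not look like an AI Studio API key.")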
This happens when you send unsafe prompts. I'm looking for how to block that; I tried to use this and it doesn't work.
@simix I think your mistake is that you were using HARM_CATEGORY_DANGEROUS instead of HARM_CATEGORY_DANGEROUS_CONTENT. Here's the related documentation for reference.
@Giom-V I just need to reconfirm: will I be able to get responses from Gemini in audio + text in the upcoming version?
E.g. if I ask "Hello Gemini, tell me about the weather.", can I get the response in audio and also get what Gemini is speaking as text? I mean I need the same response in both audio and text.
@abdul7235 I don't think it will be possible, as Gemini generates the audio output directly, without using a TTS mechanism. If you want both (and I can see why), I think you'll have to generate text and then use a TTS service to generate the audio.
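For instance, something along these lines, as a sketch (the google-genai SDK and the third-party gTTS package are just example choices here; any TTS service would do):

from google import genai
from gtts import gTTS  # example third-party TTS package

client = genai.Client()  # assumes GOOGLE_API_KEY is set

# 1. Get the text response from Gemini.
response = client.models.generate_content(
    model="gemini-2.0-flash-exp",
    contents="Hello Gemini, tell me about the weather.",
)
text = response.text
print(text)  # the text version of the answer

# 2. Turn the same text into audio in a separate TTS step.
gTTS(text).save("response.mp3")  # audio that matches the text exactly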
With the live API over websockets [1], is there a way to adjust the safety params [2]? I couldn't see it in the source code of the Python lib or in the docs. @Giom-V
[1] https://github.com/google-gemini/cookbook/blob/main/gemini-2/live_api_tool_use.ipynb [2] https://ai.google.dev/api/generate-content#v1beta.HarmCategory
@ArthurG I don't think you can at the moment.
@Giom-V I want to use the Multimodal Live API and Gemini 2.0 at scale in production. Can you please give me a clearer understanding of the roadmap and the timeline for a production-ready version of this project?
@Giom-V I want to use the Multimodal Live API and Gemini 2.0 at scale in production. Can you please give me a clearer understanding of the roadmap and the timeline for a production-ready version of this project?
@abdul7235 that's a good question. However, "Gemini 2.0 Flash is a Preview offering, subject to the "Pre-GA Offerings Terms" of the Google Cloud Service Specific Terms."
You can read more here: https://cloud.google.com/vertex-ai/generative-ai/docs/gemini-v2
@Giom-V I want to use the Multimodal Live API and Gemini 2.0 at scale in production. Can you please give me a clearer understanding of the roadmap and the timeline for a production-ready version of this project?
I'm sorry, but we don't have a public roadmap yet. For now the model is experimental only, as we're still gathering interest like yours.
@Giom-V, but should function calling work with response_modalities: ["AUDIO"]? When I try to do so, the model seems to understand what it should do, but I always get this error:
| File "/home/bva/src/thdevelop/src/thebot/gemini_live_api_v2.py", line 289, in run
| async with (
| File "/usr/lib/python3.12/asyncio/taskgroups.py", line 145, in __aexit__
| raise me from None
| ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
+-+---------------- 1 ----------------
| Traceback (most recent call last):
| File "/home/bva/src/thdevelop/src/thebot/gemini_live_api_v2.py", line 260, in receive_audio
| async for response in turn:
| File "/home/bva/src/thdevelop/src/venv/lib/python3.12/site-packages/google/genai/live.py", line 109, in receive
| while result := await self._receive():
| ^^^^^^^^^^^^^^^^^^^^^
| File "/home/bva/src/thdevelop/src/venv/lib/python3.12/site-packages/google/genai/live.py", line 190, in _receive
| return types.LiveServerMessage._from_response(
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File "/home/bva/src/thdevelop/src/venv/lib/python3.12/site-packages/google/genai/_common.py", line 203, in _from_response
| validated_response = cls.model_validate(response)
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| File "/home/bva/src/thdevelop/src/venv/lib/python3.12/site-packages/pydantic/main.py", line 703, in model_validate
| return cls.__pydantic_validator__.validate_python(
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| pydantic_core._pydantic_core.ValidationError: 1 validation error for LiveServerMessage
| tool_call_cancellation.ids.0
| Input should be a valid integer, unable to parse string as an integer [type=int_parsing, input_value='function-call-4093409504350677558', input_type=str]
| For further information visit https://errors.pydantic.dev/2.11/v/int_parsing
+------------------------------------
I have a simple function declaration:
save_text = {
    "name": "save_text",
    "description": "Save all conversation to text.",
    "parameters": {
        "type": "object",
        "properties": {
            "text": {
                "type": "string",
                "description": "Text representation of audio conversation.",
            },
        },
        "required": ["text"],
    },
}
and I am asking the model to save our conversation to a file. Should this work or not with AUDIO only?