
I get Error with CONFIG = {"generation_config": {"response_modalities": ["AUDIO","TEXT"]}} in gemini-2/live_api_starter.py

Open abdul7235 opened this issue 11 months ago • 19 comments

Description of the bug:

From the documentation:

https://ai.google.dev/api/multimodal-live

I believe I can get responses in multiple modalities, but when I run the above CONFIG in my code I get the following error:

websockets.exceptions.ConnectionClosedError: received 1007 (invalid frame payload data) Request trace id: a0bb7a2dd8834b47, [ORIGINAL ERROR] generic::invalid_argument: Error in program Instantiation for language; then sent 1007 (invalid frame payload data) Request trace id: a0bb7a2dd8834b47, [ORIGINAL ERROR] generic::invalid_argument: Error in program Instantiation for language

Why am I getting this error?

Actual vs expected behavior:

Is it possible to get the response in audio and text simultaneously?

If yes, please help me sort it out.

If not, then for goodness' sake it should be stated clearly in the documentation!

Any other information you'd like to share?

I would appreciate it if Google and GCP had properly organized and clear documentation for their services, APIs, and SDKs. Integrating Google's services is such a horrible experience because the documentation is so scattered and vague.

abdul7235 avatar Dec 27 '24 14:12 abdul7235

I'm getting the same error. It would be nice to get both text and audio at the same time. This is particularly useful for generating dialogues for things like games...

LarsDu avatar Dec 29 '24 22:12 LarsDu

Hello @abdul7235 and @LarsDu,

At the moment, multimodal output is not available publicly. You can only get audio using the live APIs, and text using the "classic" ones.

I'll see what can be done to make that clear in the documentation.
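For reference, a minimal sketch of a config that the starter script accepts today (a single modality; the shape mirrors the CONFIG quoted in the title and may change while the API is experimental):

# Only one response modality works at the moment; asking for both
# "AUDIO" and "TEXT" is what triggers the 1007 invalid_argument error above.
CONFIG = {"generation_config": {"response_modalities": ["AUDIO"]}}

# To get text instead, swap the modality rather than adding a second one:
# CONFIG = {"generation_config": {"response_modalities": ["TEXT"]}}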

Giom-V avatar Dec 30 '24 14:12 Giom-V

@Giom-V

Will multimodal output be available in the coming months?

Also, could you guide me on how to gain access to the non-public multimodal API?

abdul7235 avatar Dec 30 '24 16:12 abdul7235

Is it possible to get the Audio and function calling? @Giom-V

kshitij01042002 avatar Jan 03 '25 11:01 kshitij01042002

This happens when you send unsafe prompts. I'm looking for a way to block it; I tried using this and it doesn't work:

self.safety_settings = [
    {"category": "HARM_CATEGORY_DANGEROUS", "threshold": "BLOCK_NONE"},
    {"category": "HARM_CATEGORY_HARASSMENT", "threshold": "BLOCK_NONE"},
    {"category": "HARM_CATEGORY_HATE_SPEECH", "threshold": "BLOCK_NONE"},
    {"category": "HARM_CATEGORY_SEXUALLY_EXPLICIT", "threshold": "BLOCK_NONE"},
    {"category": "HARM_CATEGORY_DANGEROUS_CONTENT", "threshold": "BLOCK_NONE"},
    {"category": "HARM_CATEGORY_SEXUAL", "threshold": "BLOCK_NONE"},
]

simix avatar Jan 04 '25 11:01 simix

Now this one works for me:

self.safety_settings = [
    {"category": "HARM_CATEGORY_HARASSMENT", "threshold": "BLOCK_NONE"},
    {"category": "HARM_CATEGORY_SEXUALLY_EXPLICIT", "threshold": "BLOCK_NONE"},
    {"category": "HARM_CATEGORY_DANGEROUS_CONTENT", "threshold": "BLOCK_NONE"},
    {"category": "HARM_CATEGORY_HATE_SPEECH", "threshold": "BLOCK_NONE"},
]

simix avatar Jan 04 '25 11:01 simix

I am getting this error, any idea why I might be getting this?

Error in Gemini session: received 1007 (invalid frame payload data) Request trace id: 4bdc******bb063, No matching per-key-config for API key_salt: 7448*****67170175; then sent 1007 (invalid frame payload data) Request trace id: 4bdc6000403bb063, No matching per-key-config for API key_salt: 74484*****170175

kshitij01042002 avatar Jan 06 '25 05:01 kshitij01042002

Will multimodal output be available in the coming months?

Yes it will.

Also, could you guide me on how to gain access to the non-public multimodal API?

@abdul7235, sorry, but this is not a program you can request to join. The only way is to be very active (like GDEs are) and we will reach out to you.

Giom-V avatar Jan 06 '25 13:01 Giom-V

Is it possible to get the Audio and function calling? @Giom-V

@kshitij01042002 This notebook will show you how to use function calling with the live API: https://github.com/google-gemini/cookbook/blob/main/gemini-2/live_api_tool_use.ipynb
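As a rough sketch of what that notebook does (the names here are illustrative and the exact config shape can differ between SDK versions), function declarations go into the live config under "tools", alongside an audio-only response modality:

# Hypothetical function declaration, for illustration only.
get_weather = {
    "name": "get_weather",
    "description": "Return the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

# Audio output and function calling can be combined in the setup config.
CONFIG = {
    "generation_config": {"response_modalities": ["AUDIO"]},
    "tools": [{"function_declarations": [get_weather]}],
}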

Giom-V avatar Jan 06 '25 13:01 Giom-V

I am getting this error, any idea why I might be getting this?

Error in Gemini session: received 1007 (invalid frame payload data) Request trace id: 4bdc******bb063, No matching per-key-config for API key_salt: 7448*****67170175; then sent 1007 (invalid frame payload data) Request trace id: 4bdc6000403bb063, No matching per-key-config for API key_salt: 74484*****170175

@kshitij01042002 I'm guessing you're not using the right API key. It should start with "AIza...", which doesn't seem to be the case for the one you're using.

You need to generate it on AI Studio as documented here.
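For reference, a minimal sketch of wiring up such a key with the google-genai SDK (the environment-variable name is just a common convention, not something the SDK requires):

import os
from google import genai

# The AI Studio key (it should start with "AIza...") is read from an
# environment variable and passed to the client explicitly.
client = genai.Client(api_key=os.environ["GOOGLE_API_KEY"])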

Giom-V avatar Jan 06 '25 13:01 Giom-V

This happens when you send unsafe prompts. I'm looking for a way to block it; I tried using this and it doesn't work

@simix I think your mistake is that you were using HARM_CATEGORY_DANGEROUS instead of HARM_CATEGORY_DANGEROUS_CONTENT. Here's the related documentation for reference.
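For reference, a sketch of how those settings are typically passed on the "classic" generate_content path with the google-genai SDK (the model name and prompt are placeholders; as noted further down in this thread, the live API itself doesn't expose safety settings yet):

from google import genai
from google.genai import types

client = genai.Client()  # assumes the API key is set in the environment

response = client.models.generate_content(
    model="gemini-2.0-flash-exp",
    contents="Tell me a story.",
    config=types.GenerateContentConfig(
        safety_settings=[
            # Note the _CONTENT suffix; HARM_CATEGORY_DANGEROUS alone is rejected.
            types.SafetySetting(
                category="HARM_CATEGORY_DANGEROUS_CONTENT",
                threshold="BLOCK_NONE",
            ),
        ]
    ),
)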

Giom-V avatar Jan 06 '25 13:01 Giom-V

@Giom-V I just need to reconfirm: will I be able to get responses from Gemini in audio + text in the upcoming version?

E.g., if I ask "Hello Gemini, tell me about the weather," can I get the response in audio and also get what Gemini is speaking as text? I need the same response in both audio and text.

abdul7235 avatar Jan 15 '25 05:01 abdul7235

@abdul7235 I don't think it will be possible, as Gemini generates the audio output directly, without using a TTS mechanism. If you want both (and I can see why), I think you'll have to generate the text and then use a TTS service to generate the audio.
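A rough sketch of that workaround (the TTS step is a placeholder for whatever service you use; the point is that the audio is synthesized from the same string you already have as text):

from google import genai

def synthesize_speech(text: str) -> bytes:
    # Placeholder for a real TTS call (e.g. Google Cloud Text-to-Speech);
    # this is not part of the Gemini SDK.
    raise NotImplementedError

client = genai.Client()  # assumes the API key is set in the environment

# 1) Get the answer as text from the "classic" API.
text = client.models.generate_content(
    model="gemini-2.0-flash-exp",
    contents="Hello Gemini, tell me about the weather.",
).text

# 2) Feed the exact same text to a TTS service to get the audio.
audio = synthesize_speech(text)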

Giom-V avatar Jan 15 '25 09:01 Giom-V

With the live API on websockets [1], is there a way to adjust the safety params [2]? I couldn't see it in the source code of the Python lib or the docs. @Giom-V

[1] https://github.com/google-gemini/cookbook/blob/main/gemini-2/live_api_tool_use.ipynb [2] https://ai.google.dev/api/generate-content#v1beta.HarmCategory

ArthurG avatar Jan 17 '25 22:01 ArthurG

@ArthurG I don't think you can at the moment.

Giom-V avatar Jan 20 '25 13:01 Giom-V

@Giom-V I want to use the Multimodal Live API and Gemini 2.0 at scale in production. Can you please give me a clearer understanding of the roadmap and timeline for a production-ready version of this project?

abdul7235 avatar Jan 29 '25 06:01 abdul7235

@Giom-V I want to use the Multimodal Live API and Gemini 2.0 at scale in production. Can you please give me a clearer understanding of the roadmap and timeline for a production-ready version of this project?

@abdul7235, that's a good question. However, "Gemini 2.0 Flash is a Preview offering, subject to the "Pre-GA Offerings Terms" of the Google Cloud Service Specific Terms."

You can read more here: https://cloud.google.com/vertex-ai/generative-ai/docs/gemini-v2

hansdaapurba avatar Jan 29 '25 07:01 hansdaapurba

@Giom-V I want to use the Multimodal Live API and Gemini 2.0 at scale in production. Can you please give me a clearer understanding of the roadmap and timeline for a production-ready version of this project?

I'm sorry, but we don't have a public roadmap yet. For now the model is experimental only, as we're still gathering interest like yours.

Giom-V avatar Jan 30 '25 18:01 Giom-V

@Giom-V, but should function calling work with response_modalities: [AUDIO]? When I try to do so, it seems the model understands what it should do, but I always get this error:

  |   File "/home/bva/src/thdevelop/src/thebot/gemini_live_api_v2.py", line 289, in run
  |     async with (
  |   File "/usr/lib/python3.12/asyncio/taskgroups.py", line 145, in __aexit__
  |     raise me from None
  | ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
  +-+---------------- 1 ----------------
    | Traceback (most recent call last):
    |   File "/home/bva/src/thdevelop/src/thebot/gemini_live_api_v2.py", line 260, in receive_audio
    |     async for response in turn:
    |   File "/home/bva/src/thdevelop/src/venv/lib/python3.12/site-packages/google/genai/live.py", line 109, in receive
    |     while result := await self._receive():
    |                     ^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/bva/src/thdevelop/src/venv/lib/python3.12/site-packages/google/genai/live.py", line 190, in _receive
    |     return types.LiveServerMessage._from_response(
    |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/bva/src/thdevelop/src/venv/lib/python3.12/site-packages/google/genai/_common.py", line 203, in _from_response
    |     validated_response = cls.model_validate(response)
    |                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/home/bva/src/thdevelop/src/venv/lib/python3.12/site-packages/pydantic/main.py", line 703, in model_validate
    |     return cls.__pydantic_validator__.validate_python(
    |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    | pydantic_core._pydantic_core.ValidationError: 1 validation error for LiveServerMessage
    | tool_call_cancellation.ids.0
    |   Input should be a valid integer, unable to parse string as an integer [type=int_parsing, input_value='function-call-4093409504350677558', input_type=str]
    |     For further information visit https://errors.pydantic.dev/2.11/v/int_parsing
    +------------------------------------

I have a simple function declaration:

save_text = {
    "name": "save_text",
    "description": "Save all conversation to text.",
    "parameters": {
        "type": "object",
        "properties": {
            "text": {
                "type": "string",
                "description": "Text representation of audio conversation.",
            },
        },
        "required": ["text"],
    }
}

and I am asking the model to save our conversation to a file. Should this work or not with AUDIO only?
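For context, the declaration is wired into the live config roughly like this (shape as in the cookbook examples; the exact form may differ between SDK versions):

CONFIG = {
    "generation_config": {"response_modalities": ["AUDIO"]},
    "tools": [{"function_declarations": [save_text]}],
}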

vitalek84 avatar Apr 12 '25 16:04 vitalek84