Get_started_LiveAPI.py example stops understanding images
Description of the bug:
I was using Get_started_LiveAPI.py to play with the Live API, but this week I suddenly noticed it no longer understands images. To make sure the images were being properly captured from my webcam, I added a step that writes each frame to disk before putting it in the queue to be sent to the model, but it still doesn't work and the model says it doesn't see anything.
Message: "do you see my camera?" Response: "As a large language model, I don't have a physical body or the ability to interact with the physical world. Therefore, I cannot see your camera. I exist only as computer code."
Actual vs expected behavior:
No response
Any other information you'd like to share?
No response
OK, I found one important piece of information: the code works when I use CONFIG = {"response_modalities": ["AUDIO"]}, but it stops working (the model says it doesn't see anything) after changing response_modalities to TEXT. Why is that?
I don't understand why, so please tell me what is going on.
To replicate: in Get_started_LiveAPI.py, change CONFIG = {"response_modalities": ["AUDIO"]} to CONFIG = {"response_modalities": ["TEXT"]}, run the code, and ask "what do you see" or "describe the scene". It often comes back with: "As a large language model, I don't have a physical body or the ability to interact with the physical world. Therefore, I cannot see your camera. I exist only as computer code."
Also experiencing this; any updates? This is a blocker for the use case we're interested in (image + audio input, text output).
I think it's related to rate limiting, since it works sometimes and not other times.
Hey @rezacopol, by default the model only "sees" video frames that are paired with speech/audio input. To send all video frames along with text, you can set turn_coverage to TURN_INCLUDES_ALL_INPUT, and only send video frames from the client when text is sent.
Note: this will fill up the context window much faster and increase costs, but it should allow the model to "see" video frames with text.
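A minimal sketch of that change, assuming the dict-style CONFIG used in Get_started_LiveAPI.py; the realtime_input_config / turn_coverage field names here follow the google-genai SDK's LiveConnectConfig and RealtimeInputConfig types, so verify them against the SDK version you have installed:

```python
# Sketch: extend the example's CONFIG so that text-only turns still include
# the video frames streamed before them, rather than only frames paired with
# detected speech. Field names assume the google-genai Live API config schema.
CONFIG = {
    "response_modalities": ["TEXT"],
    "realtime_input_config": {
        # Include ALL realtime input (video frames as well as audio) in the
        # turn, instead of the default speech-activity-only coverage.
        "turn_coverage": "TURN_INCLUDES_ALL_INPUT",
    },
}
```

This CONFIG is then passed unchanged to client.aio.live.connect(...) as in the original example; remember to throttle frame sending on the client side, since every frame now counts against the context window.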
Marking this issue as stale since it has been open for 14 days with no activity. This issue will be closed if no further activity occurs.
This issue was closed because it has been inactive for 27 days. Please post a new issue if you need further assistance. Thanks!
Same issue