Get_started_LiveAPI.py example stops understanding images
Description of the bug:
I was using Get_started_LiveAPI.py to play with the Live API, but this week I suddenly noticed it no longer understands images. To make sure the images were being properly captured from my webcam, I added a step that writes each frame to disk before putting it in the queue to be sent to the model, but it still doesn't work and the model says it doesn't see anything.
Message: "do you see my camera?" Response: "As a large language model, I don't have a physical body or the ability to interact with the physical world. Therefore, I cannot see your camera. I exist only as computer code."
Actual vs expected behavior:
No response
Any other information you'd like to share?
No response
OK, I found one important piece of information: the code works when I use CONFIG = {"response_modalities": ["AUDIO"]}, but it stops working (the model says it doesn't see anything) after changing response_modalities to TEXT. Why is that?
I don't understand why, so please tell me what is going on.
To replicate: in Get_started_LiveAPI.py, change CONFIG = {"response_modalities": ["AUDIO"]} to CONFIG = {"response_modalities": ["TEXT"]}, run the code, and ask "what do you see" or "describe the scene". It often comes back with: "As a large language model, I don't have a physical body or the ability to interact with the physical world. Therefore, I cannot see your camera. I exist only as computer code."
Also experiencing this; any updates? This is a blocker for the use case we're interested in (image + audio input, text output).
I think it's related to rate limiting, since it works sometimes and not other times.
Hey @rezacopol, by default the model only "sees" video frames that are paired with speech/audio input. To send all video frames along with text, you can set turn_coverage to TURN_INCLUDES_ALL_INPUT, and only send video frames from the client when text is sent.
Note: this will fill up the context window much faster and increase costs, but it should allow the model to "see" video frames with text.
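A minimal sketch of that change, assuming the dict-style CONFIG used in Get_started_LiveAPI.py; the realtime_input_config / turn_coverage field names here follow the google-genai SDK's LiveConnectConfig and RealtimeInputConfig types, so verify them against the SDK version you have installed:

```python
# Sketch: extend the example's CONFIG so that text-only turns still include
# the video frames streamed before them, rather than only frames paired with
# detected speech. Field names assume the google-genai Live API config schema.
CONFIG = {
    "response_modalities": ["TEXT"],
    "realtime_input_config": {
        # Include ALL realtime input (video frames as well as audio) in the
        # turn, instead of the default speech-activity-only coverage.
        "turn_coverage": "TURN_INCLUDES_ALL_INPUT",
    },
}
```

This CONFIG is then passed unchanged to client.aio.live.connect(...) as in the original example; remember to throttle frame sending on the client side, since every frame now counts against the context window.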
Marking this issue as stale since it has been open for 14 days with no activity. This issue will be closed if no further activity occurs.
This issue was closed because it has been inactive for 27 days. Please post a new issue if you need further assistance. Thanks!
Same issue