
Gemini 2.0 Live API does not respond for certain audios

ArthurG opened this issue 10 months ago • 4 comments

Description of the bug:

Hi there,

I'm trying to run the Gemini 2.0 Live API.

For certain audio files, I call the send_audio_file function, but the subsequent receive_audio call never returns any audio over the websocket connection. This only happens with certain audio files.

Another thing I've been trying to tune is the system prompt, which I believe may be causing this behaviour. When I include a system_instruction in the setup message, the websocket closes. Here's the setup message I'm sending:

    async def startup(self):
        setup_msg = {
            "setup": {
                "model": f"models/{model}",
                "generation_config": {"response_modalities": "AUDIO"},
            }
        }
        await self.ws.send_str(json.dumps(setup_msg))
        # Wait for the server's setupComplete acknowledgement.
        return await self.ws.__anext__()
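
For reference, the variant with a system prompt looks roughly like the sketch below. The system_instruction shape (a Content object with text parts) follows the BidiGenerateContentSetup schema, and the prompt text is just a placeholder, so double-check both against your API version.

    # Sketch of the failing variant: the same setup message plus a system
    # prompt. The system_instruction shape is based on the
    # BidiGenerateContentSetup schema -- verify against current docs.
    setup_msg = {
        "setup": {
            "model": f"models/{model}",
            "generation_config": {"response_modalities": "AUDIO"},
            "system_instruction": {
                "parts": [{"text": "You are a helpful voice assistant."}]
            },
        }
    }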

Any ideas on how to resolve would be helpful!

Actual vs expected behavior:

No response

Any other information you'd like to share?

Here's the code I'm using

    async def startup(self):
        setup_msg = {
            "setup": {
                "model": f"models/{model}",
                "generation_config": {"response_modalities": "AUDIO"},
            }
        }
        await self.ws.send_str(json.dumps(setup_msg))
        # Wait for the server's setupComplete acknowledgement.
        return await self.ws.__anext__()

    async def send_text(self, text: str):
        """
        Prompts user for text, then also prompts for an audio filename.
        Sends the user's text as a user turn, and the file's audio content as a chunked upload.
        """
        # Send text
        msg = {
            "client_content": {
                "turn_complete": False,  # We'll mark it complete after we send audio
                "turns": [{"role": "user", "parts": [{"text": text}]}],
            }
        }
        await self.ws.send_str(json.dumps(msg))
        # Now indicate turn_complete
        turn_complete_msg = {
            "client_content": {
                "turn_complete": True,
            }
        }
        await self.ws.send_str(json.dumps(turn_complete_msg))
        """
        """

    async def send_audio_file(self, filename: str):
        """
        Reads the given audio file and splits it into CHUNK_SIZE segments.
        Sends each segment as a media chunk to the server.
        Adjust the chunking approach as you see fit.
        """
        try:
            with open(filename, "rb") as f:
                while True:
                    data = f.read(CHUNK_SIZE)
                    if not data:
                        break
                    msg = {
                        "realtime_input": {
                            "media_chunks": [
                                {
                                    "data": base64.b64encode(data).decode(),
                                    # Live API expects 16-bit mono PCM at 16 kHz
                                    "mime_type": "audio/pcm",
                                }
                            ]
                        }
                    }
                    await self.ws.send_str(json.dumps(msg))

            turn_complete_msg = {
                "client_content": {
                    "turn_complete": True,
                }
            }
            await self.ws.send_str(json.dumps(turn_complete_msg))
        except FileNotFoundError:
            print(f"Could not find file: {filename}")
        except Exception as e:
            print(f"Error reading file {filename}: {e}")

    async def receive_audio(self, wav_file_path: str = ""):
        """
        Reads from the websocket and extracts PCM data from the model,
        placing it in audio_in_queue to be played locally.
        """

        output_audio_byte = ""

        async for raw_response in self.ws:

            response = raw_response.json()
            # Extract audio data
            try:
                b64data = response["serverContent"]["modelTurn"]["parts"][0][
                    "inlineData"
                ]["data"]
            except KeyError:
                pass
            else:
                # Decode each chunk separately; concatenating the base64
                # strings first can corrupt the stream if a chunk is padded.
                output_audio += base64.b64decode(b64data)

            # Turn complete
            try:
                turn_complete = response["serverContent"]["turnComplete"]
            except KeyError:
                pass
            else:
                if turn_complete:
                    if len(output_audio) == 0:
                        continue
                    # Live API output is 16-bit mono PCM at 24 kHz.
                    with wave.open(wav_file_path, "wb") as wav_file:
                        wav_file.setnchannels(1)  # Mono
                        wav_file.setsampwidth(2)  # 16-bit
                        wav_file.setframerate(24000)  # Sample rate
                        wav_file.writeframes(output_audio)
                    return

ArthurG avatar Jan 17 '25 22:01 ArthurG

From a quick look, I don't see any obvious errors in that code. Are you able to share the whole script? I see some similarity with gemini-2.0/websockets/live_starter.py but it looks different enough that it'd help to have the whole script.

You also mentioned "This occurs in only certain audios" - are you able to share the audio files too?

markmcd avatar Jan 21 '25 07:01 markmcd

@markmcd - I've found that what's happening is that the VAD / endpoint detection is not triggering properly for those audios. The model appears to want more audio before giving a response. Do you have any idea how to pass an explicit flag to force the model to respond? I have tried the turn_complete flag and it does not help:

            turn_complete_msg = {
                "client_content": {
                    "turn_complete": True,
                }
            }
            await self.ws.send_str(json.dumps(turn_complete_msg))

ArthurG avatar Jan 22 '25 00:01 ArthurG

Ah interesting - turn_complete is the way to be explicit about it. Can you provide any audio? Or if that's not possible, can you describe what it's like (e.g. the human says "hello", then 30 seconds of silence)?

markmcd avatar Feb 05 '25 05:02 markmcd

@markmcd - I uploaded the file here.

https://limewire.com/d/08fd05d0-5101-4468-bfa1-8676d9b162ec#PRQH35I8Vt2NL1nU3plDOTKq-tAXK5Dly44r7wIF-ac

The gist is the human says "for instance one time it uh it had been raining several days, and this one kid he gave me his last pair of dry socks. put them in my pocket"

ArthurG avatar Feb 05 '25 18:02 ArthurG

Hi,

Nit: It's best not to mix realtime_input and client_content.

Sending the audio through realtime_input relies on the built-in Voice Activity Detection (VAD).

If you're sending an audio file, there may not be enough silence at the end of the file to trigger a response.
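
If you do stick with automatic VAD, one workaround that follows from this is to pad the end of the file with silence so the detector can find an endpoint. A rough sketch, assuming the input is raw 16-bit mono PCM at 16 kHz:

    # Sketch: append ~1s of silence so automatic VAD can detect an endpoint.
    # Assumes raw 16-bit mono PCM at 16 kHz (2 bytes per sample).
    silence = b"\x00" * (16000 * 2)  # 1 second of silence
    with open(filename, "rb") as f:
        audio = f.read() + silence
    # ...then chunk `audio` and send it via realtime_input as before.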

Probably the best thing to do here, from https://github.com/google-gemini/cookbook/issues/795, is to switch VAD to manual mode:

https://colab.sandbox.google.com/drive/1vEMe7UcErgu-FaqSiMum9UN-n8aqhIXR#scrollTo=SZuJOctTWGzu
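
For reference, manual VAD over the raw websocket looks roughly like the sketch below: disable automatic detection in the setup, then bracket the audio with explicit activity markers. The field names follow the v1beta BidiGenerateContent schema, so double-check them against the current docs.

    # Sketch: manual VAD over the raw websocket. Field names follow the
    # v1beta BidiGenerateContent schema -- verify against current docs.
    setup_msg = {
        "setup": {
            "model": f"models/{model}",
            "generation_config": {"response_modalities": "AUDIO"},
            # Turn off automatic VAD so the client controls endpointing.
            "realtime_input_config": {
                "automatic_activity_detection": {"disabled": True}
            },
        }
    }
    await self.ws.send_str(json.dumps(setup_msg))

    # Mark the start of user activity, stream the chunks, then mark the end.
    await self.ws.send_str(json.dumps({"realtime_input": {"activity_start": {}}}))
    # ...send media_chunks as before...
    await self.ws.send_str(json.dumps({"realtime_input": {"activity_end": {}}}))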

MarkDaoust avatar Jun 26 '25 21:06 MarkDaoust