Gemini 2.0 Live API does not respond for certain audios
Description of the bug:
Hi there,
I'm trying to run the Gemini 2.0 Live API
For certain audio files, I call the send_audio_file function and then call receive_audio, but the receive request never returns any audio over the websocket connection. This happens only with certain audio files.
Another thing I've been trying to tune is the system prompt, which I suspect may be causing this behaviour. When I include system_instruction in the setup message, the websocket closes. Here's the setup message I'm sending:
async def startup(self):
    setup_msg = {
        "setup": {
            "model": f"models/{model}",
            "generation_config": {"response_modalities": "AUDIO"},
        }
    }
    me = await self.ws.send_str(json.dumps(setup_msg))
    return await self.ws.__anext__()
Any ideas on how to resolve would be helpful!
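For reference, here is a sketch of what a setup message carrying a system instruction typically looks like (the model name is assumed for illustration, and note that in the Live API schema system_instruction is a Content object — a dict with "parts" — rather than a bare string, and response_modalities is a list; passing the wrong shape here is a common reason for the server closing the websocket during setup):

```python
import json

# Hypothetical sketch of a Live API setup message with a system instruction.
# "gemini-2.0-flash-exp" is an assumed model name for illustration.
model = "gemini-2.0-flash-exp"

setup_msg = {
    "setup": {
        "model": f"models/{model}",
        # response_modalities is a list of strings, not a single string
        "generation_config": {"response_modalities": ["AUDIO"]},
        # system_instruction is a Content object: {"parts": [{"text": ...}]}
        "system_instruction": {
            "parts": [{"text": "You are a helpful voice assistant."}]
        },
    }
}

payload = json.dumps(setup_msg)
```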
Actual vs expected behavior:
No response
Any other information you'd like to share?
Here's the code I'm using:
async def startup(self):
    setup_msg = {
        "setup": {
            "model": f"models/{model}",
            "generation_config": {"response_modalities": "AUDIO"},
        }
    }
    me = await self.ws.send_str(json.dumps(setup_msg))
    return await self.ws.__anext__()

async def send_text(self, text: str):
    """
    Prompts the user for text, then also prompts for an audio filename.
    Sends the user's text as a user turn, and the file's audio content
    as a chunked upload.
    """
    # Send text
    msg = {
        "client_content": {
            "turn_complete": False,  # We'll mark it complete after we send audio
            "turns": [{"role": "user", "parts": [{"text": text}]}],
        }
    }
    await self.ws.send_str(json.dumps(msg))
    # Now indicate turn_complete
    turn_complete_msg = {
        "client_content": {
            "turn_complete": True,
        }
    }
    await self.ws.send_str(json.dumps(turn_complete_msg))

async def send_audio_file(self, filename: str):
    """
    Reads the given audio file and splits it into CHUNK_SIZE segments.
    Sends each segment as a media chunk to the server.
    Adjust the chunking approach as you see fit.
    """
    try:
        with open(filename, "rb") as f:
            while True:
                data = f.read(CHUNK_SIZE)
                if not data:
                    break
                msg = {
                    "realtime_input": {
                        "media_chunks": [
                            {
                                "data": base64.b64encode(data).decode(),
                                "mime_type": "audio/pcm",
                            }
                        ]
                    }
                }
                await self.ws.send_str(json.dumps(msg))
        # Signal the end of the turn once all chunks have been sent
        turn_complete_msg = {
            "client_content": {
                "turn_complete": True,
            }
        }
        await self.ws.send_str(json.dumps(turn_complete_msg))
    except FileNotFoundError:
        print(f"Could not find file: {filename}")
    except Exception as e:
        print(f"Error reading file {filename}: {e}")

async def receive_audio(self, wav_file_path: str = ""):
    """
    Reads from the websocket and extracts PCM data from the model,
    placing it in audio_in_queue to be played locally.
    """
    output_audio_b64 = ""
    async for raw_response in self.ws:
        response = raw_response.json()
        # Extract audio data
        try:
            b64data = response["serverContent"]["modelTurn"]["parts"][0][
                "inlineData"
            ]["data"]
        except KeyError:
            pass
        else:
            output_audio_b64 += b64data
        # Turn complete
        try:
            turn_complete = response["serverContent"]["turnComplete"]
        except KeyError:
            pass
        else:
            if turn_complete:
                audio_chunk = base64.b64decode(output_audio_b64)
                if len(audio_chunk) == 0:
                    continue
                with wave.open(wav_file_path, "wb") as wav_file:
                    wav_file.setnchannels(1)      # Mono
                    wav_file.setsampwidth(2)      # 16-bit
                    wav_file.setframerate(24000)  # Output sample rate
                    wav_file.writeframes(audio_chunk)
                return
From a quick look, I don't see any obvious errors in that code. Are you able to share the whole script? I see some similarity with gemini-2.0/websockets/live_starter.py but it looks different enough that it'd help to have the whole script.
You also mentioned "This occurs in only certain audios" - are you able to share the audio files too?
@markmcd - I've found that what's happening is that the VAD / endpoint detection is not triggering properly for those audios. The model appears to want more audio before giving a response. Do you have any idea how to pass an explicit flag to force the model to respond? I have tried the turn_complete flag and it does not help:
turn_complete_msg = {
    "client_content": {
        "turn_complete": True,
    }
}
await self.ws.send_str(json.dumps(turn_complete_msg))
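One thing that might be worth checking (an assumption based on the Live API message schema, not something confirmed in this thread): client_content.turn_complete closes a client content turn, but the streamed audio travels over the separate realtime_input channel, which in some Live API versions supports an audio_stream_end flag to signal that the audio stream has paused:

```python
import json

# Hypothetical sketch: signalling the end of the realtime audio stream.
# audio_stream_end appears in the BidiGenerateContentRealtimeInput message
# in some Live API versions -- verify against the current docs before use.
audio_stream_end_msg = {"realtime_input": {"audio_stream_end": True}}

payload = json.dumps(audio_stream_end_msg)
```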
Ah interesting - turn_complete is the way to be explicit about it. Can you provide any audio? Or if that's not possible, can you describe what it's like (e.g. human says "hello", then 30 seconds of silence)?
@markmcd - I uploaded the file here.
https://limewire.com/d/08fd05d0-5101-4468-bfa1-8676d9b162ec#PRQH35I8Vt2NL1nU3plDOTKq-tAXK5Dly44r7wIF-ac
The gist is human says "for instance one time it uh it had been raining several days, and this one kid he gave me his last pair of dry socks. put them in my pocket"
Hi,
Nit: it's best not to mix realtime_input and client_content.
Sending the audio through realtime_input relies on the built-in Voice Activity Detection (VAD).
If you're sending an audio file, there may not be enough silence at the end of the file to trigger a response.
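If trailing silence is the issue, one quick workaround (a sketch under the assumption that the input is 16 kHz, 16-bit mono PCM, which is what the Live API typically expects for input audio) is to append a couple of seconds of digital silence before chunking the file:

```python
# Sketch: pad raw PCM audio with trailing silence so the built-in VAD
# sees the utterance as finished. Assumes 16 kHz, 16-bit mono input.
SAMPLE_RATE = 16000   # samples per second
BYTES_PER_SAMPLE = 2  # 16-bit PCM

def pad_with_silence(pcm_bytes: bytes, seconds: float = 2.0) -> bytes:
    """Append `seconds` of zero-valued samples to raw PCM audio."""
    silence = b"\x00" * int(SAMPLE_RATE * BYTES_PER_SAMPLE * seconds)
    return pcm_bytes + silence
```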
Probably the best thing to do here, from https://github.com/google-gemini/cookbook/issues/795, is to switch VAD to manual mode:
https://colab.sandbox.google.com/drive/1vEMe7UcErgu-FaqSiMum9UN-n8aqhIXR#scrollTo=SZuJOctTWGzu
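For readers without Colab access, the manual-VAD flow looks roughly like this (field names follow the Live API's BidiGenerateContentSetup and BidiGenerateContentRealtimeInput messages; the model name is an assumption, and the exact schema should be checked against the current docs):

```python
# Sketch of manual VAD: disable automatic activity detection in setup,
# then bracket the audio chunks with explicit activity markers.

# 1. Setup message with automatic activity detection disabled.
setup_msg = {
    "setup": {
        "model": "models/gemini-2.0-flash-exp",  # assumed model name
        "generation_config": {"response_modalities": ["AUDIO"]},
        "realtime_input_config": {
            "automatic_activity_detection": {"disabled": True}
        },
    }
}

# 2. Mark the start of user activity, send the media_chunks messages,
#    then mark the end -- the end marker is what prompts a response.
activity_start_msg = {"realtime_input": {"activity_start": {}}}
# ... send {"realtime_input": {"media_chunks": [...]}} messages here ...
activity_end_msg = {"realtime_input": {"activity_end": {}}}
```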