Silence AudioFrame at the end of a sentence triggers "Bot Started Speaking"
pipecat version
0.0.78
Python version
3.13
Operating System
Ubuntu
Use Case Description
I was debugging why a short user answer, given right after the bot asked a question, was not processed. I discovered that the STT data was unexpectedly discarded because the bot was supposedly still in the "speaking" state. My answer was short, but it definitely came after the bot had finished saying its first sentence.
Looking at the logs, I noticed a weird behaviour of the BotStartedSpeaking and BotStoppedSpeaking frames. After the last words of the sentence were spoken, I observed the BotStoppedSpeaking frame, as expected. However, about 1.5 seconds later, another BotStartedSpeaking appeared, followed very quickly by a BotStoppedSpeaking. The STT data was handled just after this second BotStartedSpeaking frame was issued, which led to it being discarded.
What is wrong is that the second BotStartedSpeaking frame was triggered at all.
Current Approach
I tried disabling the silence that is sent in AudioContextWordTTSService._audio_context_task_handler, and the behaviour disappeared.
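A minimal, self-contained sketch of what that change amounts to. The pipecat internals are stubbed out here so it runs standalone; the class is hypothetical, and the attribute/method names mirror the handler quoted further down in this issue:

```python
import asyncio

class AudioContextLoopWithoutSilence:
    """Hypothetical stand-in for AudioContextWordTTSService with the
    inter-context silence push removed. Queue/dict attribute names mirror
    the handler quoted in this issue; everything else is stubbed."""

    def __init__(self):
        self._contexts_queue = asyncio.Queue()
        self._contexts = {}
        self.pushed = []  # frames "pushed downstream", recorded for inspection

    async def _handle_audio_context(self, context_id):
        # Stub: pretend we streamed this context's audio downstream.
        self.pushed.append(("audio", context_id))

    async def _audio_context_task_handler(self):
        while True:
            context_id = await self._contexts_queue.get()
            if context_id is None:
                break
            await self._handle_audio_context(context_id)
            self._contexts.pop(context_id, None)
            # No silence TTSAudioRawFrame is pushed here, so the output
            # transport's speaking heuristic is not re-triggered after
            # the sentence ends.

async def demo():
    svc = AudioContextLoopWithoutSilence()
    for item in ("ctx-1", "ctx-2", None):
        await svc._contexts_queue.put(item)
    await svc._audio_context_task_handler()
    return svc.pushed

# asyncio.run(demo()) -> [("audio", "ctx-1"), ("audio", "ctx-2")]
```

In practice this was done by overriding the handler in a subclass of the TTS service; the sketch only illustrates the shape of the loop without the silence frame.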
Errors or Unexpected Behavior
A second BotStartedSpeaking frame is triggered due to the silence appended in AudioContextWordTTSService._audio_context_task_handler:
async def _audio_context_task_handler(self):
    """In this task we process audio contexts in order."""
    running = True
    while running:
        context_id = await self._contexts_queue.get()
        if context_id:
            # Process the audio context until the context doesn't have more
            # audio available (i.e. we find None).
            await self._handle_audio_context(context_id)
            # We just finished processing the context, so we can safely remove it.
            del self._contexts[context_id]
            # Append some silence between sentences.
            silence = b"\x00" * self.sample_rate
            frame = TTSAudioRawFrame(
                audio=silence, sample_rate=self.sample_rate, num_channels=1
            )
            logger.debug("AudioContextWordTTSService: Push silence between audio contexts")
            await self.push_frame(frame)
        else:
            running = False
Additional Context
Logs:
2025-09-22 19:01:43.892 | DEBUG | pipecat.transports.base_output:_audio_task_handler:691 - Setting is_speaking = True (1)
......
2025-09-22 19:01:43.971 | DEBUG | pipecat.transports.base_output:_audio_task_handler:691 - Setting is_speaking = True (1)
2025-09-22 19:01:44.532 | DEBUG | pipecat.transports.base_output:with_mixer:655 - call _bot_stopped_speaking QueueEmpty 0.3607361316680908 0.35
2025-09-22 19:01:44.535 | DEBUG | pipecat.transports.base_output:_bot_stopped_speaking:586 - Bot stopped speaking
2025-09-22 19:01:44.572 | DEBUG | pipecat.transports.base_output:with_mixer:655 - call _bot_stopped_speaking QueueEmpty 0.40029406547546387 0.35
2025-09-22 19:01:44.610 | DEBUG | pipecat.transports.base_output:with_mixer:655 - call
....
2025-09-22 19:01:45.132 | DEBUG | pipecat.transports.base_output:with_mixer:655 - call _bot_stopped_speaking QueueEmpty 0.9602780342102051 0.35
2025-09-22 19:01:45.138 | DEBUG | pipecat.services.tts_service:_audio_context_task_handler:837 - AudioContextWordTTSService: Push silence between audio contexts
2025-09-22 19:01:45.139 | DEBUG | pipecat.observers.loggers.debug_log_observer:on_push_frame:217 - PatchedElevenLabsTTSService#0 → FastAPIWebsocketOutputTransport#0: TTSAudioRawFrame sample_rate: 8000, num_channels: 1, num_frames: 4000, id: 536, name: 'TTSAudioRawFrame#49', metadata: {} at 3.90s <---- GENERATION OF SILENCE
2025-09-22 19:01:45.170 | DEBUG | pipecat.transports.base_output:_audio_task_handler:691 - Setting is_speaking = True (1) <-------- THIS ONE SHOULD NOT HAPPEN
2025-09-22 19:01:45.171 | DEBUG | pipecat.transports.base_output:_bot_started_speaking:570 - Bot started speaking
.....
2025-09-22 19:01:45.610 | DEBUG | pipecat.transports.base_output:_audio_task_handler:691 - Setting is_speaking = True (1).
2025-09-22 19:01:45.971 | DEBUG | pipecat.transports.base_output:with_mixer:655 - call _bot_stopped_speaking QueueEmpty 0.3618021011352539 0.35
2025-09-22 19:01:45.973 | DEBUG | pipecat.transports.base_output:_bot_stopped_speaking:586 - Bot stopped speaking
2025-09-22 19:01:46.011 | DEBUG | pipecat.transports.base_output:with_mixer:655 - call _bot_stopped_speaking QueueEmpty 0.40149402618408203 0.35
2025-09-22 19:01:46.052 | DEBUG | pipecat.transports.base_output:with_mixer:655 - call _bot_stopped_speaking QueueEmpty 0.4419729709625244 0.35
2025-09-22 19:01:46.091 | DEBUG | pipecat.transports.base_output:with_mixer:655 - call _bot_stopped_speaking QueueEmpty 0.4809720516204834 0.35
The modified code used to produce these logs (base_output.py) is in the attachment.
@aconchillo @markbackman, this should probably be considered a bug. WDYT? Notify: @alexflorensa
Can confirm this is an issue, using ElevenLabs.
This is a known weakness in Pipecat. The bot's speaking state is based on a VAD-like heuristic. We have plans to improve how this works. @aconchillo has this on his to-do list.
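The logs above hint at how that heuristic behaves: any outgoing audio frame marks the bot as speaking, and the bot is marked stopped once the output queue has been empty for longer than a threshold (the `0.35` value in the `QueueEmpty` log lines). A toy model under those assumptions (this is not pipecat's actual implementation, just an illustration of why a silence frame re-triggers "Bot started speaking"):

```python
import time

BOT_STOPPED_SPEAKING_TIMEOUT = 0.35  # seconds, matching the "0.35" in the logs

class SpeakingHeuristic:
    """Toy model of an output transport's VAD-like bot-speaking state:
    any audio frame (including pure silence) marks the bot as speaking;
    the bot is marked stopped once no audio arrives for the timeout."""

    def __init__(self):
        self.is_speaking = False
        self._last_audio = None

    def on_audio_frame(self):
        # Silence bytes are indistinguishable from speech here: the
        # heuristic only sees "an audio frame arrived".
        self._last_audio = time.monotonic()
        if not self.is_speaking:
            self.is_speaking = True  # would emit "Bot started speaking"

    def tick(self):
        if self.is_speaking and self._last_audio is not None:
            if time.monotonic() - self._last_audio > BOT_STOPPED_SPEAKING_TIMEOUT:
                self.is_speaking = False  # would emit "Bot stopped speaking"
```

Under this model, the sentence's audio ends, the timeout elapses ("Bot stopped speaking"), and then the 0.5 s silence frame from the TTS service arrives and flips the state back to speaking, which is exactly the spurious second BotStartedSpeaking reported above.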
Hi @markbackman @aconchillo, did you find any solution to this? I am facing the same issue.
There is a WIP branch that @aconchillo has but he's been busy working on other things. You can see this response for guidance to avoid this issue: https://github.com/pipecat-ai/pipecat/issues/3092#issuecomment-3568102113
Note HTTP services are most vulnerable to this issue since they can be slower to return generated audio.
@markbackman Changes suggested by @anotine10 might actually be better. Can we raise a PR for this? Do you see any downside to this?