
Silence AudioFrame at the end of sentence triggered "Bot Start Speaking"

Open anotine10 opened this issue 3 months ago • 5 comments

pipecat version

0.0.78

Python version

3.13

Operating System

Ubuntu

Use Case Description

I was debugging why a short answer given by a user right after a question from the bot was not handled. I discovered that the transcribed STT data was unexpectedly discarded because the bot was supposedly still in the "Speaking" state. My answer was short, but it definitely came after the bot had finished saying the first sentence.

Looking at the logs, I noticed odd behaviour of the BotStartedSpeakingFrame and BotStoppedSpeakingFrame. After the last words of the sentence were spoken, I observed the BotStoppedSpeakingFrame (as expected). However, about 1.5 s later I saw another BotStartedSpeakingFrame, followed very quickly by another BotStoppedSpeakingFrame. The user's transcription was handled just after the second BotStartedSpeakingFrame was issued, which led to it being discarded.

What was wrong is that the second BotStartedSpeakingFrame was triggered at all.

Current Approach

I tried disabling the silence that is sent in AudioContextWordTTSService._audio_context_task_handler, and the behaviour disappeared.
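For reference, the workaround amounts to not pushing the silence frame after a context finishes. A minimal sketch of that idea (my own illustration, not actual pipecat code — `BaseTTSService` is a stub standing in for AudioContextWordTTSService, since the real class needs a running pipeline):

```python
# Hypothetical sketch of the workaround: override the context-end step
# so no silence frame is pushed to the output transport.
import asyncio


class BaseTTSService:
    """Stub base: mimics pushing zero bytes after each audio context."""

    def __init__(self, sample_rate: int = 8000):
        self.sample_rate = sample_rate
        self.pushed: list[bytes] = []

    async def push_frame(self, frame: bytes) -> None:
        self.pushed.append(frame)

    async def after_context(self) -> None:
        # Mirrors the library's behaviour of appending silence.
        await self.push_frame(b"\x00" * self.sample_rate)


class NoSilenceTTSService(BaseTTSService):
    """Skip the inter-context silence so the output transport never
    sees audio after the bot has actually finished speaking."""

    async def after_context(self) -> None:
        pass  # intentionally push nothing


async def demo() -> tuple[int, int]:
    default, patched = BaseTTSService(), NoSilenceTTSService()
    await default.after_context()
    await patched.after_context()
    return len(default.pushed), len(patched.pushed)


print(asyncio.run(demo()))  # → (1, 0)
```

The patched service pushes zero frames after the context ends, so the output transport's speaking heuristic has nothing to react to.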

Errors or Unexpected Behavior

A second BotStartedSpeakingFrame was triggered by the silence appended in AudioContextWordTTSService._audio_context_task_handler:

    async def _audio_context_task_handler(self):
        """In this task we process audio contexts in order."""
        running = True
        while running:
            context_id = await self._contexts_queue.get()

            if context_id:
                # Process the audio context until the context doesn't have more
                # audio available (i.e. we find None).
                await self._handle_audio_context(context_id)

                # We just finished processing the context, so we can safely remove it.
                del self._contexts[context_id]

                # Append some silence between sentences.
                silence = b"\x00" * self.sample_rate
                frame = TTSAudioRawFrame(
                    audio=silence, sample_rate=self.sample_rate, num_channels=1
                )
                logger.debug("AudioContextWordTTSService: Push silence between audio contexts")
                await self.push_frame(frame)
            else:
                running = False

Additional Context

Logs:

2025-09-22 19:01:43.892 | DEBUG    | pipecat.transports.base_output:_audio_task_handler:691 - Setting is_speaking = True (1)
......
2025-09-22 19:01:43.971 | DEBUG    | pipecat.transports.base_output:_audio_task_handler:691 - Setting is_speaking = True (1)
2025-09-22 19:01:44.532 | DEBUG    | pipecat.transports.base_output:with_mixer:655 - call _bot_stopped_speaking QueueEmpty 0.3607361316680908 0.35
2025-09-22 19:01:44.535 | DEBUG    | pipecat.transports.base_output:_bot_stopped_speaking:586 - Bot stopped speaking
2025-09-22 19:01:44.572 | DEBUG    | pipecat.transports.base_output:with_mixer:655 - call _bot_stopped_speaking QueueEmpty 0.40029406547546387 0.35
2025-09-22 19:01:44.610 | DEBUG    | pipecat.transports.base_output:with_mixer:655 - call 
....
2025-09-22 19:01:45.132 | DEBUG    | pipecat.transports.base_output:with_mixer:655 - call _bot_stopped_speaking QueueEmpty 0.9602780342102051 0.35
2025-09-22 19:01:45.138 | DEBUG    | pipecat.services.tts_service:_audio_context_task_handler:837 - AudioContextWordTTSService: Push silence between audio contexts
2025-09-22 19:01:45.139 | DEBUG    | pipecat.observers.loggers.debug_log_observer:on_push_frame:217 - PatchedElevenLabsTTSService#0 → FastAPIWebsocketOutputTransport#0: TTSAudioRawFrame sample_rate: 8000, num_channels: 1, num_frames: 4000, id: 536, name: 'TTSAudioRawFrame#49', metadata: {} at 3.90s <---- GENERATION OF SILENCE
2025-09-22 19:01:45.170 | DEBUG    | pipecat.transports.base_output:_audio_task_handler:691 - Setting is_speaking = True (1) <-------- THIS ONE SHOULD NOT HAPPEN
2025-09-22 19:01:45.171 | DEBUG    | pipecat.transports.base_output:_bot_started_speaking:570 - Bot started speaking
.....
2025-09-22 19:01:45.610 | DEBUG    | pipecat.transports.base_output:_audio_task_handler:691 - Setting is_speaking = True (1).  
2025-09-22 19:01:45.971 | DEBUG    | pipecat.transports.base_output:with_mixer:655 - call _bot_stopped_speaking QueueEmpty 0.3618021011352539 0.35
2025-09-22 19:01:45.973 | DEBUG    | pipecat.transports.base_output:_bot_stopped_speaking:586 - Bot stopped speaking
2025-09-22 19:01:46.011 | DEBUG    | pipecat.transports.base_output:with_mixer:655 - call _bot_stopped_speaking QueueEmpty 0.40149402618408203 0.35
2025-09-22 19:01:46.052 | DEBUG    | pipecat.transports.base_output:with_mixer:655 - call _bot_stopped_speaking QueueEmpty 0.4419729709625244 0.35
2025-09-22 19:01:46.091 | DEBUG    | pipecat.transports.base_output:with_mixer:655 - call _bot_stopped_speaking QueueEmpty 0.4809720516204834 0.35

The modified code used to produce these logs (base_output.py) is attached:

base_output_patched.py


anotine10 avatar Sep 22 '25 17:09 anotine10

@aconchillo @markbackman, this should probably be considered a bug. WDYT? Notifying @alexflorensa.

anotine10 avatar Sep 22 '25 17:09 anotine10

Can confirm this is an issue, using ElevenLabs.

disolaterX avatar Oct 09 '25 22:10 disolaterX

This is a known weakness in Pipecat. The bot's speaking state is based on a VAD-like heuristic over the output audio. We have plans to improve how this works. @aconchillo has this on his to-do list.
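To illustrate the heuristic described above (names and structure are my own; pipecat's actual implementation differs): the output transport marks the bot as speaking whenever audio frames are being written and marks it stopped after a quiet timeout (0.35 s per the patched logs). Any later audio frame, including a frame of pure silence bytes, restarts the speaking state:

```python
# Illustrative sketch of a VAD-like output-audio heuristic; the 0.35 s
# timeout matches the value visible in the patched logs above.
from dataclasses import dataclass, field

STOP_TIMEOUT = 0.35  # seconds without audio before "stopped speaking"


@dataclass
class SpeakingTracker:
    is_speaking: bool = False
    last_audio_time: float = -1.0
    events: list = field(default_factory=list)

    def on_audio_frame(self, now: float) -> None:
        # Any audio frame, even all-zero silence bytes, counts as speech.
        if not self.is_speaking:
            self.is_speaking = True
            self.events.append(("started", now))
        self.last_audio_time = now

    def tick(self, now: float) -> None:
        if self.is_speaking and now - self.last_audio_time > STOP_TIMEOUT:
            self.is_speaking = False
            self.events.append(("stopped", now))


tracker = SpeakingTracker()
tracker.on_audio_frame(0.0)  # real TTS audio
tracker.tick(0.5)            # quiet > 0.35 s -> stopped
tracker.on_audio_frame(1.0)  # silence frame pushed between contexts
tracker.tick(2.0)            # -> stopped again
print(tracker.events)
# → [('started', 0.0), ('stopped', 0.5), ('started', 1.0), ('stopped', 2.0)]
```

The second "started" event is exactly the spurious BotStartedSpeakingFrame reported in this issue.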

markbackman avatar Nov 06 '25 21:11 markbackman

Hi @markbackman @aconchillo, did you find a solution to this? I am facing the same issue.

omensky avatar Nov 22 '25 07:11 omensky

There is a WIP branch that @aconchillo has, but he's been busy working on other things. See this response for guidance on avoiding the issue: https://github.com/pipecat-ai/pipecat/issues/3092#issuecomment-3568102113

Note that HTTP services are most vulnerable to this issue, since they can be slower to return generated audio.

markbackman avatar Nov 23 '25 15:11 markbackman

@markbackman The changes suggested by @anotine10 might actually be better. Can we raise a PR for this? Do you see any downsides?

omensky avatar Dec 07 '25 17:12 omensky