VAD not emitting UserStoppedSpeakingFrame, causing the bot to get stuck and not respond
We are using a VAD -> STT -> LLM -> TTS architecture.
A lot of the time we have observed that the bot doesn't respond. After debugging, we found that the VAD is not emitting a UserStoppedSpeakingFrame.
Further debugging made me realize that the issue is caused by this condition:

```python
if (
    self._vad_state == VADState.STOPPING
    and self._vad_stopping_count >= self._vad_stop_frames
):
    self._vad_state = VADState.QUIET
    self._vad_stopping_count = 0
```
In the above condition, `_vad_stopping_count` is always less than `_vad_stop_frames`, and I don't understand why this is happening. We have set `stop_secs` to 0.8, so the
`_vad_stop_frames` value is ~35, but `_vad_stopping_count` always stops incrementing somewhere in the range 10 to 13 (i.e. it never goes above 13), so the condition is never satisfied.
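For context, here is a back-of-the-envelope sketch of how a `stop_secs` value maps to a chunk-count threshold like `_vad_stop_frames`. The sample rate and chunk size below are assumptions (a common Silero configuration), not Pipecat's exact internals; a different chunk duration in your setup would explain a different threshold such as ~35:

```python
# Sketch (assumed values, not Pipecat's exact code) of how stop_secs
# maps to a frame-count threshold: the VAD processes audio in
# fixed-size chunks, so stop_secs is converted into a number of
# consecutive "quiet" chunks that must be observed before QUIET fires.
SAMPLE_RATE = 16000      # Hz (assumed VAD input rate)
CHUNK_SAMPLES = 512      # samples per analysis chunk (assumed)
stop_secs = 0.8

chunk_secs = CHUNK_SAMPLES / SAMPLE_RATE          # 0.032 s per chunk
vad_stop_frames = round(stop_secs / chunk_secs)   # 25 with these numbers
print(vad_stop_frames)
```

If the counter stalls well below this threshold even during silence, the VAD is either not receiving audio chunks at all, or the chunks keep being classified as speech.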
Two questions:
- What are your VAD settings?
- What version of Pipecat?
In 0.0.57, we added handling for the case where the VAD doesn't fire but a TranscriptionFrame is received. This ensures that a completion still occurs.
In my experience, this works robustly.
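The fallback described above can be sketched with stand-in classes (this is an illustration of the idea, not Pipecat's actual aggregator code): if a final transcription arrives but the "stopped speaking" event never did, the completion runs anyway instead of stalling.

```python
# Stand-in sketch (not Pipecat's real aggregator) of the 0.0.57
# fallback: trigger the LLM completion on a final transcription even
# when no UserStoppedSpeakingFrame was ever emitted by the VAD.
class Aggregator:
    def __init__(self):
        self.user_speaking = False
        self.completions = []

    def on_user_started_speaking(self):
        self.user_speaking = True

    def on_user_stopped_speaking(self, text):
        self.user_speaking = False
        self.completions.append(text)

    def on_transcription(self, text):
        # Fallback path: the VAD never fired "stopped", but we have a
        # final transcript, so run the completion instead of stalling.
        if self.user_speaking:
            self.on_user_stopped_speaking(text)

agg = Aggregator()
agg.on_user_started_speaking()
agg.on_transcription("hello there")   # VAD stop never arrived
print(agg.completions)
```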
Hi @markbackman
These are our VAD settings:

```python
confidence = 0.5
start_secs = 0.2
stop_secs = 0.8
volume = 0.5
```
We are using Pipecat version 0.0.60.
Does it have something to do with the Deepgram or Azure transcriber? We are facing this issue with both.
Also, this is not happening with phone calls; it only happens over WebSocket when we do web calls.
This is not a known issue. It sounds like something in your pipeline might be blocking the UserStoppedSpeakingFrame. Few questions:
- Have you customized any parts of Pipecat?
- Do you have any wrappers around the services (STT, LLM, or TTS)?
- Do you have any custom processors in your pipeline?
If yes to any of those, please make sure you're pushing frames down the pipeline in all cases.
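To illustrate why this matters, here is a minimal sketch using simplified stand-in classes (not the real Pipecat `FrameProcessor` API): a custom processor that forwards only the frames it cares about will silently drop control frames like `UserStoppedSpeakingFrame`, and everything downstream stalls.

```python
# Simplified stand-ins to show the rule: a custom processor must
# forward every frame it doesn't consume, or downstream processors
# (e.g. an aggregator waiting for UserStoppedSpeakingFrame) never see
# it and the pipeline freezes.
class Frame:
    def __init__(self, name):
        self.name = name

class Processor:
    def __init__(self):
        self.received = []
        self.next = None

    def push_frame(self, frame):
        if self.next:
            self.next.process_frame(frame)

    def process_frame(self, frame):
        self.received.append(frame.name)
        self.push_frame(frame)   # correct: always forward

class BrokenProcessor(Processor):
    def process_frame(self, frame):
        # BUG: only forwards audio frames; control frames such as
        # UserStoppedSpeakingFrame are silently dropped.
        if frame.name == "Audio":
            self.push_frame(frame)

broken, sink = BrokenProcessor(), Processor()
broken.next = sink
for name in ("Audio", "UserStoppedSpeakingFrame"):
    broken.process_frame(Frame(name))
print(sink.received)  # the stop frame never arrives downstream
```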
@sphatate any update? Otherwise, I'll close the issue.
I have not customized anything. This only happens over WebSocket when used for web calls; we are not facing this issue with phone calls on Twilio.
Do you have a single-file repro that you can share, along with notes on what to look for?
@sphatate is this still an issue for you?
Hi @markbackman
We are having the same issue, and we're seeing the exact same behavior on telephony. We can share all the details. We have tried to explore what might be causing this: the VAD starts with a UserStartedSpeaking frame, but even after multiple seconds of silence it never triggers UserStoppedSpeaking and gets stuck.
```python
confidence = 0.9
start_secs = 0.2
stop_secs = 1.0
volume = 0.6
```

Our version is 0.0.63.
@tieincred, may I ask why you set confidence = 0.9? This might be causing issues for the VAD, as you're making it more difficult to identify user speech. I'd highly recommend using the default confidence value.
@markbackman Thanks a lot for replying so quickly.
But if the cutoff confidence is high, shouldn't it be easier for the VAD to register "no speech/voice activity", since a chunk needs really high confidence to be classified as speech/voice activity?
Apart from that, I had changed the confidence to 0.9 because I was facing the same issue with the default.
To be helpful on this issue, I think I'll need a single-file repro of the issue along with repro steps.
I've done a significant amount of testing over the months and have never seen this issue. This logic hasn't changed in quite some time and is battle-tested.
Asking @aconchillo for an opinion, too.
Facing the same issue during interruptions on v0.69; the pipeline sometimes freezes.
What transport is being used?
We've done some work in this area in the following PR: #2004
There's a failure mode specific to websocket transports that can leave the VAD stuck in a SPEAKING state. The issue is that websockets communicate with TCP and don't handle network disconnects. This results in a state where the transport no longer provides input audio, which prevents the VAD state from changing.
Consider this example:
- User starts speaking -> VAD transitions to SPEAKING state
- Websocket disconnects
- VAD stuck in SPEAKING state
When the websocket is no longer connected, no additional audio reaches the VAD to change the state to QUIET. This is different from WebRTC, which will communicate silence even when there are network disruptions. So, this failure mode is specific to websockets.
The PR I linked handles this case by detecting no input, triggering a timeout and warning, then switching the VAD state to QUIET.
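The timeout idea can be sketched like this. This is a hypothetical watchdog written for illustration, not the code from the PR; the class name, timeout value, and synthesized frame string are all assumptions:

```python
import asyncio
from enum import Enum

class VADState(Enum):
    QUIET = 0
    SPEAKING = 1

# Hypothetical watchdog (a sketch, not the actual PR code): if no audio
# arrives for `timeout` seconds while the VAD is SPEAKING, assume the
# transport silently died, force QUIET, and synthesize the missing
# "stopped speaking" transition.
class VadWatchdog:
    def __init__(self, timeout: float = 2.0):
        self.timeout = timeout
        self.state = VADState.QUIET
        self._last_audio = 0.0

    def on_audio(self, now: float):
        self._last_audio = now
        self.state = VADState.SPEAKING

    async def check(self, loop):
        await asyncio.sleep(self.timeout)
        if (self.state == VADState.SPEAKING
                and loop.time() - self._last_audio >= self.timeout):
            self.state = VADState.QUIET
            return "UserStoppedSpeakingFrame"  # synthesized transition
        return None

async def main():
    loop = asyncio.get_running_loop()
    wd = VadWatchdog(timeout=0.05)
    wd.on_audio(loop.time())   # user starts speaking...
    # ...then the websocket disconnects: no further audio ever arrives
    frame = await wd.check(loop)
    return frame, wd.state.name

print(asyncio.run(main()))
```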
This improvement will be included in the upcoming release.
Someone mentioned this above:

> rather this is happening with WebSocket when we do web-calls
We strongly recommend using a WebRTC transport for client/server apps—for this reason and many others!
Getting the same with 0.0.95. I'm using LocalSmartTurnAnalyzerV3 and SileroVADAnalyzer(params=VADParams(stop_secs=0.2)) with Twilio via WebSockets. It breaks ElevenLabs STT with a manual commit_strategy, because it waits for an explicit UserStoppedSpeakingFrame. It looks like Silero fails to detect that the user started speaking.
I have not seen this. My guess is that you're impacted by an issue with the ElevenLabsRealtimeSTTService because you're setting the sample rate. That's being fixed here: https://github.com/pipecat-ai/pipecat/pull/3106
The VAD works reliably in my experience. I've never seen the VAD not emit an event indicating the user stopped speaking.
In fact, I'm closing this issue because it lacks a repro and recent activity. If someone has a repro case for this, we can investigate.