
VAD not emitting UserStoppedSpeakingFrame, causing the bot to get stuck and not respond

Open sphatate opened this issue 8 months ago • 12 comments

We are using VAD -> STT -> LLM -> TTS architecture.

We have often observed that the bot doesn't respond. After debugging, we found that the VAD is not emitting a UserStoppedSpeakingFrame.

Further debugging made me realize that the issue is caused because

    if (
        self._vad_state == VADState.STOPPING
        and self._vad_stopping_count >= self._vad_stop_frames
    ):
        self._vad_state = VADState.QUIET
        self._vad_stopping_count = 0

In the condition above, _vad_stopping_count is always less than _vad_stop_frames, and I don't understand why this is happening. We have set stop_secs to 0.8, so the _vad_stop_frames value is ~35, but _vad_stopping_count always stops incrementing somewhere in the range 10 to 13 (i.e. it never goes above 13), so the condition is never satisfied.
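To make the counter behavior easier to reason about, here is a minimal sketch of a frame-counting VAD state machine. It is a simplified stand-in written for illustration (the ToyVAD class, its attribute names, and the start_frames value are assumptions, not Pipecat's actual code). It shows one way the stopping counter could stall below the threshold: the counter only advances when a new audio frame is processed, so if frames stop arriving, the state stays at STOPPING.

```python
from enum import Enum


class VADState(Enum):
    QUIET = 1
    STARTING = 2
    SPEAKING = 3
    STOPPING = 4


class ToyVAD:
    """Simplified, hypothetical frame-counting VAD; not Pipecat's implementation."""

    def __init__(self, start_frames: int, stop_frames: int):
        self.start_frames = start_frames
        self.stop_frames = stop_frames
        self.state = VADState.QUIET
        self.starting_count = 0
        self.stopping_count = 0

    def process(self, is_speech: bool) -> VADState:
        """Advance the state machine by one audio frame."""
        if is_speech:
            if self.state == VADState.QUIET:
                self.state = VADState.STARTING
                self.starting_count = 1
            elif self.state == VADState.STARTING:
                self.starting_count += 1
            elif self.state == VADState.STOPPING:
                # Speech resumed: cancel the pending stop and reset the counter.
                self.state = VADState.SPEAKING
                self.stopping_count = 0
        else:
            if self.state == VADState.STARTING:
                self.state = VADState.QUIET
                self.starting_count = 0
            elif self.state == VADState.SPEAKING:
                self.state = VADState.STOPPING
                self.stopping_count = 1
            elif self.state == VADState.STOPPING:
                self.stopping_count += 1

        if self.state == VADState.STARTING and self.starting_count >= self.start_frames:
            self.state = VADState.SPEAKING
            self.starting_count = 0
        if self.state == VADState.STOPPING and self.stopping_count >= self.stop_frames:
            self.state = VADState.QUIET
            self.stopping_count = 0
        return self.state


vad = ToyVAD(start_frames=3, stop_frames=35)
for _ in range(5):
    vad.process(True)   # user speaks
for _ in range(12):
    vad.process(False)  # silence begins, then no more frames arrive
stalled_state, stalled_count = vad.state, vad.stopping_count
```

In this toy model, the counter sits at 12 and the state remains STOPPING until more frames are processed. A speech frame arriving during STOPPING would also reset the counter, which is a second way the threshold might never be reached.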

sphatate avatar Apr 01 '25 08:04 sphatate

Two questions:

  • What are your VAD settings?
  • What version of Pipecat?

In 0.0.57, we added handling for the case where the VAD doesn't fire but a TranscriptionFrame is received. In that case, a completion still occurs.

In my experience, this works robustly.

markbackman avatar Apr 03 '25 01:04 markbackman

Hi @markbackman

these are our vad settings

    confidence = 0.5
    start_secs = 0.2
    stop_secs = 0.8
    volume = 0.5

We are using pipecat version 0.0.60

sphatate avatar Apr 03 '25 03:04 sphatate

Does it have something to do with the Deepgram or Azure transcriber? We are facing this issue with both.

Also this is not happening with phone calls, rather this is happening with WebSocket when we do web-calls

sphatate avatar Apr 03 '25 03:04 sphatate

This is not a known issue. It sounds like something in your pipeline might be blocking the UserStoppedSpeakingFrame. A few questions:

  • Have you customized any parts of Pipecat?
  • Do you have any wrappers around the services (STT, LLM, or TTS)?
  • Do you have any custom processors in your pipeline?

If yes to any of those, please make sure you're pushing frames down the pipeline in all cases.
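As a rough illustration of why this matters, the sketch below uses toy stand-ins (the Frame, LeakyFilter, and SafeFilter classes and the run helper are hypothetical and are not Pipecat's FrameProcessor API). It contrasts a processor that only forwards the frames it cares about with one that forwards every frame.

```python
from dataclasses import dataclass


@dataclass
class Frame:
    """Toy frame type identified only by its name (hypothetical)."""
    name: str


class LeakyFilter:
    """Handles transcription frames but forgets to forward anything else."""

    def process(self, frame: Frame) -> list:
        if frame.name == "TranscriptionFrame":
            return [frame]
        return []  # bug: every other frame is silently dropped


class SafeFilter:
    """Does its custom work, then forwards every frame in all cases."""

    def process(self, frame: Frame) -> list:
        if frame.name == "TranscriptionFrame":
            pass  # custom handling would go here
        return [frame]


def run(processor, frames):
    """Feed frames through one processor and return the names that come out."""
    out = []
    for frame in frames:
        out.extend(processor.process(frame))
    return [f.name for f in out]


frames = [
    Frame("UserStartedSpeakingFrame"),
    Frame("TranscriptionFrame"),
    Frame("UserStoppedSpeakingFrame"),
]
leaky_output = run(LeakyFilter(), frames)
safe_output = run(SafeFilter(), frames)
```

In this toy setup, anything downstream of LeakyFilter never sees the UserStoppedSpeakingFrame, which would look like the VAD never emitted it.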

markbackman avatar Apr 03 '25 11:04 markbackman

@sphatate any update? Otherwise, I'll close the issue.

markbackman avatar Apr 12 '25 01:04 markbackman

I have not customized anything. This is only happening over WebSocket when used for web calls. We are not facing this issue with phone calls on Twilio.

sphatate avatar Apr 12 '25 02:04 sphatate

Do you have a single-file repro that you can share, along with a description of what to look for?

markbackman avatar Apr 12 '25 02:04 markbackman

@sphatate is this still an issue for you?

markbackman avatar Apr 18 '25 13:04 markbackman

Hi @markbackman

We are having the exact same issue, but on telephony. We can share all the details. We have tried to explore what might be causing this: the VAD starts with a user_started_speaking frame, but even after multiple seconds of silence it doesn't trigger user_stopped_speaking and gets stuck.

    confidence = 0.9
    start_secs = 0.2
    stop_secs = 1.0
    volume = 0.6

Our version is 0.0.63.

tieincred avatar Apr 23 '25 13:04 tieincred

@tieincred, may I ask why you set confidence = 0.9? This might be causing issues for the VAD, as you're making it more difficult to identify user speech. I'd highly recommend using the default confidence value.

markbackman avatar Apr 24 '25 15:04 markbackman

@markbackman Thanks a lot for replying so quickly.

But if the confidence cutoff is high, shouldn't it be easier for the VAD to report "no speech/voice activity", since it needs really high confidence to classify a chunk as speech/voice activity?

Apart from that, I changed the confidence to 0.9 because I was facing the same issue with the default.

tieincred avatar Apr 24 '25 15:04 tieincred

To be helpful on this issue, I think I'll need a single file repro of the issue along with repro steps.

I've done a significant amount of testing over the months and have never seen this issue. This logic hasn't changed in quite some time and is battle-tested.

Asking @aconchillo for an opinion, too.

markbackman avatar Apr 24 '25 17:04 markbackman

Facing the same issue during interruptions on v0.69. The pipeline sometimes freezes.

kartikay1999 avatar Jun 14 '25 08:06 kartikay1999

What transport is being used?

We've done some work in this area in the following PR: #2004

There's a failure mode specific to websocket transports that can leave the VAD stuck in a SPEAKING state. The issue is that websockets communicate over TCP and don't handle network disconnects. This results in a state where the transport no longer provides input audio, which prevents the VAD state from changing.

Consider this example:

  • User starts speaking -> VAD transitions to SPEAKING state
  • Websocket disconnects
  • VAD stuck in SPEAKING state

When the websocket is no longer connected, there's no additional audio received by the VAD to change the state to QUIET. This is different than WebRTC which will communicate silence even when there are network disruptions. So, this failure mode is specific to websockets.

The PR I linked handles this case by detecting no input, triggering a timeout and warning, then switching the VAD state to QUIET.
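The general idea of that approach can be sketched as follows. This is a hypothetical illustration and not the code from the linked PR: the InputWatchdog class, its method names, and the timeout value are all assumptions. It tracks when audio was last received and forces the speaking state off once no input has arrived for a configured timeout.

```python
import time
from typing import Callable


class InputWatchdog:
    """Hypothetical helper that resets a stuck 'speaking' state when audio stops arriving."""

    def __init__(self, timeout_secs: float, clock: Callable[[], float] = time.monotonic):
        self.timeout_secs = timeout_secs
        self.clock = clock  # injectable so the logic can be tested without sleeping
        self.speaking = False
        self.last_audio_at = clock()

    def on_audio(self, vad_speaking: bool) -> None:
        """Record that an audio frame arrived and what the VAD reported for it."""
        self.last_audio_at = self.clock()
        self.speaking = vad_speaking

    def check(self) -> bool:
        """Call periodically; returns True if a stuck speaking state was reset."""
        idle = self.clock() - self.last_audio_at
        if self.speaking and idle >= self.timeout_secs:
            # No input for too long: treat it as the user having stopped speaking.
            self.speaking = False
            return True
        return False


# Usage with a fake clock so the example is deterministic.
now = [0.0]
watchdog = InputWatchdog(timeout_secs=1.0, clock=lambda: now[0])
watchdog.on_audio(vad_speaking=True)  # user starts speaking, then the socket drops
now[0] = 0.5
early = watchdog.check()   # still within the timeout
now[0] = 1.5
forced = watchdog.check()  # timeout elapsed with no audio
```

A real implementation would also need to emit the corresponding stopped-speaking event and log a warning when the reset fires. Those parts are omitted here.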

This improvement will be included in the upcoming release.


Someone mentioned this above:

rather this is happening with WebSocket when we do web-calls

We strongly recommend using a WebRTC transport for client/server apps—for this reason and many others!

markbackman avatar Jun 21 '25 16:06 markbackman

I get the same with 0.0.95. I'm using LocalSmartTurnAnalyzerV3 and SileroVADAnalyzer(params=VADParams(stop_secs=0.2)) with Twilio via WebSockets. It breaks ElevenLabs STT with the manual commit_strategy, because it waits for an explicit UserStoppedSpeakingFrame. It looks like Silero fails to detect that the user started speaking.

BubaVV avatar Nov 24 '25 13:11 BubaVV

I have not seen this. My guess is that you're impacted by an issue with the ElevenLabsRealtimeSTTService because you're setting the sample rate. That's being fixed here: https://github.com/pipecat-ai/pipecat/pull/3106

The VAD works reliably in my experience. I've never seen the VAD not emit an event indicating the user stopped speaking.

In fact, I'm closing this issue because it lacks a repro and activity. If someone has a repro case for this, we can investigate.

markbackman avatar Nov 24 '25 14:11 markbackman