VAD not emitting UserStoppedSpeakingFrame, causing the bot to get stuck and not respond
We are using a VAD -> STT -> LLM -> TTS architecture.
A lot of the time we have observed that the bot doesn't respond. After debugging, we found that the VAD is not emitting a UserStoppedSpeakingFrame.
Further debugging made me realize that the issue is caused by this condition:

```python
if (
    self._vad_state == VADState.STOPPING
    and self._vad_stopping_count >= self._vad_stop_frames
):
    self._vad_state = VADState.QUIET
    self._vad_stopping_count = 0
```
In the above condition, `_vad_stopping_count` is always less than `_vad_stop_frames`, and I don't understand why this is happening. We have set `stop_secs` to 0.8, so the
`_vad_stop_frames` value is ~35, but `_vad_stopping_count` always stops incrementing somewhere in the range 10 to 13 (i.e. it never goes above 13), so the condition is never satisfied.
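For context, here is a back-of-the-envelope sketch of how a `stop_secs` value maps to a chunk-count threshold like `_vad_stop_frames`. The sample rate and chunk size below are assumptions (a common Silero configuration), not Pipecat's exact internals; a different chunk duration in your setup would explain a different threshold such as ~35:

```python
# Sketch (assumed values, not Pipecat's exact code) of how stop_secs
# maps to a frame-count threshold: the VAD processes audio in
# fixed-size chunks, so stop_secs is converted into a number of
# consecutive "quiet" chunks that must be observed before QUIET fires.
SAMPLE_RATE = 16000      # Hz (assumed VAD input rate)
CHUNK_SAMPLES = 512      # samples per analysis chunk (assumed)
stop_secs = 0.8

chunk_secs = CHUNK_SAMPLES / SAMPLE_RATE          # 0.032 s per chunk
vad_stop_frames = round(stop_secs / chunk_secs)   # 25 with these numbers
print(vad_stop_frames)
```

If the counter stalls well below this threshold even during silence, the VAD is either not receiving audio chunks at all, or the chunks keep being classified as speech.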
Two questions:
- What are your VAD settings?
- What version of Pipecat?
In 0.0.57, we added handling for the case where the VAD doesn't fire but a TranscriptionFrame is received. This ensures that a completion still occurs.
In my experience, this works robustly.
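The fallback described above can be sketched with stand-in classes (this is an illustration of the idea, not Pipecat's actual aggregator code): if a final transcription arrives but the "stopped speaking" event never did, the completion runs anyway instead of stalling.

```python
# Stand-in sketch (not Pipecat's real aggregator) of the 0.0.57
# fallback: trigger the LLM completion on a final transcription even
# when no UserStoppedSpeakingFrame was ever emitted by the VAD.
class Aggregator:
    def __init__(self):
        self.user_speaking = False
        self.completions = []

    def on_user_started_speaking(self):
        self.user_speaking = True

    def on_user_stopped_speaking(self, text):
        self.user_speaking = False
        self.completions.append(text)

    def on_transcription(self, text):
        # Fallback path: the VAD never fired "stopped", but we have a
        # final transcript, so run the completion instead of stalling.
        if self.user_speaking:
            self.on_user_stopped_speaking(text)

agg = Aggregator()
agg.on_user_started_speaking()
agg.on_transcription("hello there")   # VAD stop never arrived
print(agg.completions)
```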
Hi @markbackman
These are our VAD settings:

```python
confidence = 0.5
start_secs = 0.2
stop_secs = 0.8
volume = 0.5
```
We are using Pipecat version 0.0.60.
Does it have something to do with the Deepgram or Azure transcriber? We are facing this issue with both.
Also, this is not happening with phone calls; it only happens over WebSocket when we do web calls.
This is not a known issue. It sounds like something in your pipeline might be blocking the UserStoppedSpeakingFrame. Few questions:
- Have you customized any parts of Pipecat?
- Do you have any wrappers around the services (STT, LLM, or TTS)?
- Do you have any custom processors in your pipeline?
If yes to any of those, please make sure you're pushing frames down the pipeline in all cases.
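To illustrate why this matters, here is a minimal sketch using simplified stand-in classes (not the real Pipecat `FrameProcessor` API): a custom processor that forwards only the frames it cares about will silently drop control frames like `UserStoppedSpeakingFrame`, and everything downstream stalls.

```python
# Simplified stand-ins to show the rule: a custom processor must
# forward every frame it doesn't consume, or downstream processors
# (e.g. an aggregator waiting for UserStoppedSpeakingFrame) never see
# it and the pipeline freezes.
class Frame:
    def __init__(self, name):
        self.name = name

class Processor:
    def __init__(self):
        self.received = []
        self.next = None

    def push_frame(self, frame):
        if self.next:
            self.next.process_frame(frame)

    def process_frame(self, frame):
        self.received.append(frame.name)
        self.push_frame(frame)   # correct: always forward

class BrokenProcessor(Processor):
    def process_frame(self, frame):
        # BUG: only forwards audio frames; control frames such as
        # UserStoppedSpeakingFrame are silently dropped.
        if frame.name == "Audio":
            self.push_frame(frame)

broken, sink = BrokenProcessor(), Processor()
broken.next = sink
for name in ("Audio", "UserStoppedSpeakingFrame"):
    broken.process_frame(Frame(name))
print(sink.received)  # the stop frame never arrives downstream
```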
@sphatate any update? Otherwise, I'll close the issue.
I have not customized anything. This only happens over WebSocket when used for web calls; we are not facing this issue with phone calls on Twilio.
Do you have a single-file repro that you can share, along with notes on what to look for?
@sphatate is this still an issue for you?
Hi @markbackman
We are having the same issue, and we're seeing the exact same behavior on telephony. We can share all the details. We have tried to explore what might be causing this: the VAD starts with a UserStartedSpeaking frame, but even after multiple seconds of silence it never triggers UserStoppedSpeaking and gets stuck.
```python
confidence = 0.9
start_secs = 0.2
stop_secs = 1.0
volume = 0.6
```

Our version is 0.0.63.
@tieincred, may I ask why you set confidence = 0.9? This might be causing issues for the VAD, as you're making it more difficult to identify user speech. I'd highly recommend using the default confidence value.
@markbackman Thanks a lot for replying so quickly.
But if the cutoff confidence is high, shouldn't it be easier for the VAD to register "no speech/voice activity", since a chunk needs really high confidence to be classified as speech/voice activity?
Apart from that, I had changed the confidence to 0.9 because I was facing the same issue with the default.
To be helpful on this issue, I think I'll need a single-file repro of the issue along with repro steps.
I've done a significant amount of testing over the months and have never seen this issue. This logic hasn't changed in quite some time and is battle-tested.
Asking @aconchillo for an opinion, too.
Facing the same issue during interruptions on v0.69; the pipeline sometimes freezes.
What transport is being used?
We've done some work in this area in the following PR: #2004
There's a failure mode specific to websocket transports that can leave the VAD stuck in a SPEAKING state. The issue is that websockets communicate with TCP and don't handle network disconnects. This results in a state where the transport no longer provides input audio, which prevents the VAD state from changing.
Consider this example:
- User starts speaking -> VAD transitions to SPEAKING state
- Websocket disconnects
- VAD stuck in SPEAKING state
When the websocket is no longer connected, no additional audio reaches the VAD to change the state to QUIET. This is different from WebRTC, which will communicate silence even when there are network disruptions. So, this failure mode is specific to websockets.
The PR I linked handles this case by detecting no input, triggering a timeout and warning, then switching the VAD state to QUIET.
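The timeout idea can be sketched like this. This is a hypothetical watchdog written for illustration, not the code from the PR; the class name, timeout value, and synthesized frame string are all assumptions:

```python
import asyncio
from enum import Enum

class VADState(Enum):
    QUIET = 0
    SPEAKING = 1

# Hypothetical watchdog (a sketch, not the actual PR code): if no audio
# arrives for `timeout` seconds while the VAD is SPEAKING, assume the
# transport silently died, force QUIET, and synthesize the missing
# "stopped speaking" transition.
class VadWatchdog:
    def __init__(self, timeout: float = 2.0):
        self.timeout = timeout
        self.state = VADState.QUIET
        self._last_audio = 0.0

    def on_audio(self, now: float):
        self._last_audio = now
        self.state = VADState.SPEAKING

    async def check(self, loop):
        await asyncio.sleep(self.timeout)
        if (self.state == VADState.SPEAKING
                and loop.time() - self._last_audio >= self.timeout):
            self.state = VADState.QUIET
            return "UserStoppedSpeakingFrame"  # synthesized transition
        return None

async def main():
    loop = asyncio.get_running_loop()
    wd = VadWatchdog(timeout=0.05)
    wd.on_audio(loop.time())   # user starts speaking...
    # ...then the websocket disconnects: no further audio ever arrives
    frame = await wd.check(loop)
    return frame, wd.state.name

print(asyncio.run(main()))
```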
This improvement will be included in the upcoming release.
Someone mentioned this above:

> rather this is happening with WebSocket when we do web-calls
We strongly recommend using a WebRTC transport for client/server apps—for this reason and many others!
Getting the same with 0.0.95. I'm using LocalSmartTurnAnalyzerV3 and SileroVADAnalyzer(params=VADParams(stop_secs=0.2)) with Twilio via WebSockets. It breaks ElevenLabs STT with a manual commit_strategy, because it waits for an explicit UserStoppedSpeakingFrame. It looks like Silero fails to detect that the user started speaking.
I have not seen this. My guess is that you're impacted by an issue with the ElevenLabsRealtimeSTTService because you're setting the sample rate. That's being fixed here: https://github.com/pipecat-ai/pipecat/pull/3106
The VAD works reliably in my experience. I've never seen the VAD not emit an event indicating the user stopped speaking.
In fact, I'm closing this issue because it lacks a repro and recent activity. If someone has a repro case for this, we can investigate.