
Why does the Smart Turn Analyzer have USE_ONLY_LAST_VAD_SEGMENT=True?

Open BowerJames opened this issue 1 month ago • 5 comments

pipecat version

0.0.91

Python version

3.12.2

Operating System

macOS 26.0.1

Question

I am confused as to why USE_ONLY_LAST_VAD_SEGMENT is set to True in pipecat.audio.turn.smart_turn.base_smart_turn.

More specifically, I am confused as to why this is the case and non-configurable for the LocalSmartTurnAnalyzerV3.

From what I can see, the only effect this has is to force the audio buffer to be cleared when the VAD goes quiet, even if the turn analyzer determined that the turn was not over.

The smart-turn repository specifically says:

Smart Turn takes 16kHz PCM audio as input. Up to 8 seconds of audio is supported, and we recommend providing the full audio of the user's current turn.

and

If additional speech is detected from the user before Smart Turn has finished executing, re-run Smart Turn on the entire turn recording, including the new audio, rather than just the new segment. Smart Turn works best when given sufficient context, and is not designed to run on very short audio segments.

This behaviour seems to directly contradict that advice. Could someone explain what I am missing?

What I've tried

I have looked through the documentation and not found any mention.

In the code it has the comment

# not exposing this for now yet until the model can handle it.
# use_only_last_vad_segment: bool = USE_ONLY_LAST_VAD_SEGMENT

But again, the documentation suggests that the smart-turn v3 model is meant to work with this set to False.

Context

No response

BowerJames avatar Nov 19 '25 14:11 BowerJames

USE_ONLY_LAST_VAD_SEGMENT means the last user segment (i.e. the audio between VADUserStartedSpeakingFrame and VADUserStoppedSpeakingFrame). This ensures that the latest sample is provided to the smart-turn model. How the speech is segmented depends on the VAD stop_secs param, which we recommend setting to a small value like 0.2 sec. In most cases this will capture the user's entire turn, but at a minimum their last spoken segment (e.g. a few words), so that the model can analyze it.
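To make the role of stop_secs concrete, here is a toy sketch (not pipecat's actual VAD code) of how a stop_secs threshold decides whether two bursts of speech land in one VAD segment or two:

```python
# Toy illustration (hypothetical, not pipecat's implementation): merge
# speech bursts into VAD segments, bridging silences shorter than stop_secs.

def vad_segments(bursts, stop_secs=0.2):
    """bursts: sorted list of (start, end) speech times in seconds.
    Returns segments where gaps shorter than stop_secs are bridged."""
    segments = []
    for start, end in bursts:
        if segments and start - segments[-1][1] < stop_secs:
            segments[-1] = (segments[-1][0], end)  # short pause: same segment
        else:
            segments.append((start, end))          # long pause: new segment
    return segments

# A 0.9 s pause splits the turn into two VAD segments...
print(vad_segments([(0.0, 1.5), (2.4, 4.0)]))   # [(0.0, 1.5), (2.4, 4.0)]
# ...but a 0.15 s pause (below stop_secs=0.2) does not.
print(vad_segments([(0.0, 1.5), (1.65, 4.0)]))  # [(0.0, 4.0)]
```

With a small stop_secs, even a brief hesitation ends the segment, which is why the "last VAD segment" may be much shorter than the full turn.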

Can you explain what you're trying to accomplish?

markbackman avatar Nov 19 '25 16:11 markbackman

Hi,

Thanks for your response. This largely came from a learning exercise: I was trying to familiarise myself with pipecat and how its turn detection works. I was looking at the situation in which I say something along the lines of:

"Sure my number is um.. 07123456789".

In this case the VAD will start when I begin speaking and signal that I have stopped speaking around the "um" pause. Then the turn detection will run and come back with the turn incomplete. Importantly, at this point, because USE_ONLY_LAST_VAD_SEGMENT=True, the audio buffer in the turn analyzer is cleared.

Then, when I start saying the first number (0) before stop_secs has elapsed, the VAD will catch it and continue the user's turn. But the turn analyzer now has a fresh audio buffer, which means the buffer no longer holds the full audio of the turn, just the last period of continuous speech.

To illustrate this, see below the logs with the length of the audio buffer printed when USE_ONLY_LAST_VAD_SEGMENT=True:

2025-11-20 09:04:30.815 | TRACE    | pipecat.audio.turn.smart_turn.base_smart_turn:_process_speech_segment:238 - Prediction: Incomplete
2025-11-20 09:04:30.815 | TRACE    | pipecat.audio.turn.smart_turn.base_smart_turn:_process_speech_segment:241 - Probability of complete: 0.0170
2025-11-20 09:04:30.815 | TRACE    | pipecat.audio.turn.smart_turn.base_smart_turn:_process_speech_segment:242 - Inference time: 0.00ms
2025-11-20 09:04:30.815 | TRACE    | pipecat.audio.turn.smart_turn.base_smart_turn:_process_speech_segment:243 - Server total time: 0.00ms
2025-11-20 09:04:30.815 | TRACE    | pipecat.audio.turn.smart_turn.base_smart_turn:_process_speech_segment:244 - E2E processing time: 57.77ms
2025-11-20 09:04:30.815 | DEBUG    | pipecat.audio.turn.smart_turn.base_smart_turn:analyze_end_of_turn:165 - End of Turn result: EndOfTurnState.INCOMPLETE
--------------------------------
Audio buffer length: 0
--------------------------------
2025-11-20 09:04:31.654 | TRACE    | pipecat.audio.turn.smart_turn.base_smart_turn:_process_speech_segment:238 - Prediction: Complete
2025-11-20 09:04:31.654 | TRACE    | pipecat.audio.turn.smart_turn.base_smart_turn:_process_speech_segment:241 - Probability of complete: 0.9436
2025-11-20 09:04:31.654 | TRACE    | pipecat.audio.turn.smart_turn.base_smart_turn:_process_speech_segment:242 - Inference time: 0.00ms
2025-11-20 09:04:31.654 | TRACE    | pipecat.audio.turn.smart_turn.base_smart_turn:_process_speech_segment:243 - Server total time: 0.00ms
2025-11-20 09:04:31.654 | TRACE    | pipecat.audio.turn.smart_turn.base_smart_turn:_process_speech_segment:244 - E2E processing time: 25.73ms
2025-11-20 09:04:31.654 | DEBUG    | pipecat.audio.turn.smart_turn.base_smart_turn:analyze_end_of_turn:165 - End of Turn result: EndOfTurnState.COMPLETE
--------------------------------
Audio buffer length: 0
--------------------------------
2025-11-20 09:04:33.434 | TRACE    | pipecat.audio.turn.smart_turn.base_smart_turn:_process_speech_segment:238 - Prediction: Complete
2025-11-20 09:04:33.434 | TRACE    | pipecat.audio.turn.smart_turn.base_smart_turn:_process_speech_segment:241 - Probability of complete: 0.9756
2025-11-20 09:04:33.434 | TRACE    | pipecat.audio.turn.smart_turn.base_smart_turn:_process_speech_segment:242 - Inference time: 0.00ms
2025-11-20 09:04:33.434 | TRACE    | pipecat.audio.turn.smart_turn.base_smart_turn:_process_speech_segment:243 - Server total time: 0.00ms
2025-11-20 09:04:33.434 | TRACE    | pipecat.audio.turn.smart_turn.base_smart_turn:_process_speech_segment:244 - E2E processing time: 15.44ms
2025-11-20 09:04:33.435 | DEBUG    | pipecat.audio.turn.smart_turn.base_smart_turn:analyze_end_of_turn:165 - End of Turn result: EndOfTurnState.COMPLETE
--------------------------------
Audio buffer length: 0
--------------------------------

Compare this to when it is set to False:

--------------------------------
Audio buffer length: 0
--------------------------------
2025-11-20 11:11:00.753 | TRACE    | pipecat.audio.turn.smart_turn.base_smart_turn:_process_speech_segment:238 - Prediction: Incomplete
2025-11-20 11:11:00.753 | TRACE    | pipecat.audio.turn.smart_turn.base_smart_turn:_process_speech_segment:241 - Probability of complete: 0.0327
2025-11-20 11:11:00.753 | TRACE    | pipecat.audio.turn.smart_turn.base_smart_turn:_process_speech_segment:242 - Inference time: 0.00ms
2025-11-20 11:11:00.753 | TRACE    | pipecat.audio.turn.smart_turn.base_smart_turn:_process_speech_segment:243 - Server total time: 0.00ms
2025-11-20 11:11:00.753 | TRACE    | pipecat.audio.turn.smart_turn.base_smart_turn:_process_speech_segment:244 - E2E processing time: 13.60ms
2025-11-20 11:11:00.753 | DEBUG    | pipecat.audio.turn.smart_turn.base_smart_turn:analyze_end_of_turn:165 - End of Turn result: EndOfTurnState.INCOMPLETE
--------------------------------
Audio buffer length: 6
--------------------------------
2025-11-20 11:11:01.379 | TRACE    | pipecat.audio.turn.smart_turn.base_smart_turn:_process_speech_segment:238 - Prediction: Incomplete
2025-11-20 11:11:01.379 | TRACE    | pipecat.audio.turn.smart_turn.base_smart_turn:_process_speech_segment:241 - Probability of complete: 0.1288
2025-11-20 11:11:01.379 | TRACE    | pipecat.audio.turn.smart_turn.base_smart_turn:_process_speech_segment:242 - Inference time: 0.00ms
2025-11-20 11:11:01.379 | TRACE    | pipecat.audio.turn.smart_turn.base_smart_turn:_process_speech_segment:243 - Server total time: 0.00ms
2025-11-20 11:11:01.379 | TRACE    | pipecat.audio.turn.smart_turn.base_smart_turn:_process_speech_segment:244 - E2E processing time: 13.93ms
2025-11-20 11:11:01.379 | DEBUG    | pipecat.audio.turn.smart_turn.base_smart_turn:analyze_end_of_turn:165 - End of Turn result: EndOfTurnState.INCOMPLETE
--------------------------------
Audio buffer length: 15
--------------------------------
2025-11-20 11:11:02.253 | TRACE    | pipecat.audio.turn.smart_turn.base_smart_turn:_process_speech_segment:238 - Prediction: Complete
2025-11-20 11:11:02.253 | TRACE    | pipecat.audio.turn.smart_turn.base_smart_turn:_process_speech_segment:241 - Probability of complete: 0.8594
2025-11-20 11:11:02.253 | TRACE    | pipecat.audio.turn.smart_turn.base_smart_turn:_process_speech_segment:242 - Inference time: 0.00ms
2025-11-20 11:11:02.253 | TRACE    | pipecat.audio.turn.smart_turn.base_smart_turn:_process_speech_segment:243 - Server total time: 0.00ms
2025-11-20 11:11:02.254 | TRACE    | pipecat.audio.turn.smart_turn.base_smart_turn:_process_speech_segment:244 - E2E processing time: 28.51ms
2025-11-20 11:11:02.254 | DEBUG    | pipecat.audio.turn.smart_turn.base_smart_turn:analyze_end_of_turn:165 - End of Turn result: EndOfTurnState.COMPLETE
--------------------------------
Audio buffer length: 0
--------------------------------

You can see that in the second run the audio buffer is maintained for the length of the turn and only reset at the end of the turn, which seems more in line with the intended use of the smart-turn analyzer.
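The difference between the two runs can be boiled down to a small sketch (hypothetical, not pipecat's actual BaseSmartTurn): clearing the buffer after each incomplete prediction means the model only ever sees the most recent VAD segment, while keeping it accumulates the whole turn.

```python
# Toy comparison of the two buffering strategies. Each "segment" stands in
# for one VAD segment of the user's turn.

def run_turn(segments, clear_on_incomplete):
    buffer = []
    analyzed = []
    for seg in segments:
        buffer.extend(seg)          # append the latest VAD segment
        analyzed = list(buffer)     # audio handed to the smart-turn model
        if clear_on_incomplete:
            buffer.clear()          # USE_ONLY_LAST_VAD_SEGMENT=True behaviour
    return analyzed                 # what the model saw on the final check

turn = [["sure", "my", "number", "is", "um"], ["07123456789"]]

# Buffer cleared each time: the final prediction sees only the digits.
print(run_turn(turn, clear_on_incomplete=True))   # ['07123456789']
# Buffer kept: the final prediction sees the full turn.
print(run_turn(turn, clear_on_incomplete=False))
```

This mirrors the "Audio buffer length: 0" vs "Audio buffer length: 6/15" lines in the logs above.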

BowerJames avatar Nov 20 '25 11:11 BowerJames

Right. The idea is that the smart-turn model was trained on audio segments of 2-8 sec (IIRC), and the content of the audio is irrelevant; it's the audio frequency/pattern itself that matters to the model. So the sequence of numbers in your case (e.g. 07123456789) is what the model uses to determine whether an end of turn has occurred. The model is trained on inputs along these lines.

Tagging @marcus-daily to correct me and/or supplement my response.

markbackman avatar Nov 23 '25 16:11 markbackman

Hi, I think I understand. However, I don't believe this answers why pipecat applies the smart-turn v3 model in a manner that goes against the recommendations in the docs at https://github.com/pipecat-ai/smart-turn.

It clearly says in the readme:

Notes on input format
Smart Turn takes 16kHz PCM audio as input. Up to 8 seconds of audio is supported, and we recommend providing the full audio of the user's current turn.

The model is designed to be used in conjunction with a lightweight VAD model such as Silero. Once the VAD model detects silence, run Smart Turn on the entire recording of the user's turn, truncating from the beginning to shorten the audio to around 8 seconds if necessary.

If additional speech is detected from the user before Smart Turn has finished executing, re-run Smart Turn on the entire turn recording, including the new audio, rather than just the new segment. Smart Turn works best when given sufficient context, and is not designed to run on very short audio segments.

Note that audio from previous turns does not need to be included.

It says the entire turn's recording should be passed to the model (with truncation if required), and specifically that if speech is detected again before the turn is over you should "re-run Smart Turn on the entire turn recording, including the new audio, rather than just the new segment". I have demonstrated above that with USE_ONLY_LAST_VAD_SEGMENT=True this is not what happens: only the new audio is used.
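For reference, the README's truncation advice ("truncating from the beginning to shorten the audio to around 8 seconds if necessary") amounts to something like the following sketch; prepare_turn_audio is a hypothetical helper, not a pipecat or smart-turn API:

```python
# Hedged sketch of the smart-turn README recommendation: run inference on
# the entire turn recording, truncated from the beginning to at most
# 8 seconds of 16 kHz PCM.

SAMPLE_RATE = 16000  # smart-turn expects 16 kHz PCM input
MAX_SECS = 8         # up to 8 seconds of audio is supported

def prepare_turn_audio(turn_samples):
    max_samples = SAMPLE_RATE * MAX_SECS
    return turn_samples[-max_samples:]  # keep the most recent 8 seconds

# A 12-second turn is truncated to its final 8 seconds before inference.
audio = [0] * (SAMPLE_RATE * 12)
assert len(prepare_turn_audio(audio)) == SAMPLE_RATE * 8
```

The point is that truncation trims old audio from the front of the whole turn, which is quite different from discarding everything before the most recent VAD segment.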

If this is a deliberate design choice, I am curious what the reason is for going against the recommendation of the model's repository, but I am happy for the issue to be closed.

BowerJames avatar Nov 24 '25 09:11 BowerJames

Agreed, I think we should set USE_ONLY_LAST_VAD_SEGMENT to False by default. As much context from the current turn as possible should be provided to the model. BaseSmartTurn was written before that README, so the recommendations may not have been clear.

marcus-daily avatar Nov 24 '25 13:11 marcus-daily

I've opened a PR to address this: https://github.com/pipecat-ai/pipecat/pull/3183

marcus-daily avatar Dec 04 '25 11:12 marcus-daily