support using VAD with a streaming STT

Open longcw opened this issue 2 weeks ago • 6 comments

add STTCapabilities.flush to indicate whether the STT supports flush (manual commit), and make stt.StreamAdapter work with streaming STT.

use cases:

  1. only send audio frames to STT when VAD detects user speech.
  2. support manual commit for ElevenLabs Scribe v2, fixes https://github.com/livekit/agents/pull/3909#pullrequestreview-3481227767
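
A minimal sketch of how a flush capability could gate a manual commit at end of speech. Only the `flush` flag comes from the description above; `SketchCapabilities`, `FakeStream`, and `on_end_of_speech` are hypothetical names for illustration and are not the actual livekit-agents implementation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class SketchCapabilities:
    """Hypothetical stand-in for STTCapabilities; only `flush` is from the PR text."""

    streaming: bool = True
    flush: bool = False  # True if the STT accepts a manual commit


@dataclass
class FakeStream:
    """Records calls so the control flow is visible; not a real STT stream."""

    calls: List[str] = field(default_factory=list)

    def push_frame(self, frame: bytes) -> None:
        self.calls.append("push")

    def flush(self) -> None:
        self.calls.append("flush")


def on_end_of_speech(stream: FakeStream, caps: SketchCapabilities) -> None:
    """When VAD reports end of speech, commit only if the STT supports it."""
    if caps.flush:
        stream.flush()
```

With `flush=True` the adapter commits the segment itself; otherwise it leaves segmentation to the STT.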

related to https://github.com/livekit/agents/issues/3881, should be merged to main once https://github.com/livekit/agents/pull/4041 is done

longcw avatar Nov 21 '25 06:11 longcw

Something is off here. Whenever I take a pause longer than a few seconds, the connection throws an APIStatusError(message="ElevenLabs STT connection closed unexpectedly") with a WSMessage(type=<WSMsgType.CLOSE: 8>, data=1000, extra=''), but it doesn't happen with the non-wrapped version.

it seems the ElevenLabs STT has a timeout on audio input, so we may need an option to always send audio to the STT.

update: added a silence_mode: Literal["drop", "zeros", "passthrough"] option to send the original or zero-filled frames when VAD is negative. it's true that not every STT supports discontinuous audio frames.
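
A rough sketch of what the three modes could mean for a single audio frame. The mode names come from the option above; the `gate_frame` helper and the use of raw PCM bytes are assumptions for illustration, not the adapter's real code.

```python
from typing import Literal, Optional

SilenceMode = Literal["drop", "zeros", "passthrough"]


def gate_frame(
    frame: bytes, is_speech: bool, silence_mode: SilenceMode
) -> Optional[bytes]:
    """Return what to forward to the STT for one PCM frame (hypothetical helper)."""
    if is_speech or silence_mode == "passthrough":
        return frame  # forward the original audio unchanged
    if silence_mode == "zeros":
        return bytes(len(frame))  # same length, all-zero samples
    if silence_mode == "drop":
        return None  # send nothing while VAD is negative
    raise ValueError(f"unknown silence_mode: {silence_mode!r}")
```

"drop" saves the most bandwidth but produces gaps in the audio, which is what seems to trigger the input timeout here; "zeros" and "passthrough" keep the stream continuous.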

longcw avatar Nov 21 '25 10:11 longcw

Thanks for adding that option this quickly. However, I don't think it works well with 11labs: I am getting this:

    11:15:00 DEBUG  livekit.plugins… Received message type partial_transcript: {'message_type': 'partial_transcript', 'text': '*static*'}  
    11:15:03 DEBUG  livekit.plugins… Received message type partial_transcript: {'message_type': 'partial_transcript', 'text': '*static*'}  
    11:15:04 DEBUG  livekit.plugins… Received message type partial_transcript: {'message_type': 'partial_transcript', 'text': '*static*'}  
    11:15:07 DEBUG  livekit.plugins… Received message type partial_transcript: {'message_type': 'partial_transcript', 'text': '*static*'}  
    11:15:08 DEBUG  livekit.plugins… Received message type partial_transcript: {'message_type': 'partial_transcript', 'text': '*static*'} 

with the zero-filled silence (silence_mode="zeros").

chenghao-mou avatar Nov 21 '25 11:11 chenghao-mou

I think that's an ElevenLabs issue: even with the audio passed through, it may generate either these tags or some random characters if there is slight background noise.

when we enable interruption from interim transcripts, this actually breaks the agent playout, and for now I don't think there is a good solution. I expect they will improve their VAD model or fix this.

longcw avatar Nov 21 '25 11:11 longcw

Yeah, I agree. Should we just add a warning somewhere in the example or readme? I think it is totally fine to have the implementation available.

chenghao-mou avatar Nov 21 '25 11:11 chenghao-mou

@chenghao-mou update: ~it seems the ElevenLabs STT works when server VAD is disabled~ it's better when server VAD is disabled, but I still sometimes get random output from the STT, because it generates text whether the input is silent or just noise.

    stt=stt.StreamAdapter(
        stt=elevenlabs.STT(
            use_realtime=True,
            server_vad=None,  # disable server-side VAD
            language_code="en",
        ),
        vad=ctx.proc.userdata["vad"],
        use_streaming=True,
    ),

you can test it with this example https://github.com/livekit/agents/blob/longc/stream-stt-flush/examples/other/elevenlab_scribe_v2.py

longcw avatar Nov 21 '25 11:11 longcw

Yes, that was how I tested. It just hallucinates a lot no matter which options I try.

chenghao-mou avatar Nov 21 '25 11:11 chenghao-mou