support using VAD with a streaming STT
Add STTCapabilities.flush to indicate whether the STT supports flush (manual commit), and make stt.StreamAdapter work with streaming STT.
use cases:
- only send audio frames to STT when VAD detects user speech.
- support manual commit for ElevenLabs Scribe v2; fixes https://github.com/livekit/agents/pull/3909#pullrequestreview-3481227767
Related to https://github.com/livekit/agents/issues/3881. Should be merged to main once https://github.com/livekit/agents/pull/4041 is done.
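For readers skimming the thread, the two use cases above can be sketched roughly as follows. This is not the code in this PR: FakeStream, supports_flush, and gate_frames are hypothetical stand-ins for a streaming STT session and the adapter logic, and the VAD decision is assumed to arrive as a boolean per frame.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Tuple


@dataclass
class FakeStream:
    """Hypothetical stand-in for a streaming STT session; it only records calls."""

    supports_flush: bool = True  # assumed to mirror a capability flag such as STTCapabilities.flush
    events: List[str] = field(default_factory=list)

    def push_frame(self, frame: bytes) -> None:
        self.events.append("frame")

    def flush(self) -> None:
        self.events.append("flush")


def gate_frames(frames: Iterable[Tuple[bytes, bool]], stream: FakeStream) -> None:
    """Forward only frames the VAD marked as speech; flush once when speech ends."""
    was_speaking = False
    for frame, is_speech in frames:
        if is_speech:
            stream.push_frame(frame)
        elif was_speaking and stream.supports_flush:
            # Speech just ended: ask the STT to commit the current segment.
            stream.flush()
        was_speaking = is_speech
    # Commit a segment that was still open when the input ended.
    if was_speaking and stream.supports_flush:
        stream.flush()
```

With frames marked (silence, speech, speech, silence, silence), the stream receives two frames followed by a single flush.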
Something is off here. Whenever I pause for longer than a few seconds, the connection throws an
APIStatusError(message="ElevenLabs STT connection closed unexpectedly") with a WSMessage(type=<WSMsgType.CLOSE: 8>, data=1000, extra=''), but it doesn't happen with the non-wrapped version.
It seems the ElevenLabs STT has a timeout on audio input; maybe we need an option to always send audio to the STT.
Update: added a silence_mode: Literal["drop", "zeros", "passthrough"] option to send the original or zero-filled frames when VAD is negative. It's true that not every STT supports discontinuous audio frames.
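To make the three modes concrete, here is a minimal sketch of how such an option could behave per frame. It is an illustration under assumptions, not the adapter's actual implementation: apply_silence_mode is a hypothetical helper, and frames are treated as raw PCM bytes.

```python
from typing import Literal, Optional

SilenceMode = Literal["drop", "zeros", "passthrough"]


def apply_silence_mode(frame: bytes, is_speech: bool, mode: SilenceMode) -> Optional[bytes]:
    """Return the bytes to send to the STT for one frame, or None to send nothing."""
    if is_speech or mode == "passthrough":
        # Speech is always forwarded; passthrough also forwards non-speech audio unchanged.
        return frame
    if mode == "zeros":
        # Keep the stream continuous with a zero-filled frame of the same length.
        return bytes(len(frame))
    if mode == "drop":
        # Send nothing while the VAD is negative.
        return None
    raise ValueError(f"unknown silence_mode: {mode!r}")
```

In this sketch "drop" is the only mode that leaves gaps in the audio stream, which is the case that runs into the input timeout described above.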
Thanks for adding that option this quickly. However, I don't think it works well with 11labs: I am getting this:
11:15:00 DEBUG livekit.plugins… Received message type partial_transcript: {'message_type': 'partial_transcript', 'text': '*static*'}
11:15:03 DEBUG livekit.plugins… Received message type partial_transcript: {'message_type': 'partial_transcript', 'text': '*static*'}
11:15:04 DEBUG livekit.plugins… Received message type partial_transcript: {'message_type': 'partial_transcript', 'text': '*static*'}
11:15:07 DEBUG livekit.plugins… Received message type partial_transcript: {'message_type': 'partial_transcript', 'text': '*static*'}
11:15:08 DEBUG livekit.plugins… Received message type partial_transcript: {'message_type': 'partial_transcript', 'text': '*static*'}
with the zero silence.
I think that's an issue on the ElevenLabs side: even when the audio is passed through, it may generate either these tags or some random characters if there is slight background noise.
With interruption from interim transcripts enabled, this actually breaks the agent playout, and for now I don't think there is a good solution. I expect they will improve their VAD model or fix this.
Yeah, I agree. Should we just add a warning somewhere in the example or readme? I think it is totally fine to have the implementation available.
@chenghao-mou update: ~it seems the ElevenLabs STT works when server VAD is disabled~ It's better when server VAD is disabled, but I still sometimes get random output from the STT, because it generates text even when the input is silent or just noise.
stt=stt.StreamAdapter(
    stt=elevenlabs.STT(
        use_realtime=True,
        server_vad=None,  # disable server-side VAD
        language_code="en",
    ),
    vad=ctx.proc.userdata["vad"],
    use_streaming=True,
),
you can test it with this example https://github.com/livekit/agents/blob/longc/stream-stt-flush/examples/other/elevenlab_scribe_v2.py
Yes, that was how I tested. It just hallucinates a lot no matter what options I tried.