
Allow stt suppression by vad

Open preciselyV opened this issue 2 weeks ago • 1 comments

As discussed in #3918, some STTs may misfire, producing false speech recognition. To fix this, @chenghao-mou suggested rewriting StreamAdapter so that it works with stream-capable STTs and only sends STT events if VADEvent.START_OF_SPEECH was generated.

Implementation is a bit different from the suggestion:

  1. Input chunks are always sent via push_frame() to both the STT and VAD streams, rather than being sent to the STT only after the corresponding VAD event. I went this route because some STTs may cache previously received frames to improve prediction, and starting to send only after the VAD event is guaranteed to "eat" at least one chunk of user speech. I don't want to hurt accuracy at all.
  2. Instead of calling stt._recognize_impl() and relying on NotImplementedError to detect stream-only STTs, we check stt.capabilities and a new StreamAdapter parameter called force_stream. Calling stt._recognize_impl() when it is implemented would introduce an unnecessary API call, and since that call would happen during initialization, it would affect performance. It's also better to let the user decide which mode they'd like to use.
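
To illustrate point 1, here is a minimal sketch of the gating idea. It is not the actual StreamAdapter code: SttGate, VadEventType, and the method names are hypothetical stand-ins. It assumes frames keep flowing to both streams elsewhere and that only the STT events are filtered. Closing the gate on END_OF_SPEECH is also an assumption made for the sketch, and a real adapter would likely need to let late final transcripts through.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class VadEventType(Enum):
    # Hypothetical stand-in for the VAD event types mentioned above.
    START_OF_SPEECH = auto()
    END_OF_SPEECH = auto()


@dataclass
class SttGate:
    """Forwards STT events only while VAD reports that speech is in progress."""

    speaking: bool = False
    forwarded: list = field(default_factory=list)

    def on_vad_event(self, event_type: VadEventType) -> None:
        # Open the gate on START_OF_SPEECH; close it on END_OF_SPEECH
        # (closing here is an assumption of this sketch).
        if event_type is VadEventType.START_OF_SPEECH:
            self.speaking = True
        elif event_type is VadEventType.END_OF_SPEECH:
            self.speaking = False

    def on_stt_event(self, transcript: str) -> bool:
        # The STT has already received every frame; only its output is gated.
        if not self.speaking:
            return False  # treated as a misfire and dropped
        self.forwarded.append(transcript)
        return True
```

Because the STT stream still sees every frame, any context it caches stays intact, and only events that arrive without a preceding START_OF_SPEECH are suppressed.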

By default, force_stream=False in both StreamAdapter and StreamAdapterWrapper for backward compatibility. Unless it is set to True, the old logic is used, so nothing changes for anyone who already relies on it.
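
The mode selection can be sketched roughly as below. This is an illustration under assumptions, not the code from the change: Capabilities and use_streaming_path are hypothetical names, and silently falling back to the old logic when force_stream=True is requested for an STT that cannot stream is a choice made only for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capabilities:
    # Hypothetical stand-in for stt.capabilities; only the streaming flag is modelled.
    streaming: bool


def use_streaming_path(capabilities: Capabilities, force_stream: bool = False) -> bool:
    """Return True if the adapter should use the STT's own stream, gated by VAD.

    With the default force_stream=False the old logic is kept, so existing
    users see no change. No recognize call is made to probe the STT.
    """
    # Assumption: an STT that cannot stream falls back to the old logic.
    return force_stream and capabilities.streaming
```

For example, use_streaming_path(Capabilities(streaming=True)) returns False, while passing force_stream=True for the same capabilities returns True.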

preciselyV avatar Nov 17 '25 08:11 preciselyV