Allow STT suppression by VAD
As discussed in #3918, some STTs may misfire and produce false speech recognition. To fix this, @chenghao-mou suggested rewriting `StreamAdapter` so that it can work with streaming-capable STTs and only forward the STT's events once `VADEvent.START_OF_SPEECH` has been generated.
The implementation differs a bit from the suggestion:
- Input chunks are always sent via `push_frame()` to both the STT and VAD streams, instead of only starting to send to the STT after the corresponding VAD events. I went this route because some STTs may cache previously received frames to improve prediction, and sending only after the VAD event is guaranteed to "eat" at least one chunk of user speech. I don't want to hurt accuracy at all.
- Instead of calling `stt._recognize_impl()` and waiting for a `NotImplementedError` from stream-only STTs, we check `stt.capabilities` and a new `StreamAdapter` parameter called `force_stream`. Calling `stt._recognize_impl()` when it is implemented would introduce an unnecessary API call, and since that call would happen during initialization, it would affect performance. It's also better to let users decide which mode they'd like to use.
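To make the first point concrete, here is a minimal sketch of the gating idea, not the code in this PR. `GatedForwarder`, `VadSignal`, and all method names other than `push_frame()` are hypothetical stand-ins. Closing the gate again on end of speech is an assumption of the sketch; a real adapter would probably need to let trailing final transcripts through.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class VadSignal(Enum):
    # Hypothetical stand-ins for the VAD events mentioned above.
    START_OF_SPEECH = auto()
    END_OF_SPEECH = auto()


@dataclass
class GatedForwarder:
    """Feeds every frame to both streams, but gates STT events on VAD state."""

    stt_frames: list = field(default_factory=list)
    vad_frames: list = field(default_factory=list)
    forwarded: list = field(default_factory=list)
    speaking: bool = False

    def push_frame(self, frame):
        # Frames always reach both consumers, so the STT keeps its context.
        self.stt_frames.append(frame)
        self.vad_frames.append(frame)

    def on_vad(self, signal):
        if signal is VadSignal.START_OF_SPEECH:
            self.speaking = True
        elif signal is VadSignal.END_OF_SPEECH:
            # Assumption: the gate closes again when speech ends.
            self.speaking = False

    def on_stt_event(self, event):
        # Drop STT output unless the VAD has reported speech.
        if self.speaking:
            self.forwarded.append(event)
            return True
        return False
```

With this shape, an STT event that arrives before any `START_OF_SPEECH` is dropped, while the STT has still received every frame pushed so far.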
By default, `force_stream=False` in both `StreamAdapter` and `StreamAdapterWrapper` for backward compatibility. Unless it is set to `True`, the old logic is used, so nothing changes for anyone who was already relying on it.
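The mode decision can be sketched roughly as follows. This is an illustration only: `Capabilities` and `use_streaming_path` are hypothetical names, and the description does not say what happens when `force_stream=True` is passed for an STT that cannot stream, so falling back to the old path in that case is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capabilities:
    # Hypothetical stand-in for stt.capabilities; only the field used here.
    streaming: bool


def use_streaming_path(capabilities: Capabilities, force_stream: bool = False) -> bool:
    """Return True when the adapter should drive the STT's own stream."""
    # Default (force_stream=False) keeps the legacy behaviour unchanged.
    # Assumption: the flag only takes effect when the STT can actually stream.
    return force_stream and capabilities.streaming
```

The check reads a flag and a capability field only, so no recognition call is made during initialization.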