speech_recognition
Add `stream=` kwarg to `Recognizer.listen`
Support for receiving captured audio one chunk at a time, while continuing to use the wakeword and audio energy detection code.
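The core pattern is a generator that yields chunks as soon as the energy detector accepts them, rather than joining everything into one buffer at the end. Below is a minimal, library-independent sketch of that pattern; `stream_speech`, the threshold value, and the silence-padding behavior are illustrative assumptions, not the actual implementation in this PR.

```python
import math
import struct

def rms(chunk):
    # Root-mean-square energy of a 16-bit little-endian mono PCM chunk.
    samples = struct.unpack("<%dh" % (len(chunk) // 2), chunk)
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def stream_speech(chunks, energy_threshold=500, max_silence_chunks=2):
    """Yield audio chunks one at a time while speech energy persists.

    Sketch of the idea behind listen(stream=...): chunks are produced
    as they arrive instead of being concatenated into a single result
    once the phrase ends.
    """
    silence = 0
    started = False
    for chunk in chunks:
        if rms(chunk) >= energy_threshold:
            started = True
            silence = 0
            yield chunk
        elif started:
            silence += 1
            if silence > max_silence_chunks:
                return  # sustained trailing silence: end of phrase
            yield chunk  # keep a little padding after speech
```

A consumer can begin transcribing the first chunk while later chunks are still being captured, which is where the latency win comes from.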
Notably, Coqui STT and DeepSpeech (the Python STT packages) support a streaming interface, which greatly improves interaction latency for continuous-listening applications. Even for non-streaming backends, this change allows eager encoding (for example, converting chunks to NumPy buffers as they arrive, or even precomputing transformer KVs), or simply an earlier start to transmission (when using WebSockets or other chunked-transfer mechanisms).
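As a concrete example of eager encoding, each raw chunk can be decoded into samples the moment it arrives, overlapping decode work with capture. This is a stdlib-only sketch; `eager_decode` is a hypothetical helper, not part of this PR or the library.

```python
import array

def eager_decode(chunk_iter):
    """Decode each raw 16-bit PCM chunk into int16 samples as it
    arrives, instead of waiting for the complete phrase. With a
    streaming audio source, later chunks are still being captured
    while earlier ones are already decoded (or already in flight
    over a chunked HTTP/WebSocket transfer).
    """
    for chunk in chunk_iter:
        samples = array.array("h")
        samples.frombytes(chunk)  # zero-copy-ish conversion to int16
        yield samples
```

The same loop body is where one could instead push each chunk onto a socket, so transmission starts on the first chunk rather than after the phrase ends.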
Note: This is a minimal extraction from a larger edit in a side project. There, I ended up carving up large chunks of `Recognizer` to make it more observable (triggering events on speech-detection start/stop in addition to yielding audio, as well as real-time events for the audio-energy threshold and detected value). This edit is much smaller, but I have not vetted it as thoroughly. I am in the process of adopting this change directly into a new project that uses self-hosted Whisper over HTTP.