Recognition: real-time, streaming Whisper recognition

Open rotemdan opened this issue 2 years ago • 0 comments

Tokens are already decoded and displayed live during Whisper decoding, at least on the CLI.

Getting Whisper to recognize in real time (or at least near real time) is possible. However:

- It's really important for me to get a low, usable latency. Preferably something that can be responsive enough for a real-time voice chat with a language model (along with low-latency synthesis, which is already mostly ready).
- That would require some planning and code reorganization to get right.
- Need to integrate an effective VAD (voice activity detection) strategy to cut the audio at the right places. Fortunately, Echogarden already has several working VAD implementations.
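
To illustrate what "cutting the audio at the right places" could look like, here is a minimal sketch of silence-based segmentation. It uses a plain per-frame energy threshold as a stand-in for a real VAD. It is not Echogarden's API and not any of its existing VAD implementations: `Segment`, `VadOptions`, `frameEnergy` and `segmentBySilence` are hypothetical names, and the threshold and frame values are arbitrary assumptions.

```typescript
// Hypothetical types and names for illustration only; not Echogarden's API.
interface Segment {
  startSample: number; // inclusive
  endSample: number;   // exclusive
}

interface VadOptions {
  frameSize: number;        // samples per analysis frame
  energyThreshold: number;  // mean-square energy above which a frame counts as speech
  minSilenceFrames: number; // consecutive silent frames needed to close a segment
}

// Mean-square energy of samples[start, end).
function frameEnergy(samples: Float32Array, start: number, end: number): number {
  let sum = 0;
  for (let i = start; i < end; i++) {
    sum += samples[i] * samples[i];
  }
  return end > start ? sum / (end - start) : 0;
}

// Splits audio into speech segments, cutting wherever a long enough
// run of low-energy frames follows speech.
function segmentBySilence(samples: Float32Array, opts: VadOptions): Segment[] {
  const segments: Segment[] = [];
  let segmentStart = -1; // -1 means we are not currently inside speech
  let lastSpeechEnd = 0;
  let silentFrames = 0;

  for (let start = 0; start < samples.length; start += opts.frameSize) {
    const end = Math.min(start + opts.frameSize, samples.length);
    const isSpeech = frameEnergy(samples, start, end) > opts.energyThreshold;

    if (isSpeech) {
      if (segmentStart < 0) segmentStart = start;
      lastSpeechEnd = end;
      silentFrames = 0;
    } else if (segmentStart >= 0) {
      silentFrames++;
      if (silentFrames >= opts.minSilenceFrames) {
        // Enough trailing silence: this segment could be sent to the recognizer now.
        segments.push({ startSample: segmentStart, endSample: lastSpeechEnd });
        segmentStart = -1;
        silentFrames = 0;
      }
    }
  }

  // Flush a segment that is still open when the audio ends.
  if (segmentStart >= 0) {
    segments.push({ startSample: segmentStart, endSample: lastSpeechEnd });
  }
  return segments;
}
```

In a streaming setup, the same loop would run incrementally over incoming frames, and `minSilenceFrames` becomes the main latency knob: a smaller value hands segments to the recognizer sooner but risks cutting in the middle of a phrase.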

Jul 27 '23 15:07 rotemdan