Guidance about settings for realtime STT on GPU
I primarily use this as an accessibility tool, and the closer to realtime my output is, the more effective the tool is for me, especially in meetings and events that I attend. I'd like guidance on the options that are best for my use case, as it's not obvious from the interface or from the documentation I've found.
I'm struggling to figure out which models/settings will produce output closest to realtime. https://github.com/abb128/LiveCaptions does a better job than I've been able to get with Speech Note (even with the same AprilASR model), so I'm sure there are better options that I can use or that Speech Note can implement, but LiveCaptions doesn't support GPU-accelerated speech recognition as far as I can tell.
Hi Alex,
To get realtime results, use models for engines that support "Intermediate Results". Currently, DeepSpeech/Coqui, Vosk and April all have this capability. I suspect you already know this. In my opinion, Vosk provides the best accuracy, but April is not bad either.
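For context, this is roughly what "Intermediate Results" mean at the engine level. A minimal sketch using the Vosk Python API (outside of Speech Note; the model path and the 16 kHz mono WAV file are assumptions for illustration): partial results arrive while a phrase is still being spoken, and the final result only lands once the recognizer decides the utterance has ended.

```python
# Minimal sketch: Vosk partial vs. final results (not Speech Note code).
# Assumes a downloaded Vosk model directory and a 16 kHz mono PCM WAV file.
import wave

from vosk import Model, KaldiRecognizer

wf = wave.open("speech.wav", "rb")            # hypothetical input file
model = Model("vosk-model-small-en-us-0.15")  # hypothetical model path
rec = KaldiRecognizer(model, wf.getframerate())

while True:
    data = wf.readframes(4000)
    if len(data) == 0:
        break
    if rec.AcceptWaveform(data):
        # Utterance finished: a "final" result.
        print("final:  ", rec.Result())
    else:
        # Utterance still in progress: the intermediate result
        # that makes realtime captioning feel responsive.
        print("partial:", rec.PartialResult())

print("final:  ", rec.FinalResult())
```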
> but LiveCaptions doesn't support GPU-accelerated speech recognition as far as I can tell
Actually, Speech Note supports GPU acceleration only for Whisper and Faster Whisper models. Inference for April, Vosk and DeepSpeech is always done on the CPU.
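For anyone curious what the GPU path looks like outside of Speech Note, here is a minimal sketch using the faster-whisper Python package; the model size, compute types and audio file are illustrative assumptions, not Speech Note's internal settings:

```python
# Minimal sketch: Faster Whisper on GPU with a CPU fallback (not Speech Note code).
# Model size, compute types and the audio file are illustrative assumptions.
from faster_whisper import WhisperModel

try:
    # GPU-accelerated inference (requires a working CUDA setup).
    model = WhisperModel("small", device="cuda", compute_type="float16")
except Exception:
    # Fall back to CPU inference, which is noticeably slower for Whisper.
    model = WhisperModel("small", device="cpu", compute_type="int8")

segments, info = model.transcribe("speech.wav", beam_size=5)
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```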
> LiveCaptions does a better job than I've been able to get with Speech Note
Unfortunately, there is no hidden option to speed up STT right now. Indeed, LiveCaptions uses exactly the same engine and models, so in theory there should be no difference 🤔. Perhaps VAD is the problem: Speech Note runs VAD processing before STT, and this might add extra delay... maybe.
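To illustrate why a VAD stage can add latency, here is a rough sketch of a VAD-gated pipeline using the webrtcvad package; the frame size, the silence threshold and the stt() callable are assumptions for illustration, not Speech Note's actual code. Audio is buffered until the VAD decides the speech segment has ended and only then handed to the recognizer, so the transcript always trails the end of the utterance:

```python
# Rough sketch of a VAD-gated pipeline (not Speech Note's actual code).
# webrtcvad expects 16-bit mono PCM in 10/20/30 ms frames at 8/16/32/48 kHz.
import webrtcvad

vad = webrtcvad.Vad(2)           # aggressiveness 0-3 (assumption)
SAMPLE_RATE = 16000
SILENCE_FRAMES_TO_END = 10       # ~300 ms of silence ends a segment (assumption)


def vad_gated_stt(frames, stt):
    """Buffer speech frames and call stt() only after a segment ends."""
    buffered, silence = [], 0
    for frame in frames:         # each frame: 30 ms of raw PCM bytes
        if vad.is_speech(frame, SAMPLE_RATE):
            buffered.append(frame)
            silence = 0
        elif buffered:
            silence += 1
            if silence >= SILENCE_FRAMES_TO_END:
                # The recognizer only sees the audio *after* the speech
                # segment is closed, which is where the extra delay comes from.
                yield stt(b"".join(buffered))
                buffered, silence = [], 0
    if buffered:
        yield stt(b"".join(buffered))
```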
I will investigate what can be done to make STT more real-time.
Just tested LiveCaptions and STT in that app is ridiculously fast. Wow, amazing!
> Inference for April, Vosk and DeepSpeech is always done on the CPU.
Is this true? It is so fast that I thought it was running on the GPU.
> Is this true? It is so fast that I thought it was running on the GPU.
That's right, only CPU. April, Vosk and DeepSpeech are fast without a GPU because they use different model architectures (usually older ones with significantly worse accuracy). Whisper is a large Transformer-based model that requires a lot of computing power, so it's slow without GPU acceleration.