Guidance about settings for realtime STT on GPU
I primarily use this as an accessibility tool, and the closer to realtime my output is, the more effective the tool is for me, especially in meetings and events that I attend. I'd like guidance on the options that are best for my use case, as it's not obvious from the interface or from the documentation I've found.
I'm struggling to figure out which models/settings will produce output closest to realtime. https://github.com/abb128/LiveCaptions does a better job than I've been able to get with Speech Note (even with the same AprilASR model), so I'm sure there are better options that I can use or that Speech Note can implement, but LiveCaptions doesn't support GPU-accelerated speech recognition as far as I can tell.
Hi Alex,
To get realtime results, use models for engines that support "Intermediate Results". Currently, DeepSpeech/Coqui, Vosk and April all have this capability. I suspect you already know this. In my opinion, Vosk provides the best accuracy, but April is not bad either.
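For context, this is roughly what "Intermediate Results" mean at the engine level. A minimal sketch using the Vosk Python API (outside of Speech Note; the model path and the 16 kHz mono WAV file are assumptions for illustration): partial results arrive while a phrase is still being spoken, and the final result only lands once the recognizer decides the utterance has ended.

```python
# Minimal sketch: Vosk partial vs. final results (not Speech Note code).
# Assumes a downloaded Vosk model directory and a 16 kHz mono PCM WAV file.
import wave

from vosk import Model, KaldiRecognizer

wf = wave.open("speech.wav", "rb")            # hypothetical input file
model = Model("vosk-model-small-en-us-0.15")  # hypothetical model path
rec = KaldiRecognizer(model, wf.getframerate())

while True:
    data = wf.readframes(4000)
    if len(data) == 0:
        break
    if rec.AcceptWaveform(data):
        # Utterance finished: a "final" result.
        print("final:  ", rec.Result())
    else:
        # Utterance still in progress: the intermediate result
        # that makes realtime captioning feel responsive.
        print("partial:", rec.PartialResult())

print("final:  ", rec.FinalResult())
```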
> but LiveCaptions doesn't support GPU-accelerated speech recognition as far as I can tell
Actually, Speech Note supports GPU acceleration only for Whisper and Faster Whisper models. Inference for April, Vosk and DeepSpeech is always done on the CPU.
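For anyone curious what the GPU path looks like outside of Speech Note, here is a minimal sketch using the faster-whisper Python package; the model size, compute types and audio file are illustrative assumptions, not Speech Note's internal settings:

```python
# Minimal sketch: Faster Whisper on GPU with a CPU fallback (not Speech Note code).
# Model size, compute types and the audio file are illustrative assumptions.
from faster_whisper import WhisperModel

try:
    # GPU-accelerated inference (requires a working CUDA setup).
    model = WhisperModel("small", device="cuda", compute_type="float16")
except Exception:
    # Fall back to CPU inference, which is noticeably slower for Whisper.
    model = WhisperModel("small", device="cpu", compute_type="int8")

segments, info = model.transcribe("speech.wav", beam_size=5)
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```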
> LiveCaptions does a better job than I've been able to get with Speech Note
Unfortunately, there is no hidden option to speed up STT right now. Indeed, LiveCaptions uses exactly the same engine and models, so in theory there should be no difference 🤔. Perhaps VAD is the problem: Speech Note runs VAD processing before STT, and this might add extra delay... maybe.
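To illustrate why a VAD stage can add latency, here is a rough sketch of a VAD-gated pipeline using the webrtcvad package; the frame size, the silence threshold and the stt() callable are assumptions for illustration, not Speech Note's actual code. Audio is buffered until the VAD decides the speech segment has ended and only then handed to the recognizer, so the transcript always trails the end of the utterance:

```python
# Rough sketch of a VAD-gated pipeline (not Speech Note's actual code).
# webrtcvad expects 16-bit mono PCM in 10/20/30 ms frames at 8/16/32/48 kHz.
import webrtcvad

vad = webrtcvad.Vad(2)           # aggressiveness 0-3 (assumption)
SAMPLE_RATE = 16000
SILENCE_FRAMES_TO_END = 10       # ~300 ms of silence ends a segment (assumption)


def vad_gated_stt(frames, stt):
    """Buffer speech frames and call stt() only after a segment ends."""
    buffered, silence = [], 0
    for frame in frames:         # each frame: 30 ms of raw PCM bytes
        if vad.is_speech(frame, SAMPLE_RATE):
            buffered.append(frame)
            silence = 0
        elif buffered:
            silence += 1
            if silence >= SILENCE_FRAMES_TO_END:
                # The recognizer only sees the audio *after* the speech
                # segment is closed, which is where the extra delay comes from.
                yield stt(b"".join(buffered))
                buffered, silence = [], 0
    if buffered:
        yield stt(b"".join(buffered))
```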
I will investigate what can be done to make STT more real-time.
Just tested LiveCaptions and STT in that app is ridiculously fast. Wow, amazing!
> Inference for April, Vosk and DeepSpeech is always done on the CPU.
Is this true? It is so fast that I thought it was running on the GPU.
> Is this true? It is so fast that I thought it was running on the GPU.
That's right, only CPU. April, Vosk and DeepSpeech are fast without a GPU because they use different model architectures (usually older ones with significantly worse accuracy). Whisper is a large Transformer-based model that requires a lot of computing power, so it's slow without GPU acceleration.