speech-recognition-experiments: Does Whisper CT2 (base model) achieve the same speed as Vosk (English large) on CPU?

Open fuhadabdulla opened this issue 2 years ago • 4 comments

Does Whisper CT2 (base model) achieve the same speed as Vosk (English large) on CPU only?

Mar 03 '23 00:03 fuhadabdulla

It's a bit tricky to answer, because Vosk has a real streaming mode with partial results. That means you don't have to wait until the user has finished speaking and only have to transcribe the last remaining chunk of audio, while Whisper basically starts transcribing AFTER the user has finished. So the short answer is: the longer you speak, the faster Vosk will be in comparison.
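To make the streaming argument concrete, here is a rough back-of-the-envelope sketch. It is only a simplified model, not a benchmark: the function names, the real-time factors (processing time divided by audio duration) and the chunk size are made-up assumptions for illustration, and it ignores things like end-of-speech detection delay.

```python
def wait_after_speech_batch(audio_s: float, rtf: float) -> float:
    """Wait time after the user stops speaking when transcription
    only starts at the end (Whisper-style): the whole clip is processed."""
    return audio_s * rtf


def wait_after_speech_streaming(audio_s: float, rtf: float, chunk_s: float) -> float:
    """Wait time when audio is decoded chunk by chunk while the user
    is still speaking (Vosk-style streaming): only the last chunk is left.
    Assumes rtf < 1, i.e. the decoder keeps up with real time."""
    if rtf >= 1.0:
        raise ValueError("this simple model assumes the decoder keeps up (rtf < 1)")
    remaining = min(chunk_s, audio_s)
    return remaining * rtf


if __name__ == "__main__":
    # Hypothetical numbers, NOT measurements of Whisper or Vosk.
    batch_rtf = 0.25      # assumed real-time factor of the batch engine
    stream_rtf = 0.5      # assumed real-time factor of the streaming engine
    chunk_s = 0.5         # assumed streaming chunk length in seconds

    for audio_s in (2.0, 10.0, 30.0):
        batch = wait_after_speech_batch(audio_s, batch_rtf)
        stream = wait_after_speech_streaming(audio_s, stream_rtf, chunk_s)
        print(f"{audio_s:5.1f}s audio: batch wait {batch:.2f}s, streaming wait {stream:.2f}s")
```

In this model the batch wait grows linearly with the length of the utterance, while the streaming wait stays constant, even though the assumed streaming engine is slower per second of audio.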

I haven't compared Whisper to Vosk in non-streaming mode yet. Maybe I'll add some tests for that.

Mar 03 '23 09:03 fquirin

Thank you for creating this comparison. Because of it, I tried out faster-whisper, and it is faster than whisper.cpp.

Mar 03 '23 09:03 fuhadabdulla

It is indeed, at least on ARM CPUs. You can follow the discussion about it here: https://github.com/ggerganov/whisper.cpp/issues/7#issuecomment-1447752474

It seems to be an optimization issue on ARM. Results on x86 (Intel/AMD) CPUs might look different, and whisper.cpp could catch up to the CT2 version there.

Mar 03 '23 10:03 fquirin

Hi @nyadla-sys, I wrote to you on Twitter via the SEPIA account 🙂

Mar 13 '23 12:03 fquirin
