whisper.cpp
How to get the best performance on a moderately old CPU and hardware?
Hello! Very nice project and port.
I am wondering, using an i7-6700 or an even older CPU, is there a possibility of getting near-realtime transcription? (That is, taking at most as long to compute as the audio takes to play.)
I am using the Windows binaries from the release page, and it takes about twice as long to generate the text as the audio length. I am looking to get that about even, if possible.
As an aside, it would be nice to make use of older hardware via CUDA. I have an older GPU here (a GT 730) which is not supported by PyTorch (it has Compute Capability 3.5, while PyTorch requires at least 3.7). I am wondering if some "generic" optimization could be implemented using CUDA, even without the fancy PyTorch stack. That would let my older server run this speech-to-text, and it could give many more people access to this wonderful tech (I'm thinking of vision-impaired people, for example, who are not rich enough to get the latest Apple M3).
Thanks!
Your i7-6700 has Intel® AVX2, so it should do OK...
But if it's too slow, AND you are running Windows anyway, AND you have a GPU in the machine, it might be worth checking out this Whisper implementation based on whisper.cpp: https://github.com/Const-me/Whisper
It uses the Windows GPU compute APIs (so I believe it works with any GPU card, not just NVIDIA/CUDA ones).
I've had VERY good results from it; it's just a pity it's Windows-only... I wish there were a Vulkan compute implementation :) (hint to anyone who has a clue about Vulkan)
Jay
Thanks for taking the time to answer me and for pointing me towards Const-me's approach! It is very nice to see work being done to bring this tech to all the computers in the world.
Sadly, for my older server, the CPU seems either too old or too underpowered to make it work (it appears to lack F16C support, bummer). I have yet to try it on my tower PC.
If anyone else has a suggestion on how I could go about running this on an old Pentium G3220, I'd love it :)
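For anyone in a similar spot, one way to check up front which SIMD features a CPU actually reports is to grep /proc/cpuinfo; a minimal sketch, assuming Linux with GNU grep:

```sh
# Print which of the relevant SIMD flags the CPU advertises (Linux / GNU grep).
# Haswell-era Pentiums such as the G3220 typically ship with AVX/AVX2/FMA/F16C disabled.
grep -o -w 'avx2\|avx\|fma\|f16c\|sse3\|ssse3' /proc/cpuinfo | sort -u
```

Whatever shows up (or doesn't) tells you which compile-time defines need to be switched off, which leads into the suggestion below.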
Try turning off the AVX2 and FMA and even the AVX defines at compile time if you have trouble. It worked for me.
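If it helps, here's roughly what that looks like with the CMake build; I believe whisper.cpp exposes options along these lines, but double-check CMakeLists.txt for the exact names in your version:

```sh
# Sketch: build whisper.cpp with the SIMD-related defines turned off.
# The WHISPER_NO_* option names are from memory; verify them against CMakeLists.txt.
cmake -B build -DWHISPER_NO_AVX2=ON -DWHISPER_NO_FMA=ON \
      -DWHISPER_NO_AVX=ON -DWHISPER_NO_F16C=ON
cmake --build build --config Release
```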
Great idea, I never thought of compiling it myself; I just used the binaries from the release page. Thanks for the info! I will try it soon and update this issue.
Yes, let us know. I run an i5 from 2013. The sample jfk.wav takes 4.6 seconds to encode on a single thread with the tiny.en model, and 2.6 seconds with 4 threads.
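For anyone wanting to reproduce numbers like these, something along these lines should work with the bundled main example (paths assume the default repo layout and that the model has already been downloaded):

```sh
# Transcribe the bundled sample with the tiny.en model: single thread vs. 4 threads.
./main -m models/ggml-tiny.en.bin -f samples/jfk.wav -t 1
./main -m models/ggml-tiny.en.bin -f samples/jfk.wav -t 4
```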
I have an i5-8265U and I'm quite amazed at the performance.
However... I noticed there's about 2.7 s of overhead with the base model (1.3 s with tiny) for a single word of input (<1 s of audio). I assume most of that is loading the model. A 10 s input sample of several sentences takes only 25% longer.
That's plenty fast for transcription, but quite a bit of latency for interactive use. I tried the stream example, but it didn't produce very good output text.
It would be great if whisper.cpp could be built as an HTTP server, maybe even implementing OpenAI's API protocol.