whisper_android

Realtime use possible?

Open eix128 opened this issue 2 years ago • 31 comments

There are some real-time Whisper libraries out there. Is there any way to make this library work in real time?

eix128 avatar Sep 30 '23 21:09 eix128

Yes, it is possible. We are working on capturing audio from the mic and transcribing it in real time.

vilassn avatar Oct 03 '23 05:10 vilassn
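For readers wondering what "capturing audio from the mic and transcribing it in real time" usually involves: one common approach is to buffer the microphone stream and hand overlapping windows of audio to the model. The sketch below shows only that buffering step. It is not this repository's implementation; the function name `sliding_windows`, the window and hop lengths, and the frame format (lists of float samples) are all assumptions made for illustration.

```python
from collections import deque
from typing import Iterable, Iterator, List

SAMPLE_RATE = 16_000  # Whisper models expect 16 kHz mono audio


def sliding_windows(
    frames: Iterable[List[float]],
    window_s: float = 5.0,
    hop_s: float = 1.0,
    sample_rate: int = SAMPLE_RATE,
) -> Iterator[List[float]]:
    """Yield overlapping windows of samples from a stream of audio frames.

    A window is emitted as soon as the buffer is full, and then again
    every `hop_s` seconds of newly received audio.
    """
    window = int(window_s * sample_rate)
    hop = int(hop_s * sample_rate)
    buf: deque = deque(maxlen=window)  # oldest samples fall off automatically
    since_last = 0  # samples received since the last emitted window
    for frame in frames:
        for sample in frame:
            buf.append(sample)
            since_last += 1
            if len(buf) == window and since_last >= hop:
                since_last = 0
                yield list(buf)
```

Each yielded window would then be passed to whatever transcription call the app uses. A shorter hop gives more frequent updates but means the model has to finish each inference within that hop to keep up.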

Yes, it is possible. We are working on capturing audio from the mic and transcribing it in real time.

Like WebGPU? That would be awesome, if I understood it right.

KihongK avatar Jun 20 '24 02:06 KihongK

Any updates?

she7ata7 avatar Aug 13 '24 10:08 she7ata7

We've already used a Vosk model for real-time processing, but because of its low accuracy we really need the quality of Whisper's transcription in real time. However, inference takes too long (about 3 seconds on average). Is there any solution we can use?

salehsoleimani avatar Oct 27 '24 10:10 salehsoleimani
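Whether an inference time of about 3 seconds is too slow depends on how much audio each inference covers and how often a new chunk is sent to the model. A minimal sketch of that arithmetic follows; the function names and the example chunk lengths are assumptions, since the thread does not say how long the transcribed audio was.

```python
def real_time_factor(inference_s: float, audio_s: float) -> float:
    """Processing time divided by the duration of the audio processed.

    Values below 1.0 mean the model is faster than real time.
    """
    if audio_s <= 0:
        raise ValueError("audio_s must be positive")
    return inference_s / audio_s


def keeps_up(inference_s: float, hop_s: float) -> bool:
    """True if each inference finishes before the next chunk is due.

    `hop_s` is the interval, in seconds, between chunks sent to the model.
    """
    return inference_s <= hop_s


# Assumed example: 3 s of inference on a 30 s chunk.
print(real_time_factor(3.0, 30.0))
# Assumed example: 3 s of inference with a new chunk every 1 s.
print(keeps_up(3.0, 1.0))
```

Under these assumptions, a 3-second inference can keep up if chunks arrive every 3 seconds or more, but captions would then lag the speaker by at least that long.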

Any updates?

up

salehsoleimani avatar Nov 05 '24 09:11 salehsoleimani

Use the whisper-v3-turbo model. It's 3x faster than the others. Also make sure you enable CUDA, or OpenCL on AMD.

eix128 avatar Nov 05 '24 11:11 eix128

Use the whisper-v3-turbo model. It's 3x faster than the others. Also make sure you enable CUDA, or OpenCL on AMD.

Thanks, but I searched and only found large-v3-turbo; I didn't find a tiny version of turbo :( I need a tiny model to run inference on mobile devices.

salehsoleimani avatar Nov 05 '24 11:11 salehsoleimani

Yes, it is possible. We are working on capturing audio from the mic and transcribing it in real time.

any updates?

salehsoleimani avatar Nov 05 '24 12:11 salehsoleimani

Well, is there any way to exclude other languages (Arabic, Indian languages, etc.) and load a model for only one language, so we can optimize it further? There is an option on the struct, but it doesn't optimize the model.

eix128 avatar Nov 05 '24 14:11 eix128

Well, is there any way to exclude other languages (Arabic, Indian languages, etc.) and load a model for only one language, so we can optimize it further? There is an option on the struct, but it doesn't optimize the model.

It already does so. This repository already has two TFLite models, one English-only and one multilingual, but I didn't notice much of a difference.

salehsoleimani avatar Nov 05 '24 14:11 salehsoleimani

That's English only. There should be language selection: if I want Hebrew, I should be able to choose it.

eix128 avatar Nov 06 '24 12:11 eix128

@salehsoleimani, did you try the 8-bit models? https://huggingface.co/ggerganov/whisper.cpp For example: large-v3-turbo-q8_0

eix128 avatar Nov 08 '24 23:11 eix128

@salehsoleimani, did you try the 8-bit models? https://huggingface.co/ggerganov/whisper.cpp For example: large-v3-turbo-q8_0

I guess the currently used model is a quantized version of Whisper. Isn't that so? @vilassn

salehsoleimani avatar Nov 11 '24 12:11 salehsoleimani

No, the default model uses float32 or bfloat16 afaik. There are also some more quantized models: https://huggingface.co/ctranslate2-4you/distil-whisper-small.en-ct2-bfloat16

eix128 avatar Nov 11 '24 13:11 eix128

No, the default model uses float32 or bfloat16 afaik. There are also some more quantized models: https://huggingface.co/ctranslate2-4you/distil-whisper-small.en-ct2-bfloat16

Oh, thanks. I'm gonna try the 8/5-bit models. Do you know whether TFLite versions of these models are available?

salehsoleimani avatar Nov 11 '24 14:11 salehsoleimani

Afaik you can convert to ggml, and that works well too.

Converting fine-tuned Whisper models to ggml:

python3 ./models/convert-h5-to-ggml.py whisper-large-v2-japanese-5k-steps whisper outputs

eix128 avatar Nov 11 '24 14:11 eix128

@salehsoleimani did you check the 8-bit models? Any news?

eix128 avatar Nov 19 '24 06:11 eix128

@salehsoleimani did you check the 8-bit models? Any news?

No, actually I've been told that quantization doesn't affect speed very much.

salehsoleimani avatar Nov 27 '24 20:11 salehsoleimani

@salehsoleimani try https://huggingface.co/distil-whisper/distil-large-v3 Distil-Whisper is 6x faster than regular v3. Can you try using Distil-Whisper?

eix128 avatar Nov 27 '24 20:11 eix128

Also try this and report back please: https://github.com/huggingface/distil-whisper They claim the code will run 6x faster than Whisper v3. Also, can someone tell me how to convert these to a whisper-compatible format? gguf? ggml? I don't know.

eix128 avatar Nov 27 '24 20:11 eix128

Also try this and report back please: https://github.com/huggingface/distil-whisper They claim the code will run 6x faster than Whisper v3.

sure thanks i'm gonna try it out

salehsoleimani avatar Nov 27 '24 20:11 salehsoleimani

Also try this and report back please: https://github.com/huggingface/distil-whisper They claim the code will run 6x faster than Whisper v3. Also, can someone tell me how to convert these to a whisper-compatible format? gguf? ggml? I don't know.

I don't think that's gonna work out. Whisper tiny (used in this repository) has 39M parameters. Unfortunately, the smallest Distil-Whisper model is distilled from whisper-small and has 166M params. That makes it faster than whisper-small (244M params), but not faster than whisper-tiny.

salehsoleimani avatar Dec 02 '24 09:12 salehsoleimani
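The parameter counts quoted above can be compared directly. The sketch below uses only the three figures from that comment; treating parameter count as a proxy for inference cost is an assumption (actual speed also depends on architecture, runtime, and hardware), and the dictionary keys are labels chosen here for illustration.

```python
# Parameter counts in millions, as quoted in the comment above.
PARAMS_M = {
    "whisper-tiny": 39,
    "distil-whisper (smallest)": 166,
    "whisper-small": 244,
}


def relative_size(model: str, baseline: str) -> float:
    """How many times larger `model` is than `baseline`, by parameter count."""
    return PARAMS_M[model] / PARAMS_M[baseline]


def weight_megabytes(model: str, bytes_per_param: float) -> float:
    """Approximate size of the weights alone (millions of params x bytes each)."""
    return PARAMS_M[model] * bytes_per_param


for name in PARAMS_M:
    ratio = relative_size(name, "whisper-tiny")
    fp32 = weight_megabytes(name, 4)  # float32: 4 bytes per parameter
    int8 = weight_megabytes(name, 1)  # 8-bit quantized: 1 byte per parameter
    print(f"{name}: {ratio:.2f}x tiny, ~{fp32:.0f} MB fp32, ~{int8:.0f} MB int8")
```

By this rough measure the smallest Distil-Whisper model is a bit over four times the size of whisper-tiny, which is consistent with the point that it is unlikely to be faster on a phone.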

@vilassn https://github.com/vilassn/whisper_android/issues/1#issuecomment-1744233582 were you successful with this? I see that you implemented some VAD code in the repo (vad.cpp), but I don't see it being used anywhere. I cloned the repo too and didn't see any real-time transcription in the sample.

saamerm avatar Feb 11 '25 10:02 saamerm
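For context on why VAD (voice activity detection) matters here: it lets an app skip silent audio so the model only runs when someone is speaking. The sketch below is a generic energy-threshold detector, not the method used in the repo's vad.cpp (the thread does not describe that code); the function names, the 20 ms frame length, and the threshold value are assumptions.

```python
import math
from typing import List, Sequence


def frame_rms(frame: Sequence[float]) -> float:
    """Root-mean-square amplitude of one frame of samples in [-1.0, 1.0]."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def speech_flags(
    samples: Sequence[float],
    frame_len: int = 320,      # 20 ms at 16 kHz (assumed frame size)
    threshold: float = 0.02,   # assumed; needs tuning per microphone
) -> List[bool]:
    """Mark each full frame as speech (True) if its RMS reaches the threshold.

    Trailing samples that do not fill a whole frame are ignored.
    """
    flags = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        flags.append(frame_rms(frame) >= threshold)
    return flags
```

An app could then send audio to the model only while recent frames are flagged as speech. A fixed energy threshold is easily fooled by background noise, so real deployments often use a trained VAD model instead.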

@vilassn #1 (comment) were you successful with this?

yes but pay me 1000 $ to tell you

salehsoleimani avatar Feb 11 '25 10:02 salehsoleimani

@salehsoleimani unfortunately I'm trying to use it for a free app for the Deaf: "BeAware Deaf Assistant", so I need to figure out how to do it the hard way. Do you have a video of it working well for you?

saamerm avatar Feb 11 '25 10:02 saamerm

@salehsoleimani unfortunately I'm trying to use it for a free app for the Deaf: "BeAware Deaf Assistant", so I need to figure out how to do it the hard way. Do you have a video of it working well for you?

Sorry, I was kidding. I couldn't figure out how to use Whisper in real time, but we managed with Vosk/Kaldi, which gives real-time captions with good speed but medium accuracy. Have you checked it out?

salehsoleimani avatar Feb 11 '25 10:02 salehsoleimani

No worries, @salehsoleimani. Vosk had low accuracy when I tried their Android demo from here: https://github.com/alphacep/vosk-android-demo Moreover, unlike Whisper and its models, it doesn't produce punctuation marks (. , ! ? "") at all, or correct sentence starts and ends.

saamerm avatar Feb 11 '25 10:02 saamerm

No worries, @salehsoleimani. Vosk had low accuracy when I tried their Android demo from here: https://github.com/alphacep/vosk-android-demo Moreover, unlike Whisper and its models, it doesn't produce punctuation marks (. , ! ? "") at all, or correct sentence starts and ends.

Yeah, it's because of how Whisper works: it processes speech sentence by sentence, each one starting where the previous sentence ended and ending with a period or other end of sentence, which makes it perfect for this. Unfortunately I couldn't get it to run in real time on mobile :( I'm sorry.

salehsoleimani avatar Feb 11 '25 11:02 salehsoleimani

Try out https://github.com/eix128/WhisperJET This library is the best for Java, including Android.

eix128 avatar Feb 11 '25 12:02 eix128

Try out https://github.com/eix128/WhisperJET This library is the best for Java, including Android.

oh yeah! :) thanks a lot I'm gonna try this out

salehsoleimani avatar Feb 11 '25 18:02 salehsoleimani