
Improving Speed for Realtime Applications (On CPU), since it seems vosk-api is single threaded?

Open accessiblepixel opened this issue 2 years ago • 9 comments

Hey folks :)

I'm writing a program using the python version of the vosk-api and I have a few questions.

I'm wondering if there's any way to make the recogniser faster? I'm using the large 1.8G model en-us (latest) but sometimes it lags behind the realtime stream from the microphone.

I was thinking I could just throw more CPU cores at it, but it appears that the recogniser is limited to a single thread, or am I mistaken there?

Are there any settings or config tweaks I can use to get a bit more performance out of it? I'm running it on an Intel i7-4790K at 4 GHz (I do have the option of running it on a 24-core Xeon system at 2.4 GHz per core, but given it's single threaded, I don't think that'll give me any faster recognitions).

I started using the test_microphone.py script; is there anything I can change about it to get just a bit more speed from the recogniser, so it can keep up with realtime audio? It's okay if there are only a few words spoken, but it falls behind realtime if a long sentence is spoken.

I read in another issue that changing the sampling rate to 16,000 Hz should improve speed, but isn't test_microphone.py already doing that?

Can anyone point me in the right direction? I've read all the documentation, and there isn't a decent list of advanced options or tweaks to give it that bit of a performance boost.

I appreciate any pointers you can give me, because I'm a bit lost. Thanks in advance :)

Kind regards, Jessica.

accessiblepixel avatar Jan 01 '22 20:01 accessiblepixel

Hey. You can reduce the beams in conf/model.conf. You can also remove the rnnlm folder; it will be slightly less accurate but much faster.
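
For reference, a sketch of what that looks like (the exact options and values vary per model; the numbers below are illustrative, not taken from this thread). A model's conf/model.conf holds Kaldi decoder options, and lowering the beam values prunes the search harder, trading some accuracy for speed:

```
--min-active=200
--max-active=3000
--beam=10.0
--lattice-beam=4.0
```

Shipped values are typically around --beam=13.0 and --lattice-beam=6.0; try lowering them gradually and re-checking accuracy on your own audio.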

Let us know if it works for you

nshmyrev avatar Jan 01 '22 23:01 nshmyrev

In general, there are many ways to improve. It depends on how much effort you want to put into it.

nshmyrev avatar Jan 01 '22 23:01 nshmyrev

I don't mind some effort, not at all, but I suppose it all has to be effort vs convenience.

Thanks for giving me those options to have a play with. If there's any other suggestions that might help, I'm all for it.

Python isn't my speciality, but if it would be useful for the project, I'm willing to write up some help files on improving speed or accuracy, and on which options trade off between the two, to help other folks get started a bit more easily.

Look forward to hearing any ideas you have, again many thanks.

Kind regards, Jessica

accessiblepixel avatar Jan 05 '22 14:01 accessiblepixel

I'm facing the same issue: the single-threaded recognition must be something in the Kaldi (kaldi-asr.org) config used by Vosk, or a simple parameter in the calls...

fbobraga avatar Mar 22 '22 13:03 fbobraga

Hi! @nshmyrev I'm using vosk-model-en-us-0.22-lgraph and it takes around 2 secs to run test_simple.py with a wav input of around 5-8 words. I'm using a MacBook Air with a dual-core Intel Core i5 processor.

How do I further improve the performance? As mentioned in the issue, I tried reducing the beams, but there is no considerable difference on sentences of 5-8 words; the time taken just changes from 2.5 to 2.1 or 1.9 secs. Is there anything else you would suggest?

Ramanibharathi avatar May 10 '22 13:05 Ramanibharathi

@Ramanibharathi I can take a look, but in general some of the time is spent in initialization. Did you try preloading the model? Or is it 2 seconds after the model is already loaded?
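
For reference, a minimal sketch of preloading with the Python API (the model folder and WAV file names are placeholders): create the Model once at startup and time only the decoding.

```python
import json
import time
import wave

from vosk import Model, KaldiRecognizer

# Load once at startup -- this is the slow part (seconds for large models).
model = Model("vosk-model-en-us-0.22-lgraph")

def transcribe(path):
    """Recognition only; the preloaded model above is reused across calls."""
    wf = wave.open(path, "rb")
    rec = KaldiRecognizer(model, wf.getframerate())
    start = time.monotonic()
    while True:
        data = wf.readframes(4000)
        if len(data) == 0:
            break
        rec.AcceptWaveform(data)
    result = json.loads(rec.FinalResult())
    print(f"decoded in {time.monotonic() - start:.2f}s: {result['text']}")

transcribe("test.wav")
```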

nshmyrev avatar May 10 '22 17:05 nshmyrev

Ah! I didn't think about loading the model into memory! Apologies for that. I tested it now: initialisation takes around 1-2 seconds, and the transcript is generated in 0.1-0.9 seconds.

Ramanibharathi avatar May 11 '22 08:05 Ramanibharathi

Hi Jessica

I was thinking I could just throw more CPU cores at it, but it appears that the recogniser is limited to a single thread, or am I mistaken there?

@nshmyrev can you confirm this (single thread)? I guess that's correct.

Are there any settings or config tweaks I can use to get a bit more performance out of it? I'm running it on an intel i7 4790k at 4GHz (but do have the potential to run it on a 24 core Xeon system at 2.4Ghz per core, but given it's single threaded I don't think that'll give me any faster recognitions)

I tend to agree

I've read all the documentation and there's not a decent list of any advanced options or tweaks to use to just give it that bit of a performance boost.

I tend to agree

So, as others stated, I think you can optimize the language model you use.

BTW, what do you mean by "speed"? I assume you mean the latency of the run-time transcript. Almost a year ago I did some tests, which I documented here: https://github.com/solyarisoftware/voskJs/tree/master/tests

Broadly speaking, see these tests; they were done in a different context (an input file in a non-PCM format) instead of a streaming input, but you get all the timing details:

  • test 1: using a large language model, I measured elapsed times of almost 500/600 milliseconds for brief sentences on my PC (see details in the documentation): https://github.com/solyarisoftware/voskJs/tree/master/tests#without-a-grammar

  • test 2: but the good news is that you can obtain elapsed times << 100 msecs if you use models that allow grammars (see the sketch after this list): https://github.com/solyarisoftware/voskJs/tree/master/tests#with-a-grammar
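
A minimal sketch of the grammar feature in the Python API (the phrase list here is made up): passing a JSON array of phrases as a third argument to KaldiRecognizer restricts the search space, which is what makes grammar-capable models so fast for command-style input.

```python
import json

from vosk import Model, KaldiRecognizer

# Grammar needs a model with a dynamic graph, e.g. an "lgraph" model.
model = Model("vosk-model-en-us-0.22-lgraph")

# Restrict recognition to a fixed phrase list; "[unk]" absorbs anything else.
phrases = ["turn the lights on", "turn the lights off", "[unk]"]
rec = KaldiRecognizer(model, 16000, json.dumps(phrases))
# rec is then fed audio with AcceptWaveform() exactly as usual.
```

With the search limited to a handful of phrases, decoding latency drops sharply compared to open-vocabulary recognition, which matches the << 100 msecs numbers above.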

BTW, please give a star if you like the VoskJs project (unfortunately, I haven't maintained it for about a year).

I hope this helps. Giorgio

solyarisoftware avatar May 11 '22 09:05 solyarisoftware

Thank you everyone for the suggestions so far.

@nshmyrev you mention loading the model (preloading) into memory. How would you go about that, or is it done by default? I know you need a fair bit of RAM to initially load the recogniser.

At the minute I have it running basically as a websocket interface for the recogniser output, taking in audio from a microphone. Once it's finished the current detection (usually after a sentence or a natural speech pause), it does the recognition and sends that out over the websocket. But it often slows down if the sentence is quite long (hence asking for ways to improve performance): a long sentence takes a bit longer to recognise, and then if two shorter sentences come in, they end up pushing the longer sentence off the page.
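
For reference, a minimal sketch of that kind of loop with the Python API, along the lines of test_microphone.py (the model path is a placeholder and print stands in for the websocket send): AcceptWaveform() returns True at an endpoint such as a speech pause, and in between, PartialResult() already holds a running hypothesis that can be pushed out incrementally instead of waiting for the full sentence.

```python
import json
import queue

import sounddevice as sd
from vosk import Model, KaldiRecognizer

q = queue.Queue()

def callback(indata, frames, time, status):
    # Audio callback: hand raw PCM blocks to the recognition loop.
    q.put(bytes(indata))

model = Model("model")  # preloaded once at startup
rec = KaldiRecognizer(model, 16000)

with sd.RawInputStream(samplerate=16000, blocksize=8000, dtype="int16",
                       channels=1, callback=callback):
    while True:
        data = q.get()
        if rec.AcceptWaveform(data):
            # Endpoint detected: a full utterance is ready.
            print(json.loads(rec.Result())["text"])
        else:
            # Running hypothesis; could be sent over the websocket
            # as live partial captions.
            print(json.loads(rec.PartialResult())["partial"])
```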

I'm considering seeing if I can run it on the system with the best single-core clock I have (which is my i7, clocked at 4 GHz without turbo, or about 4.2 GHz with turbo).

If I load the model into memory (the 1.8 GB US one), will that also improve recognition times? I have a server with 72 GB of memory, so that wouldn't be much of an issue.

Also, another idea: is there a way I can have it only ever process a certain number of seconds or words, so it would output maybe up to 10 words at a time, rather than trying to process the whole sentence and making it difficult for a viewer to read?

(I'm using it for realtime speech to text conversions for my streaming videos).
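
On the "up to 10 words at a time" idea above: as far as I know there is no built-in option for that, but a small post-processing sketch (chunk_words and send_over_websocket are hypothetical names) could split each finished transcript into short lines before sending:

```python
def chunk_words(text, size=10):
    """Split a transcript into lines of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

# e.g. for each final result:
# for line in chunk_words(json.loads(rec.Result())["text"]):
#     send_over_websocket(line)  # hypothetical send function
```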

Thanks everyone for the input so far and I look forward to hearing any more ideas you might have :)

Kind regards, Jessica

accessiblepixel avatar May 11 '22 12:05 accessiblepixel