whisper.cpp
"Double" the performance
Well, it uses 50% less power - that's "double" the performance. Basically, instead of using spinlocks, I made whisper.cpp use condition variables with mutexes.
Whisper is not really a low-latency system, so busy locks aren't the best choice for synchronisation. All the more so because whisper.cpp is also supposed to run on the web and on mobile devices, where users usually care about power usage. In this PR, I made whisper.cpp use the classical condition variable + mutex locking scheme instead. On a 12900KS without overclocking, this reduces CPU usage (and hence power consumption) by half. On the other hand, if we go for full 100% utilization, computation time is reduced by about 25%. Performance tables below.
This is a draft because I haven't implemented the lock using pthreads yet, and the current Windows implementation is rather naive and suboptimal. I have also yet to optimize the computations themselves.
Original version, 24 threads (95% utilization)
Running ggml_mul_mat benchmark with 24 threads
ggml_mul_mat: 64 x 64: F16 0.3 GFLOPS (128 runs) / F32 0.2 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 2.7 GFLOPS (128 runs) / F32 1.5 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 10.1 GFLOPS (128 runs) / F32 19.1 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 61.7 GFLOPS (128 runs) / F32 74.5 GFLOPS (128 runs)
ggml_mul_mat: 1024 x 1024: F16 138.7 GFLOPS ( 65 runs) / F32 162.6 GFLOPS ( 76 runs)
ggml_mul_mat: 2048 x 2048: F16 184.7 GFLOPS ( 11 runs) / F32 192.5 GFLOPS ( 12 runs)
ggml_mul_mat: 4096 x 4096: F16 174.4 GFLOPS ( 3 runs) / F32 94.8 GFLOPS ( 3 runs)
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
12th Gen Intel(R) Core(TM) i9-12900KS | Microsoft Windows 11 Pro | AVX2 | tiny | 24 | 98 | 374 | fd83fb2 |
12th Gen Intel(R) Core(TM) i9-12900KS | Microsoft Windows 11 Pro | AVX2 | base | 24 | 153 | 1023 | fd83fb2 |
12th Gen Intel(R) Core(TM) i9-12900KS | Microsoft Windows 11 Pro | AVX2 | small | 24 | 437 | 2896 | fd83fb2 |
12th Gen Intel(R) Core(TM) i9-12900KS | Microsoft Windows 11 Pro | AVX2 | medium | 24 | 1301 | 8510 | fd83fb2 |
12th Gen Intel(R) Core(TM) i9-12900KS | Microsoft Windows 11 Pro | AVX2 | large | 24 | 2563 | 16643 | fd83fb2 |
New version, 24 threads (50% utilization)
Running ggml_mul_mat benchmark with 24 threads
ggml_mul_mat: 64 x 64: F16 0.4 GFLOPS (128 runs) / F32 0.4 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 3.0 GFLOPS (128 runs) / F32 2.9 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 21.4 GFLOPS (128 runs) / F32 20.3 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 95.0 GFLOPS (128 runs) / F32 100.3 GFLOPS (128 runs)
ggml_mul_mat: 1024 x 1024: F16 180.4 GFLOPS ( 84 runs) / F32 203.9 GFLOPS ( 95 runs)
ggml_mul_mat: 2048 x 2048: F16 207.3 GFLOPS ( 13 runs) / F32 179.8 GFLOPS ( 11 runs)
ggml_mul_mat: 4096 x 4096: F16 182.9 GFLOPS ( 3 runs) / F32 107.1 GFLOPS ( 3 runs)
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
12th Gen Intel(R) Core(TM) i9-12900KS | Microsoft Windows 11 Pro | AVX2 | tiny | 24 | 97 | 324 | de49899 |
12th Gen Intel(R) Core(TM) i9-12900KS | Microsoft Windows 11 Pro | AVX2 | base | 24 | 158 | 689 | de49899 |
12th Gen Intel(R) Core(TM) i9-12900KS | Microsoft Windows 11 Pro | AVX2 | small | 24 | 437 | 2384 | de49899 |
12th Gen Intel(R) Core(TM) i9-12900KS | Microsoft Windows 11 Pro | AVX2 | medium | 24 | 1301 | 8923 | de49899 |
12th Gen Intel(R) Core(TM) i9-12900KS | Microsoft Windows 11 Pro | AVX2 | large | 24 | 2540 | 16748 | de49899 |
Old version, 120 threads (99% utilization)
Running ggml_mul_mat benchmark with 120 threads
ggml_mul_mat: 64 x 64: F16 0.0 GFLOPS ( 3 runs) / F32 0.0 GFLOPS ( 3 runs)
ggml_mul_mat: 128 x 128: F16 0.0 GFLOPS ( 3 runs) / F32 0.0 GFLOPS ( 3 runs)
ggml_mul_mat: 256 x 256: F16 0.0 GFLOPS ( 3 runs) / F32 0.0 GFLOPS ( 3 runs)
ggml_mul_mat: 512 x 512: F16 0.1 GFLOPS ( 3 runs) / F32 0.2 GFLOPS ( 3 runs)
ggml_mul_mat: 1024 x 1024: F16 1.6 GFLOPS ( 3 runs) / F32 1.2 GFLOPS ( 3 runs)
ggml_mul_mat: 2048 x 2048: F16 12.3 GFLOPS ( 3 runs) / F32 12.6 GFLOPS ( 3 runs)
ggml_mul_mat: 4096 x 4096: F16 68.5 GFLOPS ( 3 runs) / F32 50.5 GFLOPS ( 3 runs)
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
12th Gen Intel(R) Core(TM) i9-12900KS | Microsoft Windows 11 Pro | AVX2 | tiny | 120 | 96 | 78836 | fd83fb2 |
12th Gen Intel(R) Core(TM) i9-12900KS | Microsoft Windows 11 Pro | AVX2 | base | 120 | 157 | 113952 | fd83fb2 |
A while it took, indeed.
New version, 120 threads (90% utilization)
Running ggml_mul_mat benchmark with 120 threads
ggml_mul_mat: 64 x 64: F16 0.1 GFLOPS (106 runs) / F32 0.1 GFLOPS (106 runs)
ggml_mul_mat: 128 x 128: F16 0.4 GFLOPS (106 runs) / F32 0.4 GFLOPS (104 runs)
ggml_mul_mat: 256 x 256: F16 3.5 GFLOPS (106 runs) / F32 3.5 GFLOPS (104 runs)
ggml_mul_mat: 512 x 512: F16 25.1 GFLOPS ( 94 runs) / F32 25.7 GFLOPS ( 96 runs)
ggml_mul_mat: 1024 x 1024: F16 129.0 GFLOPS ( 61 runs) / F32 127.7 GFLOPS ( 60 runs)
ggml_mul_mat: 2048 x 2048: F16 248.5 GFLOPS ( 15 runs) / F32 179.5 GFLOPS ( 11 runs)
ggml_mul_mat: 4096 x 4096: F16 191.5 GFLOPS ( 3 runs) / F32 121.3 GFLOPS ( 3 runs)
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
12th Gen Intel(R) Core(TM) i9-12900KS | Microsoft Windows 11 Pro | AVX2 | tiny | 120 | 98 | 583 | de49899 |
12th Gen Intel(R) Core(TM) i9-12900KS | Microsoft Windows 11 Pro | AVX2 | base | 120 | 158 | 972 | de49899 |
12th Gen Intel(R) Core(TM) i9-12900KS | Microsoft Windows 11 Pro | AVX2 | small | 120 | 435 | 2588 | de49899 |
12th Gen Intel(R) Core(TM) i9-12900KS | Microsoft Windows 11 Pro | AVX2 | medium | 120 | 1296 | 7457 | de49899 |
12th Gen Intel(R) Core(TM) i9-12900KS | Microsoft Windows 11 Pro | AVX2 | large | 120 | 2540 | 12715 | de49899 |
I also added a script that automatically runs all benchmarks on Windows. It is simply the existing shell script converted to PowerShell.
And here's the pthread version. Now this should be mergeable, though as I wrote, I am planning further optimizations. macOS tables below.
Original version, 6 threads (75% utilization)
Running ggml_mul_mat benchmark with 6 threads
ggml_mul_mat: 64 x 64: F16 6.0 GFLOPS (128 runs) / F32 5.2 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 66.8 GFLOPS (128 runs) / F32 42.1 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 356.6 GFLOPS (128 runs) / F32 283.2 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 459.2 GFLOPS (128 runs) / F32 530.5 GFLOPS (128 runs)
ggml_mul_mat: 1024 x 1024: F16 937.4 GFLOPS (128 runs) / F32 1379.4 GFLOPS (128 runs)
ggml_mul_mat: 2048 x 2048: F16 1217.2 GFLOPS ( 71 runs) / F32 1557.5 GFLOPS ( 91 runs)
ggml_mul_mat: 4096 x 4096: F16 1695.6 GFLOPS ( 13 runs) / F32 1431.4 GFLOPS ( 11 runs)
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
Apple M1 Pro | macOS 13.2.1 (22D68) | NEON BLAS | tiny | 6 | 49 | 106 | fd83fb2 |
Apple M1 Pro | macOS 13.2.1 (22D68) | NEON BLAS | base | 6 | 64 | 196 | fd83fb2 |
Apple M1 Pro | macOS 13.2.1 (22D68) | NEON BLAS | small | 6 | 178 | 674 | fd83fb2 |
Apple M1 Pro | macOS 13.2.1 (22D68) | NEON BLAS | medium | 6 | 558 | 1940 | fd83fb2 |
Apple M1 Pro | macOS 13.2.1 (22D68) | NEON BLAS | large | 6 | 1246 | 3547 | fd83fb2 |
Original version, 10 threads (90% utilization)
Running ggml_mul_mat benchmark with 10 threads
ggml_mul_mat: 64 x 64: F16 3.7 GFLOPS (128 runs) / F32 3.0 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 35.1 GFLOPS (128 runs) / F32 19.5 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 223.6 GFLOPS (128 runs) / F32 130.8 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 597.1 GFLOPS (128 runs) / F32 574.2 GFLOPS (128 runs)
ggml_mul_mat: 1024 x 1024: F16 773.1 GFLOPS (128 runs) / F32 528.9 GFLOPS (128 runs)
ggml_mul_mat: 2048 x 2048: F16 599.6 GFLOPS ( 35 runs) / F32 485.6 GFLOPS ( 29 runs)
ggml_mul_mat: 4096 x 4096: F16 1005.8 GFLOPS ( 8 runs) / F32 722.6 GFLOPS ( 6 runs)
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
Apple M1 Pro | macOS 13.2.1 (22D68) | NEON BLAS | tiny | 10 | 46 | 131 | fd83fb2 |
Apple M1 Pro | macOS 13.2.1 (22D68) | NEON BLAS | base | 10 | 65 | 286 | fd83fb2 |
Apple M1 Pro | macOS 13.2.1 (22D68) | NEON BLAS | small | 10 | 180 | 1105 | fd83fb2 |
Apple M1 Pro | macOS 13.2.1 (22D68) | NEON BLAS | medium | 10 | 526 | 3225 | fd83fb2 |
Apple M1 Pro | macOS 13.2.1 (22D68) | NEON BLAS | large | 10 | 1237 | 5546 | fd83fb2 |
New version, 6 threads (45% utilization)
Running ggml_mul_mat benchmark with 6 threads
[deadlock of some kind, I'll have to look into this]
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
Apple M1 Pro | macOS 13.2.1 (22D68) | NEON BLAS | tiny | 6 | 49 | 121 | a6ee46f |
Apple M1 Pro | macOS 13.2.1 (22D68) | NEON BLAS | base | 6 | 66 | 203 | a6ee46f |
Apple M1 Pro | macOS 13.2.1 (22D68) | NEON BLAS | small | 6 | 175 | 667 | a6ee46f |
Apple M1 Pro | macOS 13.2.1 (22D68) | NEON BLAS | medium | 6 | 507 | 1818 | a6ee46f |
Apple M1 Pro | macOS 13.2.1 (22D68) | NEON BLAS | large | 6 | 1315 | 3249 | a6ee46f |
New version, 10 threads (50% utilization)
Running ggml_mul_mat benchmark with 10 threads
[deadlock of some kind, I'll have to look into this]
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
Apple M1 Pro | macOS 13.2.1 (22D68) | NEON BLAS | tiny | 10 | 42 | 129 | a6ee46f |
Apple M1 Pro | macOS 13.2.1 (22D68) | NEON BLAS | base | 10 | 67 | 240 | a6ee46f |
Apple M1 Pro | macOS 13.2.1 (22D68) | NEON BLAS | small | 10 | 176 | 744 | a6ee46f |
Apple M1 Pro | macOS 13.2.1 (22D68) | NEON BLAS | medium | 10 | 545 | 1933 | a6ee46f |
Apple M1 Pro | macOS 13.2.1 (22D68) | NEON BLAS | large | 10 | 1419 | 3347 | a6ee46f |
I didn't expect performance on macOS to be so good. Anyway, it appears that for both versions, running with 6 threads is a performance sweet spot. The cool thing is that the new version uses about 40% less CPU while being about 9% faster.
This isn't ready yet, as, for some reason, the ggml_mul_mat benchmark deadlocks now. I'll look into this.
I wasn't able to measure the energy impact because the Activity Monitor is useless in that regard.
I just saw that ggml.c is copy-pasted to llama.cpp. I'll see if it improves performance there.
@janekb04 This is very nice work! I've long suspected that the existing spin-lock approach is not optimal, but my attempts at adding mutexes and condition variables gave worse performance overall.
I haven't tested and looked at the proposed changes, but the reported results look promising.
However, it is also important to measure the performance in the Decoder. It's different from the Encoder, since there we don't rely on Accelerate's sgemm and there are high-frequency ggml_mul_mat calls for smaller matrices.
There is no existing benchmark for the Decoder, but you can simply run the transcription for some of the sample audio files and look at the reported time/per at the end.
I will take a more detailed look in the following days.
> I just saw that ggml.c is copy-pasted to llama.cpp. I'll see if it improves performance there.
The ggml.c in llama.cpp has some new extra stuff added, and I haven't yet synchronized it with whisper.cpp.
You won't be able to copy-paste the ggml.c from here into llama.cpp - they are incompatible atm.
I will fix this soon.
For now, you can just re-apply your changes to the ggml.c in llama.cpp to see how the performance is there.
Mutexes and events that signal them are the best fit for most cases; they usually aren't very fast or precise, but when the exact timing of a wake-up doesn't matter, they offer the best performance. However, mutex/event and spinlock aren't the only options: there is also sleep/yield.
While not directly related, here's some research I did, and an example implementation, on using sleep/yield in place of spinlocking where latency and accuracy were the utmost priority (stable frametimes were needed). At least on Windows, it was still accurate to the point where I couldn't even measure below it (~10 µs / ~0.01 ms; even a call to QueryPerformanceCounter takes ~1 µs). It was way faster and more accurate than that use case actually needed, while using basically zero power, making it vastly more efficient (so less power hungry) than the spinlock alternative.
I didn't really look into what the problem is or whether this is applicable; I'm just dropping this info here in case someone needs it, as the "third option" isn't as readily found by a simple Google search.
I am currently developing a realtime await-async system for C++ that works just as described here. It has even better latency, because a job switch there is on the order of tens of nanoseconds. However, it is at a rather early stage and very unstable. There are also existing systems that work like this (a few are in Boost: Asio, coroutines, fibers). Unfortunately, they are for C++, which allows for some nice syntactic sugar that wouldn't be possible in C (especially in mine, as I overload the co_await operators; it has a bloated metaprogramming implementation, but the user code looks very similar to Python or JavaScript). I don't know if it would be feasible to introduce that here.
As far as I understand the code, the current work scheduling is less than ideal. The main thread launches some N threads. Then, it creates the "compute graph". I assume that it is a DAG, with each node representing some computation, and that it is topologically sorted before the main for (int i = 0; i < cgraph->n_nodes; i++) loop. The loop goes through all the nodes sequentially. If the "compute graph" is indeed a sorted DAG, here comes the optimization: instead of going "for node in graph: for task in node:", the tasks of nodes that don't depend on each other could run concurrently. This would mean that fundamentally, the code would work like:
Main thread:

```cpp
compute_graph G; // topologically sorted
multithreaded_queue<task> Q;
for (node& n : G) {
    // dependency_count is the number of incoming edges,
    // i.e. the number of dependencies
    if (n.dependency_count.nonatomic_load() > 0)
        break;
    Q.batch_enqueue(n.tasks);
}
Q.start_working();
execute_work();
// cleanup
return [the result];
```

Worker threads run the execute_work function:

```cpp
Q.wait_for_start_working_blocking();
while (!Q.done()) {
    task to_do = Q.pop_blocking();
    execute(to_do);
    // if this was the last task of this node, the node has completed
    if (to_do.node.task_count.atomic_fetch_sub(1) == 1) {
        // so all of the node's dependents have one dependency fewer
        for (node& n : to_do.node.dependents) {
            // if the current node was this node's last dependency,
            // its tasks can now be enqueued for execution
            if (n.dependency_count.atomic_fetch_sub(1) == 1) {
                Q.batch_enqueue(n.tasks);
            }
        }
    }
}
```
This design should eliminate all the blocking and waiting and maximize the amount of time spent by the threads on executing useful work.
There are also a few minor things here and there. One I found is alloca, here:

```c
struct ggml_compute_state * workers = n_threads > 1 ? alloca(sizeof(struct ggml_compute_state)*(n_threads - 1)) : NULL;
```

Using alloca is confusing for the compiler: it no longer has a function frame with locals positioned at deterministic addresses. Instead, it has to do more address computations that depend on the size of the allocated memory block.
Hi there -- Do you think your original changes will still work with llama.cpp backported updates? It would be pretty cool to have two strong performance improvements in a row!
@JKeddo95 I took my time to read through the changes and pulled them in and as far as I can tell, this PR is still valid.
To me this looks to be a very clean and well thought-out PR. I fully agree with the implementation provided and think this is absolutely the proper way to go.
There are multiple ways to implement locking, but the lightweight mutexes used here are the best option in most cases.
Spinlocks are rarely the right option, namely only when at least one of these conditions applies:
- It's mission critical to release a lock as fast as possible, where the "when" needs nanosecond precision
- Thread context switching needs to be unequivocally disallowed
- Locking happens very often and for very short (< ~1 µs) periods, where the cost (execution time) of a context switch is larger than the time spent spinning
When even the CPU manufacturers themselves have, since at least 2011, advised against using spinlocks unless necessary, I'd take their word for it. But the beauty here is that we don't even have to take their word for it, since your performance tests confirm it to be true. Maybe 15 years ago there was a point in time when not letting the CPU sleep, and doing things like disabling C-states in the BIOS, could improve performance; this hasn't been the case for a long time now. Processor tech has developed a lot since then, and nowadays processors perform better when you let them sleep and give them headroom to manage the work. Especially now, as we reach the tail end of Moore's law, thermal and power limits are more of an issue than ever before. Processors are pushing what's possible to the limit, and they absolutely do gain performance from the ability to 'breathe' when program code reduces power use (== thermal output) with methods like this, rather than wasting cycles by unnecessarily spinning and keeping the thread pinned at 100% usage.
Pros:
- Uses lightweight mutexes, which are the best locking option for this use case (and most use cases, for that matter)
- Uses the low-level locking mechanisms of POSIX threads and Windows critical sections
- Doesn't use high-level abstractions like STL std::lock, which decrease performance by adding unnecessary code only to end up calling the same low-level functions anyway. Having less abstraction also means easier low-level debugging.
- Does all this in a minimal amount of code
- For this use case I don't see how this implementation could lose performance in any configuration; there should be only gains. The gains should scale up with less thermal (= power) headroom, with inadequately cooled and power-hungry configurations gaining the most. On an x86 laptop you'd have both situations, so those stand to gain the most.
Cons:
- None that I can think of
There's some more in-depth discussion about thread locking over in llama.cpp on this (now abandoned) PR: [llama.cpp] ggml: refactor compute thread: merge three spin variables into one #816
It's a long thread and not necessarily everything applies. For example, the proposition I made there about adding an #ifdef option for the different lock/sleep conditions I no longer agree with after a second thought, as it would add unnecessary code complexity with no real advantages. I also made a point there about being mindful of the cost of context switching, but that is clearly a non-issue here. To be clear, there's nothing from that discussion to be added to this PR, but there is some good context and information for those interested in digging deeper.
All in all, presentation-wise this is one of the best PRs I've ever seen anywhere in terms of reasoning and testing; the well-laid-out performance tests across multiple architectures and operating systems are just perfect, more than can reasonably be expected from anyone. In fact, I am bookmarking this PR as an example on the subject of "how to make a perfect PR".
Few clarifications I would like to ask about though:
> This is a draft because I haven't implemented the lock using pthreads yet, ...
You have now, so the PR description could be edited to reflect this?
> ... and the current Windows implementation is rather naive and suboptimal.
Can you elaborate on this? Looking through the code, it looks like you have an optimal solution. Regular mutexes are kernel objects, which require the thread to switch usermode->kernelmode->usermode whenever they are used, unlike critical sections, which can stay in usermode; afaik this is the reason they are so much faster. The CriticalSection/ConditionVariable paradigm is the fastest way to implement locking on Windows outside of spinlocks (which I wouldn't really call a mechanism anyway). The paradigm is widely used in low-level OS and kernel code, and my line of thinking usually goes that if something is good enough for low-level OS/kernel code, it's probably good, full stop.
> I am also yet to optimize the computations themselves.
As that is a different beast altogether, I'd say better to go ahead and merge this, rather than keep it unnecessarily blocked waiting for further developments, as on its own this PR looks ready to go. In my opinion, more & smaller PRs are a better option than fewer & larger ones anyway, as it makes maintenance easier and it's easier to revert small pieces when something goes wrong.
@anzz1 Thanks for the thoughtful evaluation. I updated the PR description.
Regarding the naïvety, I wrote that because I literally coded this off the top of my head, in one sitting. I just opened ggml.c, searched for "lock", and 2-3 hours later I was done. So I didn't really want to call this "optimal" or "complete", as it was something I quickly hacked together. Also, I thought that it could be better to use a completely different scheduling approach, but that would indeed be a "beast" of a rework. So, yes, it looks to be mergeable.
@ggerganov @janekb04 What would this require to get pulled into master? Happy to take @janekb04's work and clean it up / test it, if that is all that is required.
I tried this and it happens to be way slower in my setup:
CPU: 6-core AMD Ryzen 5 5560U with Radeon Graphics (-MT MCP-) speed/min/max: 2459/1600/4061 MHz
Kernel: 6.2.0-34-generic x86_64 Up: 47m Mem: 5465.5/12879.6 MiB (42.4%)
Storage: 465.76 GiB (5.9% used) Procs: 304 Shell: Bash inxi: 3.3.25
In my tests, it goes from around 75 s to process a 60-second audio file to 103-110 s (with 4 threads). Running top shows that usage moves between 250% and 380%.
When running 6 threads, it goes between 250% and 560%, and it takes 114 s.
The commit id in my logs shows 7a5a5fe86dfd9c3566b2c584a7553596bdae68ac.
Am I doing something wrong, or did this branch get outdated?