whisper.cpp
"Double" the performance
Well, it uses 50% less power - that's "double" the performance. Basically, instead of using spinlocks, I made whisper.cpp use condition variables with mutexes.
Whisper is not really a low-latency system, so busy locks aren't the best choice for synchronisation. All the more so because whisper.cpp is also supposed to run on the web and on mobile devices, where users usually care about power usage. In this PR, I made whisper.cpp use the classical condition variable + mutex locking scheme instead. On a 12900KS without overclocking, this reduces CPU usage (and hence power consumption) by half. On the other hand, if we go for full 100% utilization, computation time is reduced by about 25%. Performance tables below.
This is a draft because I haven't implemented the lock using pthreads yet, and the current Windows implementation is rather naive and suboptimal. I have also yet to optimize the computations themselves.
Original version, 24 threads (95% utilization)
Running ggml_mul_mat benchmark with 24 threads
ggml_mul_mat: 64 x 64: F16 0.3 GFLOPS (128 runs) / F32 0.2 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 2.7 GFLOPS (128 runs) / F32 1.5 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 10.1 GFLOPS (128 runs) / F32 19.1 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 61.7 GFLOPS (128 runs) / F32 74.5 GFLOPS (128 runs)
ggml_mul_mat: 1024 x 1024: F16 138.7 GFLOPS ( 65 runs) / F32 162.6 GFLOPS ( 76 runs)
ggml_mul_mat: 2048 x 2048: F16 184.7 GFLOPS ( 11 runs) / F32 192.5 GFLOPS ( 12 runs)
ggml_mul_mat: 4096 x 4096: F16 174.4 GFLOPS ( 3 runs) / F32 94.8 GFLOPS ( 3 runs)
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
12th Gen Intel(R) Core(TM) i9-12900KS | Microsoft Windows 11 Pro | AVX2 | tiny | 24 | 98 | 374 | fd83fb2 |
12th Gen Intel(R) Core(TM) i9-12900KS | Microsoft Windows 11 Pro | AVX2 | base | 24 | 153 | 1023 | fd83fb2 |
12th Gen Intel(R) Core(TM) i9-12900KS | Microsoft Windows 11 Pro | AVX2 | small | 24 | 437 | 2896 | fd83fb2 |
12th Gen Intel(R) Core(TM) i9-12900KS | Microsoft Windows 11 Pro | AVX2 | medium | 24 | 1301 | 8510 | fd83fb2 |
12th Gen Intel(R) Core(TM) i9-12900KS | Microsoft Windows 11 Pro | AVX2 | large | 24 | 2563 | 16643 | fd83fb2 |
New version, 24 threads (50% utilization)
Running ggml_mul_mat benchmark with 24 threads
ggml_mul_mat: 64 x 64: F16 0.4 GFLOPS (128 runs) / F32 0.4 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 3.0 GFLOPS (128 runs) / F32 2.9 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 21.4 GFLOPS (128 runs) / F32 20.3 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 95.0 GFLOPS (128 runs) / F32 100.3 GFLOPS (128 runs)
ggml_mul_mat: 1024 x 1024: F16 180.4 GFLOPS ( 84 runs) / F32 203.9 GFLOPS ( 95 runs)
ggml_mul_mat: 2048 x 2048: F16 207.3 GFLOPS ( 13 runs) / F32 179.8 GFLOPS ( 11 runs)
ggml_mul_mat: 4096 x 4096: F16 182.9 GFLOPS ( 3 runs) / F32 107.1 GFLOPS ( 3 runs)
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
12th Gen Intel(R) Core(TM) i9-12900KS | Microsoft Windows 11 Pro | AVX2 | tiny | 24 | 97 | 324 | de49899 |
12th Gen Intel(R) Core(TM) i9-12900KS | Microsoft Windows 11 Pro | AVX2 | base | 24 | 158 | 689 | de49899 |
12th Gen Intel(R) Core(TM) i9-12900KS | Microsoft Windows 11 Pro | AVX2 | small | 24 | 437 | 2384 | de49899 |
12th Gen Intel(R) Core(TM) i9-12900KS | Microsoft Windows 11 Pro | AVX2 | medium | 24 | 1301 | 8923 | de49899 |
12th Gen Intel(R) Core(TM) i9-12900KS | Microsoft Windows 11 Pro | AVX2 | large | 24 | 2540 | 16748 | de49899 |
Old version, 120 threads (99% utilization)
Running ggml_mul_mat benchmark with 120 threads
ggml_mul_mat: 64 x 64: F16 0.0 GFLOPS ( 3 runs) / F32 0.0 GFLOPS ( 3 runs)
ggml_mul_mat: 128 x 128: F16 0.0 GFLOPS ( 3 runs) / F32 0.0 GFLOPS ( 3 runs)
ggml_mul_mat: 256 x 256: F16 0.0 GFLOPS ( 3 runs) / F32 0.0 GFLOPS ( 3 runs)
ggml_mul_mat: 512 x 512: F16 0.1 GFLOPS ( 3 runs) / F32 0.2 GFLOPS ( 3 runs)
ggml_mul_mat: 1024 x 1024: F16 1.6 GFLOPS ( 3 runs) / F32 1.2 GFLOPS ( 3 runs)
ggml_mul_mat: 2048 x 2048: F16 12.3 GFLOPS ( 3 runs) / F32 12.6 GFLOPS ( 3 runs)
ggml_mul_mat: 4096 x 4096: F16 68.5 GFLOPS ( 3 runs) / F32 50.5 GFLOPS ( 3 runs)
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
12th Gen Intel(R) Core(TM) i9-12900KS | Microsoft Windows 11 Pro | AVX2 | tiny | 120 | 96 | 78836 | fd83fb2 |
12th Gen Intel(R) Core(TM) i9-12900KS | Microsoft Windows 11 Pro | AVX2 | base | 120 | 157 | 113952 | fd83fb2 |
A while it took, indeed.
New version, 120 threads (90% utilization)
Running ggml_mul_mat benchmark with 120 threads
ggml_mul_mat: 64 x 64: F16 0.1 GFLOPS (106 runs) / F32 0.1 GFLOPS (106 runs)
ggml_mul_mat: 128 x 128: F16 0.4 GFLOPS (106 runs) / F32 0.4 GFLOPS (104 runs)
ggml_mul_mat: 256 x 256: F16 3.5 GFLOPS (106 runs) / F32 3.5 GFLOPS (104 runs)
ggml_mul_mat: 512 x 512: F16 25.1 GFLOPS ( 94 runs) / F32 25.7 GFLOPS ( 96 runs)
ggml_mul_mat: 1024 x 1024: F16 129.0 GFLOPS ( 61 runs) / F32 127.7 GFLOPS ( 60 runs)
ggml_mul_mat: 2048 x 2048: F16 248.5 GFLOPS ( 15 runs) / F32 179.5 GFLOPS ( 11 runs)
ggml_mul_mat: 4096 x 4096: F16 191.5 GFLOPS ( 3 runs) / F32 121.3 GFLOPS ( 3 runs)
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
12th Gen Intel(R) Core(TM) i9-12900KS | Microsoft Windows 11 Pro | AVX2 | tiny | 120 | 98 | 583 | de49899 |
12th Gen Intel(R) Core(TM) i9-12900KS | Microsoft Windows 11 Pro | AVX2 | base | 120 | 158 | 972 | de49899 |
12th Gen Intel(R) Core(TM) i9-12900KS | Microsoft Windows 11 Pro | AVX2 | small | 120 | 435 | 2588 | de49899 |
12th Gen Intel(R) Core(TM) i9-12900KS | Microsoft Windows 11 Pro | AVX2 | medium | 120 | 1296 | 7457 | de49899 |
12th Gen Intel(R) Core(TM) i9-12900KS | Microsoft Windows 11 Pro | AVX2 | large | 120 | 2540 | 12715 | de49899 |
I also added a script that automatically runs all benchmarks on Windows. It is simply the existing shell script converted to PowerShell.
And here's the pthread version. Now this should be mergeable, though as I wrote, I am planning further optimizations. macOS tables below.
Original version, 6 threads (75% utilization)
Running ggml_mul_mat benchmark with 6 threads
ggml_mul_mat: 64 x 64: F16 6.0 GFLOPS (128 runs) / F32 5.2 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 66.8 GFLOPS (128 runs) / F32 42.1 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 356.6 GFLOPS (128 runs) / F32 283.2 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 459.2 GFLOPS (128 runs) / F32 530.5 GFLOPS (128 runs)
ggml_mul_mat: 1024 x 1024: F16 937.4 GFLOPS (128 runs) / F32 1379.4 GFLOPS (128 runs)
ggml_mul_mat: 2048 x 2048: F16 1217.2 GFLOPS ( 71 runs) / F32 1557.5 GFLOPS ( 91 runs)
ggml_mul_mat: 4096 x 4096: F16 1695.6 GFLOPS ( 13 runs) / F32 1431.4 GFLOPS ( 11 runs)
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
Apple M1 Pro | macOS 13.2.1 (22D68) | NEON BLAS | tiny | 6 | 49 | 106 | fd83fb2 |
Apple M1 Pro | macOS 13.2.1 (22D68) | NEON BLAS | base | 6 | 64 | 196 | fd83fb2 |
Apple M1 Pro | macOS 13.2.1 (22D68) | NEON BLAS | small | 6 | 178 | 674 | fd83fb2 |
Apple M1 Pro | macOS 13.2.1 (22D68) | NEON BLAS | medium | 6 | 558 | 1940 | fd83fb2 |
Apple M1 Pro | macOS 13.2.1 (22D68) | NEON BLAS | large | 6 | 1246 | 3547 | fd83fb2 |
Original version, 10 threads (90% utilization)
Running ggml_mul_mat benchmark with 10 threads
ggml_mul_mat: 64 x 64: F16 3.7 GFLOPS (128 runs) / F32 3.0 GFLOPS (128 runs)
ggml_mul_mat: 128 x 128: F16 35.1 GFLOPS (128 runs) / F32 19.5 GFLOPS (128 runs)
ggml_mul_mat: 256 x 256: F16 223.6 GFLOPS (128 runs) / F32 130.8 GFLOPS (128 runs)
ggml_mul_mat: 512 x 512: F16 597.1 GFLOPS (128 runs) / F32 574.2 GFLOPS (128 runs)
ggml_mul_mat: 1024 x 1024: F16 773.1 GFLOPS (128 runs) / F32 528.9 GFLOPS (128 runs)
ggml_mul_mat: 2048 x 2048: F16 599.6 GFLOPS ( 35 runs) / F32 485.6 GFLOPS ( 29 runs)
ggml_mul_mat: 4096 x 4096: F16 1005.8 GFLOPS ( 8 runs) / F32 722.6 GFLOPS ( 6 runs)
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
Apple M1 Pro | macOS 13.2.1 (22D68) | NEON BLAS | tiny | 10 | 46 | 131 | fd83fb2 |
Apple M1 Pro | macOS 13.2.1 (22D68) | NEON BLAS | base | 10 | 65 | 286 | fd83fb2 |
Apple M1 Pro | macOS 13.2.1 (22D68) | NEON BLAS | small | 10 | 180 | 1105 | fd83fb2 |
Apple M1 Pro | macOS 13.2.1 (22D68) | NEON BLAS | medium | 10 | 526 | 3225 | fd83fb2 |
Apple M1 Pro | macOS 13.2.1 (22D68) | NEON BLAS | large | 10 | 1237 | 5546 | fd83fb2 |
New version, 6 threads (45% utilization)
Running ggml_mul_mat benchmark with 6 threads
[deadlock of some kind, I'll have to look into this]
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
Apple M1 Pro | macOS 13.2.1 (22D68) | NEON BLAS | tiny | 6 | 49 | 121 | a6ee46f |
Apple M1 Pro | macOS 13.2.1 (22D68) | NEON BLAS | base | 6 | 66 | 203 | a6ee46f |
Apple M1 Pro | macOS 13.2.1 (22D68) | NEON BLAS | small | 6 | 175 | 667 | a6ee46f |
Apple M1 Pro | macOS 13.2.1 (22D68) | NEON BLAS | medium | 6 | 507 | 1818 | a6ee46f |
Apple M1 Pro | macOS 13.2.1 (22D68) | NEON BLAS | large | 6 | 1315 | 3249 | a6ee46f |
New version, 10 threads (50% utilization)
Running ggml_mul_mat benchmark with 10 threads
[deadlock of some kind, I'll have to look into this]
CPU | OS | Config | Model | Th | Load | Enc. | Commit |
---|---|---|---|---|---|---|---|
Apple M1 Pro | macOS 13.2.1 (22D68) | NEON BLAS | tiny | 10 | 42 | 129 | a6ee46f |
Apple M1 Pro | macOS 13.2.1 (22D68) | NEON BLAS | base | 10 | 67 | 240 | a6ee46f |
Apple M1 Pro | macOS 13.2.1 (22D68) | NEON BLAS | small | 10 | 176 | 744 | a6ee46f |
Apple M1 Pro | macOS 13.2.1 (22D68) | NEON BLAS | medium | 10 | 545 | 1933 | a6ee46f |
Apple M1 Pro | macOS 13.2.1 (22D68) | NEON BLAS | large | 10 | 1419 | 3347 | a6ee46f |
I didn't expect performance on macOS to be so good. Anyway, it appears that for both versions, running with 6 threads is a performance sweet spot. The cool thing is that the new version uses about 40% less CPU while being about 9% faster.
This isn't ready yet, as, for some reason, the ggml_mul_mat benchmark deadlocks now. I'll look into this.
I wasn't able to measure the energy impact because the Activity Monitor is useless in that regard.
I just saw that ggml.c is copy-pasted to llama.cpp. I'll see if it improves performance there.
@janekb04 This is very nice work! I've long suspected that the existing spin-lock approach is not optimal, but my attempts at adding mutexes and condition variables gave worse performance overall.
I haven't tested and looked at the proposed changes, but the reported results look promising.
However, it is also important to measure the performance in the Decoder. It's different from the Encoder, since there we don't rely on Accelerate's sgemm and there are high-frequency ggml_mul_mat calls for smaller matrices.
There is no existing benchmark for the Decoder, but you can simply run the transcription for some of the sample audio files and look at the reported time/per at the end.
I will take a more detailed look in the following days.
> I just saw that ggml.c is copy-pasted to llama.cpp. I'll see if it improves performance there.
The ggml.c in llama.cpp has some new extra stuff added, and I haven't yet synchronized it with whisper.cpp.
You won't be able to copy-paste the ggml.c from here into llama.cpp - they are incompatible atm.
I will fix this soon.
For now, you can just re-apply your changes to the ggml.c in llama.cpp to see how the performance is there.
Mutexes and events that signal them are the best fit for most cases; they usually aren't very fast or precise, but when the exact timing of a wake-up doesn't matter, they offer the best performance. However, mutex/event and spinlock aren't the only options: there is also sleep/yield.
While not directly related, here's some research I did, and an example implementation, on using sleep/yield in place of spinlocking where latency and accuracy were the utmost priority (stable frametimes were needed). At least on Windows, it was still accurate to the point where I couldn't even measure below it (~10 µs / ~0.01 ms; even a call to QueryPerformanceCounter takes ~1 µs). It was way faster and more accurate than that use case actually needed, while using basically zero power, making it vastly more efficient (so less power hungry) than the spinlock alternative.
I didn't really look into what the problem is or whether this is applicable; I'm just dropping this info here in case someone needs it, as the "third option" isn't as readily found by a simple Google search.
I am currently developing a realtime await-async system for C++ that works just as described here. It has even better latency, because a job switch there is on the order of tens of nanoseconds. However, it is at a rather early stage and very unstable. There are also existing systems that work like this (a few are in Boost: Asio, coroutines, fibers). Unfortunately, they are for C++, which allows for some nice syntactic sugar that wouldn't be possible in C (especially in mine, as I overload the co_await operators; it has a bloated metaprogramming implementation, but the user code looks very similar to Python or JavaScript). I don't know if it would be feasible to introduce that here.
As far as I understand the code, the current work scheduling is less than ideal. The main thread launches some N threads. Then, it creates the "compute graph". I assume that it is a DAG, with each node representing some computation, and that it is topologically sorted before the main for (int i = 0; i < cgraph->n_nodes; i++) loop. The loop goes through all the nodes sequentially. If the "compute graph" is indeed a sorted DAG, here comes the optimization: instead of going "for node in graph: for task in node:", the tasks of nodes that don't depend on each other could run concurrently. This would mean that fundamentally, the code would work like:
Main thread:

```cpp
compute_graph G; // topologically sorted
multithreaded_queue<task> Q;
for (node& n : G) {
    // dependency_count is the number of incoming edges,
    // i.e. the number of dependencies
    if (n.dependency_count.nonatomic_load() > 0)
        break;
    Q.batch_enqueue(n.tasks);
}
Q.start_working();
execute_work();
// cleanup
return [the result];
```

Worker threads run the execute_work function:

```cpp
Q.wait_for_start_working_blocking();
while (!Q.done()) {
    task to_do = Q.pop_blocking();
    execute(to_do);
    // if this was the last task of this node, the node has completed
    if (to_do.node.task_count.atomic_fetch_sub(1) == 1) {
        // so all of the node's dependents have one dependency fewer
        for (node& n : to_do.node.dependents) {
            // if the current node was this node's last dependency,
            // its tasks can now be enqueued for execution
            if (n.dependency_count.atomic_fetch_sub(1) == 1) {
                Q.batch_enqueue(n.tasks);
            }
        }
    }
}
```
This design should eliminate all the blocking and waiting and maximize the amount of time spent by the threads on executing useful work.
There are also a few minor things here and there. One I found is alloca, here:

```c
struct ggml_compute_state * workers = n_threads > 1 ? alloca(sizeof(struct ggml_compute_state)*(n_threads - 1)) : NULL;
```

Using alloca is confusing for the compiler: it no longer has a function frame with locals positioned at deterministic addresses. Instead, it has to do more address computations that depend on the size of the allocated memory block.
Hi there -- Do you think your original changes will still work with llama.cpp backported updates? It would be pretty cool to have two strong performance improvements in a row!
@JKeddo95 I took my time to read through the changes and pulled them in and as far as I can tell, this PR is still valid.
To me this looks to be a very clean and well thought-out PR. I fully agree with the implementation provided and think this is absolutely the proper way to go.
There are multiple ways to implement locking, but the lightweight mutexes used here are the best option in most cases.
Spinlocks are rarely the right option, namely only when at least one of these conditions applies:
- It's mission critical to release a lock as fast as possible, where the "when" needs nanosecond precision
- Thread context switching needs to be unequivocally disallowed
- Locking happens very often and for very short (< ~1 µs) periods, where the cost (execution time) of a context switch is larger than the time spent spinning
When even the CPU manufacturers themselves have, since at least 2011, advised against using spinlocks unless necessary, I'd take their word for it. But the beauty here is that we don't even have to take their word for it, since your performance tests confirm it to be true. Maybe 15 years ago there was a point in time when not letting the CPU sleep, and doing things like disabling C-states in the BIOS, could improve performance; this hasn't been the case for a long time now. Processor tech has developed a lot since then, and nowadays processors perform better when you let them sleep and give them headroom to manage the work. Especially now, as we reach the tail end of Moore's law, thermal and power limits are more of an issue than ever before. Processors are pushing what's possible to the limit, and they absolutely do gain performance from the ability to 'breathe' when program code reduces power use (== thermal output) with methods like this, rather than wasting cycles by unnecessarily spinning and keeping the thread pinned at 100% usage.
Pros:
- Uses lightweight mutexes, which are the best locking option for this use case (and most use cases, for that matter)
- Uses the low-level locking mechanisms of POSIX threads and Windows critical sections
- Doesn't use high-level abstractions like STL std::lock, which decrease performance by adding unnecessary code only to end up calling the same low-level functions anyway. Having less abstraction also means easier low-level debugging.
- Does all this in a minimal amount of code
- For this use case I don't see how this implementation could lose performance in any configuration; there should be only gains. The gains should scale up with less thermal (= power) headroom, with inadequately cooled and power-hungry configurations gaining the most. On an x86 laptop you'd have both situations, so those stand to gain the most.
Cons:
- None that I can think of
There's some more in-depth discussion about thread locking over in llama.cpp on this (now abandoned) PR: [llama.cpp] ggml: refactor compute thread: merge three spin variables into one #816
It's a long thread and not necessarily everything applies. For example, the proposition I made there about adding an #ifdef option for the different lock/sleep conditions I no longer agree with after a second thought, as it would add unnecessary code complexity with no real advantages. I also made a point there about being mindful of the cost of context switching, but that is clearly a non-issue here. To be clear, there's nothing from that discussion to be added to this PR, but there is some good context and information for those interested in digging deeper.
All in all, presentation-wise this is one of the best PRs I've ever seen anywhere in terms of reasoning and testing; the well-laid-out performance tests across multiple architectures and operating systems are just perfect, more than can reasonably be expected from anyone. In fact, I am bookmarking this PR as an example on the subject of "how to make a perfect PR".
Few clarifications I would like to ask about though:
> This is a draft because I haven't implemented the lock using pthreads yet, ...
You have now, so the PR description could be edited to reflect this?
> ... and the current Windows implementation is rather naive and suboptimal.
Can you elaborate on this? Looking through the code, it looks like you have an optimal solution. Regular mutexes are kernel objects, which require the thread to switch usermode->kernelmode->usermode whenever they are used, unlike critical sections, which can stay in usermode; afaik this is the reason they are so much faster. The CriticalSection/ConditionVariable paradigm is the fastest way to implement locking on Windows outside of spinlocks (which I wouldn't really call a mechanism anyway). The paradigm is widely used in low-level OS and kernel code, and my line of thinking usually goes that if something is good enough for low-level OS/kernel code, it's probably good, full stop.
> I am also yet to optimize the computations themselves.
As that is a different beast altogether, I'd say better to go ahead and merge this, rather than keep it unnecessarily blocked waiting for further developments, as on its own this PR looks ready to go. In my opinion, more & smaller PRs are a better option than fewer & larger ones anyway, as it makes maintenance easier and it's easier to revert small pieces when something goes wrong.
@anzz1 Thanks for the thoughtful evaluation. I updated the PR description.
Regarding the naïvety, I wrote that because I literally coded this off the top of my head, in one sitting. I just opened ggml.c, searched for "lock", and 2-3 hours later I was done. So I didn't really want to call this "optimal" or "complete", as it was something I quickly hacked together. Also, I thought that it could be better to use a completely different scheduling approach, but that would indeed be a "beast" of a rework. So, yes, it looks to be mergeable.
@ggerganov @janekb04 What would this require to get pulled into master? Happy to take @janekb04's work and clean it up / test it, if that is all that is required.
I tried this and it happens to be way slower in my setup:
CPU: 6-core AMD Ryzen 5 5560U with Radeon Graphics (-MT MCP-) speed/min/max: 2459/1600/4061 MHz
Kernel: 6.2.0-34-generic x86_64 Up: 47m Mem: 5465.5/12879.6 MiB (42.4%)
Storage: 465.76 GiB (5.9% used) Procs: 304 Shell: Bash inxi: 3.3.25
In my tests, it goes from around 75 s to process a 60-second audio file to 103-110 s (with 4 threads). Running top shows that usage moves between 250% and 380%.
When running 6 threads, it goes between 250% and 560%, and it takes 114 s.
The commit id in my logs shows 7a5a5fe86dfd9c3566b2c584a7553596bdae68ac.
Am I doing something wrong, or did this branch get outdated?