
"Double" the performance

Open janekb04 opened this issue 1 year ago • 15 comments

Well, it uses 50% less power, so that's "double" the performance. Basically, instead of using spinlocks, I made whisper.cpp use condition variables with mutexes.

Whisper is not really a low-latency system, which means busy-wait locks aren't the best choice for synchronisation, all the more so because whisper.cpp is also supposed to run on the web and on mobile devices, where users usually care about power usage. In this PR, I made whisper.cpp use the classic condition variable + mutex locking scheme instead. On a 12900KS without overclocking, this reduces CPU usage (and hence power consumption) by half. Alternatively, if we push for full 100% utilization, the computation time is reduced by about 25%. Performance tables below.
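For context, the condition variable + mutex pattern described above boils down to something like the following generic sketch (illustrative only, not the actual code from this PR; the counter and function names are made up):

#include <pthread.h>

// Workers decrement a shared counter; the coordinator sleeps on a condition
// variable until the counter reaches zero, instead of spinning on an atomic flag.
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int n_remaining = 0; // tasks still in flight (hypothetical name)

void task_finished(void) {
    pthread_mutex_lock(&lock);
    if (--n_remaining == 0) {
        pthread_cond_broadcast(&cond); // wake the waiting coordinator
    }
    pthread_mutex_unlock(&lock);
}

void wait_for_all_tasks(void) {
    pthread_mutex_lock(&lock);
    while (n_remaining > 0) {
        // releases the mutex while sleeping, reacquires it before returning
        pthread_cond_wait(&cond, &lock);
    }
    pthread_mutex_unlock(&lock);
}

The waiting thread consumes essentially no CPU while blocked, which is where the power saving comes from.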

This is a draft because I haven't implemented the lock using pthreads yet, and the current Windows implementation is rather naive and suboptimal. I am also yet to optimize the computations themselves.

janekb04 avatar Mar 26 '23 20:03 janekb04

Original version, 24 threads (95% utilization)

Running ggml_mul_mat benchmark with 24 threads

ggml_mul_mat:    64 x    64: F16      0.3 GFLOPS (128 runs) / F32      0.2 GFLOPS (128 runs)
ggml_mul_mat:   128 x   128: F16      2.7 GFLOPS (128 runs) / F32      1.5 GFLOPS (128 runs)
ggml_mul_mat:   256 x   256: F16     10.1 GFLOPS (128 runs) / F32     19.1 GFLOPS (128 runs)
ggml_mul_mat:   512 x   512: F16     61.7 GFLOPS (128 runs) / F32     74.5 GFLOPS (128 runs)
ggml_mul_mat:  1024 x  1024: F16    138.7 GFLOPS ( 65 runs) / F32    162.6 GFLOPS ( 76 runs)
ggml_mul_mat:  2048 x  2048: F16    184.7 GFLOPS ( 11 runs) / F32    192.5 GFLOPS ( 12 runs)
ggml_mul_mat:  4096 x  4096: F16    174.4 GFLOPS (  3 runs) / F32     94.8 GFLOPS (  3 runs)
| CPU | OS | Config | Model | Th | Load [ms] | Enc. [ms] | Commit |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 12th Gen Intel(R) Core(TM) i9-12900KS | Microsoft Windows 11 Pro | AVX2 | tiny | 24 | 98 | 374 | fd83fb2 |
| 12th Gen Intel(R) Core(TM) i9-12900KS | Microsoft Windows 11 Pro | AVX2 | base | 24 | 153 | 1023 | fd83fb2 |
| 12th Gen Intel(R) Core(TM) i9-12900KS | Microsoft Windows 11 Pro | AVX2 | small | 24 | 437 | 2896 | fd83fb2 |
| 12th Gen Intel(R) Core(TM) i9-12900KS | Microsoft Windows 11 Pro | AVX2 | medium | 24 | 1301 | 8510 | fd83fb2 |
| 12th Gen Intel(R) Core(TM) i9-12900KS | Microsoft Windows 11 Pro | AVX2 | large | 24 | 2563 | 16643 | fd83fb2 |

New version, 24 threads (50% utilization)

Running ggml_mul_mat benchmark with 24 threads

ggml_mul_mat:    64 x    64: F16      0.4 GFLOPS (128 runs) / F32      0.4 GFLOPS (128 runs)
ggml_mul_mat:   128 x   128: F16      3.0 GFLOPS (128 runs) / F32      2.9 GFLOPS (128 runs)
ggml_mul_mat:   256 x   256: F16     21.4 GFLOPS (128 runs) / F32     20.3 GFLOPS (128 runs)
ggml_mul_mat:   512 x   512: F16     95.0 GFLOPS (128 runs) / F32    100.3 GFLOPS (128 runs)
ggml_mul_mat:  1024 x  1024: F16    180.4 GFLOPS ( 84 runs) / F32    203.9 GFLOPS ( 95 runs)
ggml_mul_mat:  2048 x  2048: F16    207.3 GFLOPS ( 13 runs) / F32    179.8 GFLOPS ( 11 runs)
ggml_mul_mat:  4096 x  4096: F16    182.9 GFLOPS (  3 runs) / F32    107.1 GFLOPS (  3 runs)
| CPU | OS | Config | Model | Th | Load [ms] | Enc. [ms] | Commit |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 12th Gen Intel(R) Core(TM) i9-12900KS | Microsoft Windows 11 Pro | AVX2 | tiny | 24 | 97 | 324 | de49899 |
| 12th Gen Intel(R) Core(TM) i9-12900KS | Microsoft Windows 11 Pro | AVX2 | base | 24 | 158 | 689 | de49899 |
| 12th Gen Intel(R) Core(TM) i9-12900KS | Microsoft Windows 11 Pro | AVX2 | small | 24 | 437 | 2384 | de49899 |
| 12th Gen Intel(R) Core(TM) i9-12900KS | Microsoft Windows 11 Pro | AVX2 | medium | 24 | 1301 | 8923 | de49899 |
| 12th Gen Intel(R) Core(TM) i9-12900KS | Microsoft Windows 11 Pro | AVX2 | large | 24 | 2540 | 16748 | de49899 |

Old version, 120 threads (99% utilization)

Running ggml_mul_mat benchmark with 120 threads

ggml_mul_mat:    64 x    64: F16      0.0 GFLOPS (  3 runs) / F32      0.0 GFLOPS (  3 runs)
ggml_mul_mat:   128 x   128: F16      0.0 GFLOPS (  3 runs) / F32      0.0 GFLOPS (  3 runs)
ggml_mul_mat:   256 x   256: F16      0.0 GFLOPS (  3 runs) / F32      0.0 GFLOPS (  3 runs)
ggml_mul_mat:   512 x   512: F16      0.1 GFLOPS (  3 runs) / F32      0.2 GFLOPS (  3 runs)
ggml_mul_mat:  1024 x  1024: F16      1.6 GFLOPS (  3 runs) / F32      1.2 GFLOPS (  3 runs)
ggml_mul_mat:  2048 x  2048: F16     12.3 GFLOPS (  3 runs) / F32     12.6 GFLOPS (  3 runs)
ggml_mul_mat:  4096 x  4096: F16     68.5 GFLOPS (  3 runs) / F32     50.5 GFLOPS (  3 runs)
| CPU | OS | Config | Model | Th | Load [ms] | Enc. [ms] | Commit |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 12th Gen Intel(R) Core(TM) i9-12900KS | Microsoft Windows 11 Pro | AVX2 | tiny | 120 | 96 | 78836 | fd83fb2 |
| 12th Gen Intel(R) Core(TM) i9-12900KS | Microsoft Windows 11 Pro | AVX2 | base | 120 | 157 | 113952 | fd83fb2 |

A while it took, indeed.

New version, 120 threads (90% utilization)

Running ggml_mul_mat benchmark with 120 threads

ggml_mul_mat:    64 x    64: F16      0.1 GFLOPS (106 runs) / F32      0.1 GFLOPS (106 runs)
ggml_mul_mat:   128 x   128: F16      0.4 GFLOPS (106 runs) / F32      0.4 GFLOPS (104 runs)
ggml_mul_mat:   256 x   256: F16      3.5 GFLOPS (106 runs) / F32      3.5 GFLOPS (104 runs)
ggml_mul_mat:   512 x   512: F16     25.1 GFLOPS ( 94 runs) / F32     25.7 GFLOPS ( 96 runs)
ggml_mul_mat:  1024 x  1024: F16    129.0 GFLOPS ( 61 runs) / F32    127.7 GFLOPS ( 60 runs)
ggml_mul_mat:  2048 x  2048: F16    248.5 GFLOPS ( 15 runs) / F32    179.5 GFLOPS ( 11 runs)
ggml_mul_mat:  4096 x  4096: F16    191.5 GFLOPS (  3 runs) / F32    121.3 GFLOPS (  3 runs)
| CPU | OS | Config | Model | Th | Load [ms] | Enc. [ms] | Commit |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 12th Gen Intel(R) Core(TM) i9-12900KS | Microsoft Windows 11 Pro | AVX2 | tiny | 120 | 98 | 583 | de49899 |
| 12th Gen Intel(R) Core(TM) i9-12900KS | Microsoft Windows 11 Pro | AVX2 | base | 120 | 158 | 972 | de49899 |
| 12th Gen Intel(R) Core(TM) i9-12900KS | Microsoft Windows 11 Pro | AVX2 | small | 120 | 435 | 2588 | de49899 |
| 12th Gen Intel(R) Core(TM) i9-12900KS | Microsoft Windows 11 Pro | AVX2 | medium | 120 | 1296 | 7457 | de49899 |
| 12th Gen Intel(R) Core(TM) i9-12900KS | Microsoft Windows 11 Pro | AVX2 | large | 120 | 2540 | 12715 | de49899 |

janekb04 avatar Mar 26 '23 20:03 janekb04

I also added a script that automatically runs all benchmarks on Windows. It is simply the shell script version converted into PowerShell.

janekb04 avatar Mar 26 '23 20:03 janekb04

And here's the pthread version. Now this should be mergeable, though as I wrote, I am planning further optimizations. macOS tables below.

janekb04 avatar Mar 26 '23 21:03 janekb04

Original version, 6 threads (75% utilization)

Running ggml_mul_mat benchmark with 6 threads

ggml_mul_mat:    64 x    64: F16      6.0 GFLOPS (128 runs) / F32      5.2 GFLOPS (128 runs)
ggml_mul_mat:   128 x   128: F16     66.8 GFLOPS (128 runs) / F32     42.1 GFLOPS (128 runs)
ggml_mul_mat:   256 x   256: F16    356.6 GFLOPS (128 runs) / F32    283.2 GFLOPS (128 runs)
ggml_mul_mat:   512 x   512: F16    459.2 GFLOPS (128 runs) / F32    530.5 GFLOPS (128 runs)
ggml_mul_mat:  1024 x  1024: F16    937.4 GFLOPS (128 runs) / F32   1379.4 GFLOPS (128 runs)
ggml_mul_mat:  2048 x  2048: F16   1217.2 GFLOPS ( 71 runs) / F32   1557.5 GFLOPS ( 91 runs)
ggml_mul_mat:  4096 x  4096: F16   1695.6 GFLOPS ( 13 runs) / F32   1431.4 GFLOPS ( 11 runs)
| CPU | OS | Config | Model | Th | Load [ms] | Enc. [ms] | Commit |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Apple M1 Pro | macOS 13.2.1 (22D68) | NEON BLAS | tiny | 6 | 49 | 106 | fd83fb2 |
| Apple M1 Pro | macOS 13.2.1 (22D68) | NEON BLAS | base | 6 | 64 | 196 | fd83fb2 |
| Apple M1 Pro | macOS 13.2.1 (22D68) | NEON BLAS | small | 6 | 178 | 674 | fd83fb2 |
| Apple M1 Pro | macOS 13.2.1 (22D68) | NEON BLAS | medium | 6 | 558 | 1940 | fd83fb2 |
| Apple M1 Pro | macOS 13.2.1 (22D68) | NEON BLAS | large | 6 | 1246 | 3547 | fd83fb2 |

Original version, 10 threads (90% utilization)

Running ggml_mul_mat benchmark with 10 threads

ggml_mul_mat:    64 x    64: F16      3.7 GFLOPS (128 runs) / F32      3.0 GFLOPS (128 runs)
ggml_mul_mat:   128 x   128: F16     35.1 GFLOPS (128 runs) / F32     19.5 GFLOPS (128 runs)
ggml_mul_mat:   256 x   256: F16    223.6 GFLOPS (128 runs) / F32    130.8 GFLOPS (128 runs)
ggml_mul_mat:   512 x   512: F16    597.1 GFLOPS (128 runs) / F32    574.2 GFLOPS (128 runs)
ggml_mul_mat:  1024 x  1024: F16    773.1 GFLOPS (128 runs) / F32    528.9 GFLOPS (128 runs)
ggml_mul_mat:  2048 x  2048: F16    599.6 GFLOPS ( 35 runs) / F32    485.6 GFLOPS ( 29 runs)
ggml_mul_mat:  4096 x  4096: F16   1005.8 GFLOPS (  8 runs) / F32    722.6 GFLOPS (  6 runs)
| CPU | OS | Config | Model | Th | Load [ms] | Enc. [ms] | Commit |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Apple M1 Pro | macOS 13.2.1 (22D68) | NEON BLAS | tiny | 10 | 46 | 131 | fd83fb2 |
| Apple M1 Pro | macOS 13.2.1 (22D68) | NEON BLAS | base | 10 | 65 | 286 | fd83fb2 |
| Apple M1 Pro | macOS 13.2.1 (22D68) | NEON BLAS | small | 10 | 180 | 1105 | fd83fb2 |
| Apple M1 Pro | macOS 13.2.1 (22D68) | NEON BLAS | medium | 10 | 526 | 3225 | fd83fb2 |
| Apple M1 Pro | macOS 13.2.1 (22D68) | NEON BLAS | large | 10 | 1237 | 5546 | fd83fb2 |

New version, 6 threads (45% utilization)

Running ggml_mul_mat benchmark with 6 threads

[deadlock of some kind, I'll have to look into this]
| CPU | OS | Config | Model | Th | Load [ms] | Enc. [ms] | Commit |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Apple M1 Pro | macOS 13.2.1 (22D68) | NEON BLAS | tiny | 6 | 49 | 121 | a6ee46f |
| Apple M1 Pro | macOS 13.2.1 (22D68) | NEON BLAS | base | 6 | 66 | 203 | a6ee46f |
| Apple M1 Pro | macOS 13.2.1 (22D68) | NEON BLAS | small | 6 | 175 | 667 | a6ee46f |
| Apple M1 Pro | macOS 13.2.1 (22D68) | NEON BLAS | medium | 6 | 507 | 1818 | a6ee46f |
| Apple M1 Pro | macOS 13.2.1 (22D68) | NEON BLAS | large | 6 | 1315 | 3249 | a6ee46f |

New version, 10 threads (50% utilization)

Running ggml_mul_mat benchmark with 10 threads

[deadlock of some kind, I'll have to look into this]
| CPU | OS | Config | Model | Th | Load [ms] | Enc. [ms] | Commit |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Apple M1 Pro | macOS 13.2.1 (22D68) | NEON BLAS | tiny | 10 | 42 | 129 | a6ee46f |
| Apple M1 Pro | macOS 13.2.1 (22D68) | NEON BLAS | base | 10 | 67 | 240 | a6ee46f |
| Apple M1 Pro | macOS 13.2.1 (22D68) | NEON BLAS | small | 10 | 176 | 744 | a6ee46f |
| Apple M1 Pro | macOS 13.2.1 (22D68) | NEON BLAS | medium | 10 | 545 | 1933 | a6ee46f |
| Apple M1 Pro | macOS 13.2.1 (22D68) | NEON BLAS | large | 10 | 1419 | 3347 | a6ee46f |

janekb04 avatar Mar 26 '23 21:03 janekb04

I didn't expect performance on macOS to be so good. Anyway, it appears that for both versions, running with 6 threads is a performance sweet spot. The cool thing is that the new version uses about 40% less CPU while being about 9% faster.

janekb04 avatar Mar 26 '23 21:03 janekb04

This isn't ready yet, as, for some reason, the ggml_mul_mat benchmark deadlocks now. I'll look into this.

janekb04 avatar Mar 26 '23 21:03 janekb04

I wasn't able to measure the energy impact because the Activity Monitor is useless in that regard.

janekb04 avatar Mar 26 '23 22:03 janekb04

I just saw that ggml.c is copy-pasted to llama.cpp. I'll see if it improves performance there.

janekb04 avatar Mar 26 '23 22:03 janekb04

@janekb04 This is very nice work! I've always suspected that the existing spin-lock approach is not optimal, but my attempts at adding a mutex and condition variables gave worse performance overall.

I haven't yet tested or looked at the proposed changes, but the reported results look promising. However, it is also important to measure the performance of the Decoder. It's different from the Encoder since there we don't rely on Accelerate's sgemm and there are high-frequency ggml_mul_mat calls on smaller matrices.

There is no existing benchmark for the Decoder, but you can simply run the transcription on some of the sample audio files and look at the times reported at the end.

I will take a more detailed look in the following days.

ggerganov avatar Mar 27 '23 05:03 ggerganov

> I just saw that ggml.c is copy-pasted to llama.cpp. I'll see if it improves performance there.

The ggml.c in llama.cpp has some new extra stuff added and I haven't yet synchronized it with whisper.cpp. You won't be able to copy paste the ggml.c from here into llama.cpp - they are incompatible atm.

I will fix this soon. For now, you can just re-apply your changes to the ggml.c in llama.cpp to see how the performance is there.

ggerganov avatar Mar 27 '23 05:03 ggerganov

Mutexes, and events that signal them, are the best choice in most cases. They usually aren't very fast or precise, but if the exact moment of wake-up isn't important they offer the best overall performance. However, mutex/event and spinlock aren't the only options; there is also sleep/yield.

While not directly related, here's some research I did and an example implementation of using sleep/yield in place of spinlocking in a case where latency and accuracy were the utmost priority (stable frame times were needed). At least on Windows it was accurate to the point where I couldn't measure below it (~10 μs / ~0.01 ms; even a call to QueryPerformanceCounter takes ~1 μs), which was far faster and more accurate than that use case actually needed, while using basically zero power and being orders of magnitude more efficient (so less power hungry) than the spinlock alternative.

I didn't really look into what the problem is here or whether this is applicable, but I'm dropping this info in case someone needs it, as this "third option" isn't as readily found by a simple Google search.
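For reference, a generic sketch of what such a spin-then-yield/sleep wait can look like in POSIX C (illustrative only, not the linked implementation; the spin budget and sleep interval are arbitrary):

#include <sched.h>
#include <stdatomic.h>
#include <time.h>

void wait_until_set(atomic_int * flag) {
    // spin briefly to catch the low-latency case
    for (int i = 0; i < 1000; ++i) { // arbitrary spin budget
        if (atomic_load(flag)) return;
    }
    // then yield and sleep in short intervals so the core can actually idle
    while (!atomic_load(flag)) {
        sched_yield();
        struct timespec ts = { 0, 100 * 1000 }; // 100 microseconds
        nanosleep(&ts, NULL);
    }
}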

anzz1 avatar Mar 30 '23 13:03 anzz1

> Mutexes and events that signal them are best for most cases ... However, mutex/event and spinlock aren't the only options, there is also sleep/yield.

I am currently developing a realtime await/async system for C++ that works just as described here. It has even better latency because a job switch there is on the order of tens of nanoseconds. However, it is at a rather early stage and very unstable. There are also existing systems that work like this (a few are in Boost: Asio, Coroutine, Fiber). Unfortunately, they are for C++, which allows for some nice syntactic sugar that wouldn't be possible in C (especially in mine, as I overload the co_await operator; it has a bloated metaprogramming implementation, but the user code looks very similar to Python or JavaScript). I don't know if it would be feasible to introduce that here.

As far as I understand the code, the current work scheduling is less than ideal. The main thread launches some N threads. Then, it creates the "compute graph". I assume that it is a DAG, with each node representing some computation, and that it is topologically sorted before the main for (int i = 0; i < cgraph->n_nodes; i++) loop. The loop goes through all the nodes sequentially. If the "compute graph" is indeed a sorted DAG, then here comes the optimization: instead of going "for node in graph: for task in node:", the tasks of independent nodes could run concurrently, with each node's tasks enqueued as soon as its dependencies finish. This would mean that, fundamentally, the code would work like this:

Main thread:

compute_graph G; // topologically sorted
multithreaded_queue<task> Q;
// Seed the queue with the tasks of every node that has no dependencies
// (assumes the topological sort places all dependency-free nodes first).
for (node& n : G) {
    // dependency_count is the number of incoming edges,
    // i.e. the number of unmet dependencies
    if (n.dependency_count.nonatomic_load() > 0)
        break;
    Q.batch_enqueue(n.tasks);
}
Q.start_working();
execute_work();
// cleanup
return [the result];

Worker threads execute the execute_work function:

Q.wait_for_start_working_blocking();
while (!Q.done()) {
    task to_do = Q.pop_blocking();
    execute(to_do);
    
    // if this was the last task for this node, the node has completed
    if(to_do.node.task_count.atomic_fetch_sub(1) == 1) {
        // so, all the node's dependents have one dependency less
        for (node& n : to_do.node.dependents) {
             // if the current node was the last dependency of this node
             // we can enqueue this node's tasks for execution
             if (n.dependency_count.atomic_fetch_sub(1) == 1) {
                 Q.batch_enqueue(n.tasks);
             }
        }
    }
}

This design should eliminate all the blocking and waiting and maximize the amount of time spent by the threads on executing useful work.

janekb04 avatar Apr 02 '23 13:04 janekb04

There are also a few minor things here and there. One I found is alloca, here:

struct ggml_compute_state * workers = n_threads > 1 ? alloca(sizeof(struct ggml_compute_state)*(n_threads - 1)) : NULL;  

Using alloca is confusing for the compiler: the function no longer has a stack frame with locals at deterministic offsets, so the compiler has to emit extra address computations that depend on the size of the allocated block.
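Not part of this PR, but for illustration, one common way to avoid the variable-size stack frame is a fixed-size fast path with a heap fallback (the struct name and constant below are stand-ins, not ggml's actual types):

#include <stdlib.h>

struct compute_state { int params; }; // stand-in for the real per-thread state

#define MAX_STACK_WORKERS 16 // arbitrary upper bound for the fast path

void dispatch(int n_threads) {
    struct compute_state stack_workers[MAX_STACK_WORKERS];
    struct compute_state * workers = NULL;

    if (n_threads > 1) {
        // small counts live on the stack at a fixed, compiler-friendly offset;
        // larger counts fall back to the heap
        workers = (n_threads - 1) <= MAX_STACK_WORKERS
            ? stack_workers
            : malloc(sizeof(struct compute_state) * (n_threads - 1));
    }

    // ... launch the (n_threads - 1) workers ...

    if (workers != NULL && workers != stack_workers) {
        free(workers);
    }
}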

janekb04 avatar Apr 02 '23 13:04 janekb04

Hi there -- do you think your original changes will still work with the updates backported from llama.cpp? It would be pretty cool to have two strong performance improvements in a row!

JKeddo95 avatar Apr 13 '23 15:04 JKeddo95

@JKeddo95 I took my time to read through the changes and pulled them in, and as far as I can tell, this PR is still valid.

janekb04 avatar Apr 17 '23 08:04 janekb04

To me this looks to be a very clean and well thought-out PR. I fully agree with the implementation provided and think this is absolutely the proper way to go.

There are multiple ways to implement locking, but the lightweight mutexes used here are the best option in most cases.

Spinlocks are rarely the right option, namely only when at least one of these conditions applies:

  • It's mission-critical to release a lock as fast as possible, where the when needs nanosecond precision
  • Thread context switching needs to be unequivocally disallowed
  • Locking/sleeping/spinning happens very often and for very short (< ~1 µs) periods, where the cost (execution time) of a context switch is larger than the time spent spinning

When even the CPU manufacturers themselves have advised, since at least 2011, against using spinlocks unless necessary, I'd take their word for it. But the beauty here is that we don't even have to take their word for it, since your performance tests confirm it. Maybe 15 years ago there was a point in time when not letting the CPU sleep, and doing things like disabling C-states in the BIOS, could improve performance, but that hasn't been the case for a long time. Processor tech has developed a lot since then, and nowadays processors perform better when you let them sleep and give them headroom to manage the work. Especially now that we're reaching the tail end of Moore's law, thermal and power limits are more of an issue than ever before; processors are pushing what's possible to the limit and absolutely do gain performance from the ability to 'breathe', that is, from decreasing the power use (and thus thermal output) of program code with methods like this, rather than wasting cycles by unnecessarily spinning and keeping the thread pegged at 100% usage.

Pros:

  • Uses lightweight mutexes which is the best locking option for this use-case (and most use-cases for that matter)
  • Uses the low-level locking mechanisms of POSIX threads and Windows critical sections
  • Doesn't use high-level abstractions like STL std::lock, which decrease performance by adding unnecessary code only to end up calling the same low-level functions anyway. Having less abstraction also means easier low-level debugging.
  • Does all this in a minimal amount of code
  • For this use case I don't see how this implementation could lose performance in any configuration; there should be only gains. The gains should scale up as thermal (= power) headroom shrinks, with inadequately cooled and power-hungry configurations gaining the most. An x86 laptop typically has both issues, so those stand to gain the most.

Cons:

  • None that I can think of

 

There's some more in-depth discussion about thread locking over at llama.cpp in this (now abandoned) PR: [llama.cpp] ggml: refactor compute thread: merge three spin variables into one #816

It's a long thread and not everything applies. For example, on second thought I no longer agree with the proposition I made there about adding an #ifdef option for the different lock/sleep strategies, as it would add unnecessary code complexity with no real advantage. I also made the point of being mindful about the cost of context switching, but that is clearly a non-issue here. To be clear, there's nothing to add from that discussion to this PR, but it has some good context and information for those who want to dig deeper.

All in all, presentation-wise this is one of the best PRs I've ever seen anywhere in terms of reasoning and testing. The well laid-out performance tests across multiple architectures and operating systems are just perfect, more than can reasonably be expected from anyone. In fact, I am bookmarking this PR as an example of how to make a perfect PR.

 

A few clarifications I would like to ask about, though:

> This is a draft because I haven't implemented the lock using pthreads yet, ...

You have now, so the PR description could be edited to reflect this?

> ... and the current Windows implementation is rather naive and suboptimal.

Can you elaborate on this? Looking through the code, it looks like you already have the most optimal solution. Regular mutexes are kernel objects, which require the thread to switch user mode -> kernel mode -> user mode whenever they are used, unlike critical sections, which can stay in user mode; afaik that is the reason they are so much faster. The CriticalSection/ConditionVariable paradigm is the fastest way to implement locking on Windows outside of spinlocks (which I wouldn't really call a locking mechanism anyway). The paradigm is widely used in low-level OS and kernel code, and my line of thinking usually goes: if something is good enough for low-level OS/kernel code, it's probably good, full stop.
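For readers unfamiliar with it, the Windows-side pattern being discussed looks roughly like this (a minimal sketch with hypothetical names, not the code from this PR):

#include <windows.h>

static CRITICAL_SECTION   g_cs;
static CONDITION_VARIABLE g_cv;
static int g_done = 0;

void sync_init(void) {
    InitializeCriticalSection(&g_cs);
    InitializeConditionVariable(&g_cv);
}

void wait_for_done(void) {
    // uncontended EnterCriticalSection stays in user mode
    EnterCriticalSection(&g_cs);
    while (!g_done) {
        // atomically releases the critical section and sleeps,
        // then reacquires it before returning
        SleepConditionVariableCS(&g_cv, &g_cs, INFINITE);
    }
    LeaveCriticalSection(&g_cs);
}

void signal_done(void) {
    EnterCriticalSection(&g_cs);
    g_done = 1;
    LeaveCriticalSection(&g_cs);
    WakeAllConditionVariable(&g_cv);
}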

> I am also yet to optimize the computations themselves.

As that is a different beast altogether, I'd say better to go ahead and merge this rather than keep it unnecessarily blocked waiting for further developments, as on its own this PR looks ready to go. In my opinion, more, smaller PRs are a better option than fewer, larger ones anyway, as they are easier to maintain and it's easier to revert small pieces if something goes wrong.

anzz1 avatar Apr 18 '23 20:04 anzz1

@anzz1 Thanks for the thoughtful evaluation. I updated the PR description.

Regarding the naïvety, I wrote that because I literally coded this off the top of my head, in one sitting. I just opened ggml.c, searched for "lock", and 2-3 hours later I was done. So I didn't really want to call it "optimal" or "complete", as it was something I hacked together quickly. Also, I thought it might be better to use a completely different scheduling approach, but that would indeed be a "beast" of a rework. So, yes, it looks to be mergeable.

janekb04 avatar Apr 19 '23 07:04 janekb04

@ggerganov @janekb04 What would this require to get pulled into master? Happy to take @janekb04's work and clean it up / test it, if that is all that is required.

nchudleigh avatar Aug 29 '23 02:08 nchudleigh

I tried this and it happens to be way slower in my setup:

CPU: 6-core AMD Ryzen 5 5560U with Radeon Graphics (-MT MCP-) speed/min/max: 2459/1600/4061 MHz
Kernel: 6.2.0-34-generic x86_64 Up: 47m Mem: 5465.5/12879.6 MiB (42.4%)
Storage: 465.76 GiB (5.9% used) Procs: 304 Shell: Bash inxi: 3.3.25

In my tests, processing a 60-second audio clip goes from around 75 s to 103-110 s (with 4 threads). Running top shows that CPU usage moves between 250% and 380%.

When running 6 threads, usage moves between 250% and 560%, and it takes 114 s.

The commit id in my logs shows 7a5a5fe86dfd9c3566b2c584a7553596bdae68ac.

Am I doing something wrong, or did this branch get outdated?

lilezek avatar Oct 12 '23 14:10 lilezek