whisper.cpp
Diminishing returns with increasing number of threads
It seems like 7 threads is a sweet spot, after which performance starts decreasing:

[graph: transcription time vs. number of threads]
Is this expected?
Latest build from the GitHub Workflows
Windows 21H2
AMD 3700X
How many CPUs do you have? It might be that once you are using all of your cores you start to lose performance from excess thread switching.
@j-f1, the Ryzen 3700X has 8 cores and 16 threads.
I know that performance drops off quickly past 4 threads on the 8-core/16-thread machine I am using.
@RYucel, as you can see from the graph above, there is still a benefit of ~250 ms from increasing the number of threads from 4 to 6. Anything higher is indeed pointless.
Yes, there must be some computational limitation, I guess. But anyway, the end result is very good.
@savchenko
Yes, I observe the same behaviour on the M1 Pro - 7 threads is the sweet spot. Thanks for pointing this out - I actually thought that 8 threads was best.
My explanation is that the computation becomes memory-bound at some point, so you stop gaining performance from additional CPU power. It's the memory bandwidth that limits us.
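As a rough, back-of-the-envelope illustration (the figures here are assumptions, not measurements from this thread): the medium model's weights are on the order of 1.5 GB, and a typical desktop with dual-channel DDR4 sustains roughly 40 GB/s. A decoding step that has to stream all of the weights from RAM therefore cannot take less than about 1.5 GB / 40 GB/s ≈ 37 ms, no matter how many cores are working - so once a few cores saturate the memory bus, extra threads mostly add synchronization overhead.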
I've been running some tests under Superluminal, and I believe I'm seeing some waste when running on multiple threads.
The way ggml works, it spawns new threads every time ggml_graph_compute is invoked, but in some cases in whisper.cpp this gets pretty bad, especially in whisper_decode. For example, here's what one of those invocations looks like in Superluminal:
[Superluminal screenshot: timeline of a single ggml worker thread]
The thread only lives for 2.7 ms (which is already worrying, as there are thousands of these threads being spawned), but of that time, only about 1 ms is spent on actual work. The rest is calls to atomic_load, or overhead from creating and destroying the thread.
It looks like trying to make these threads longer-lived and using a lighter synchronization mechanism should bring some nice perf gains here.
@jonvaldes
Thanks for this analysis!
I guess I will have to make the threads wait on a condition variable instead of joining them when ggml_graph_compute finishes.
Regarding the atomic_load - once the threads are started, I found that a busy loop on an atomic counter is much more efficient than waiting on and notifying a condition variable. It is probably more wasteful energy-wise, but since I am mainly interested in performance it was the better option. I think I can add a "low-power" mode where we use the standard condition-variable mechanism instead of busy loops. That would make the CPU go less crazy.
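For illustration, here is a minimal sketch of the two waiting strategies being discussed (this is not the actual ggml code - the names and structure are made up):

```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>

std::atomic<int> n_done{0}; // incremented by each worker when it finishes

// "Performance" mode: spin on the counter. Wake-up latency is minimal,
// but every waiting core burns 100% CPU in atomic loads.
void wait_spin(int n_workers) {
    while (n_done.load(std::memory_order_acquire) < n_workers) {
        // busy-wait
    }
}

// "Low-power" mode: sleep on a condition variable. The cores idle while
// waiting, but each wake-up pays for a mutex and a trip through the kernel.
std::mutex m;
std::condition_variable cv;

void wait_cv(int n_workers) {
    std::unique_lock<std::mutex> lock(m);
    cv.wait(lock, [&] { return n_done.load() >= n_workers; });
}

// In the low-power mode, each worker would pair its increment with a notify:
//   { std::lock_guard<std::mutex> lg(m); n_done.fetch_add(1); }
//   cv.notify_one();
```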
@savchenko Which model did you use and what was the duration of the audio segment used for testing?
@debasish-mihup, medium.en and "long enough to run for many minutes".
@ggerganov I profiled it with FlameGraph on my Linux host.

[FlameGraph: 8 threads]

With 8 threads, you can see that ggml_compute_forward_mul_mat only uses about 24.71% of the CPU time, while 72.53% (97.24 - 24.71) of the CPU time is wasted. I suspect this is why Metal doesn't work as expected - the matrix multiplication is not the bottleneck.
I'm not familiar with C++, but from the code I guess that decreasing the thread count can help reduce the busy-waiting time.
Here is the 4-thread FlameGraph: now ggml_compute_forward_mul_mat spends 63.21% of the CPU time doing actual work, and only 32.19% (95.4 - 63.21) is busy waiting.

[FlameGraph: 4 threads]
> The thread only lives for 2.7 ms (which is already worrying, as there are thousands of these threads being spawned), but of that time, only about 1 ms is spent on actual work.
Does the length of the input affect the quality of the output? Wouldn't it be more efficient to stop creating these micro-threads and instead split the input audio into several segments, letting each segment be munched on by a separate thread? (Of course, the results from the different threads would then have to be stitched together.)
These are times from my CPU (AMD Ryzen 5 3600, 6 cores / 12 threads) with different numbers of threads:

 1 thread:  779815.88 ms
 2 threads: 441046.56 ms
 4 threads: 277384.97 ms
 6 threads: 252671.91 ms
 8 threads: 236560.52 ms
10 threads: 214721.44 ms
11 threads: 203417.19 ms
12 threads: 208065.34 ms

Two parallel tasks (6 + 6 threads): 183298.86 ms
I suppose it should be possible to get much closer to the ideal time (779815.88 ms / 12 threads ≈ 64984 ms). It would just require finding the right places to cut the original audio without splitting any words. Actually, skipping the silent parts (an audio gate) would also help.
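For what it's worth, whisper.cpp already exposes an API along these lines: whisper_full_parallel splits the input into n_processors chunks, transcribes them on separate threads, and stitches the results back together. A minimal sketch (error handling omitted; the exact init function name varies between versions):

```cpp
#include "whisper.h"
#include <cstdio>
#include <vector>

// Transcribe 16 kHz mono float samples using 2 parallel processors,
// each running 6 threads (the "6+6" configuration measured above).
int transcribe(const std::vector<float> & pcmf32) {
    struct whisper_context * ctx = whisper_init_from_file("models/ggml-medium.en.bin");
    if (ctx == nullptr) return 1;

    whisper_full_params params = whisper_full_default_params(WHISPER_SAMPLING_GREEDY);
    params.n_threads = 6;

    // The audio is cut into 2 chunks that are processed concurrently.
    // Caveat: words straddling a cut point can still come out garbled.
    const int ret = whisper_full_parallel(ctx, params, pcmf32.data(), (int) pcmf32.size(), 2);

    for (int i = 0; i < whisper_full_n_segments(ctx); ++i) {
        printf("%s", whisper_full_get_segment_text(ctx, i));
    }

    whisper_free(ctx);
    return ret;
}
```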
I tried to eliminate thread creation/joining in https://github.com/ggerganov/whisper.cpp/pull/343, but performance did not improve. My hypothesis is that mutex locks are actually very expensive - more expensive than creating and joining threads. But I am not sure if that is correct.
I agree that there is a lot of performance to be gained in the decoder. ggml_graph_compute is called many, many times and there is significant overhead from these calls, but I don't know the best way to improve this yet.
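For intuition about that hypothesis, a toy micro-benchmark along these lines (illustrative only - this is not from PR #343) compares the cost of handing control between two threads with a spinning atomic versus a mutex + condition variable:

```cpp
// N ping-pong handoffs between two threads, timed with both mechanisms.
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <thread>

constexpr int N = 100000;

static long long bench_spin() {
    std::atomic<int> turn{0};
    const auto t0 = std::chrono::steady_clock::now();
    std::thread worker([&] {
        for (int i = 0; i < N; ++i) {
            while (turn.load(std::memory_order_acquire) != 1) {} // spin
            turn.store(0, std::memory_order_release);
        }
    });
    for (int i = 0; i < N; ++i) {
        turn.store(1, std::memory_order_release);
        while (turn.load(std::memory_order_acquire) != 0) {} // spin
    }
    worker.join();
    return std::chrono::duration_cast<std::chrono::milliseconds>(
               std::chrono::steady_clock::now() - t0).count();
}

static long long bench_mutex() {
    std::mutex m;
    std::condition_variable cv;
    int turn = 0;
    const auto t0 = std::chrono::steady_clock::now();
    std::thread worker([&] {
        for (int i = 0; i < N; ++i) {
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [&] { return turn == 1; });
            turn = 0;
            cv.notify_one();
        }
    });
    for (int i = 0; i < N; ++i) {
        std::unique_lock<std::mutex> lk(m);
        turn = 1;
        cv.notify_one();
        cv.wait(lk, [&] { return turn == 0; });
    }
    worker.join();
    return std::chrono::duration_cast<std::chrono::milliseconds>(
               std::chrono::steady_clock::now() - t0).count();
}

int main() {
    printf("atomic spin    : %lld ms\n", bench_spin());
    printf("mutex + condvar: %lld ms\n", bench_mutex());
}
```

On typical hardware the condition-variable version is considerably slower per handoff, which would be consistent with spinning being the faster (if hungrier) choice inside ggml_graph_compute.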
There's something very wrong with the multithreading support. I have a Ryzen 5950X (16 cores, 32 hardware threads). Setting n_threads = 16 gives inference times (2 trials performed): 5.22 s, 4.99 s. Setting n_threads = 32 gives inference times: 124.09 s, 196.79 s.
Something like 80% of the total computation time is spent in ggml_graph_compute_thread calling atomic_load.
I collected data on two of the many-core server systems in my lab, both aarch64. I used a Chinese audio file that is 73 seconds long, and tested the latest mainline build with a 5-bit quantized model (q5_1):
./main -t 1 -l chinese -d 73300 -bo 5 -m models/ggml-model-whisper-base-q5_1.bin -f Tencent-chinese.wav
The Huawei machine has 48 cores on an SoC, and the Ampere machine has 80 cores on an SoC. Neither has SMT. I ran a few different trials and took the best time for each thread count. The best time on the Huawei was with 13 threads; on the Ampere it was with 20 threads.
The Ampere machine has large private L2 caches; when we bind the threads so the OS doesn't schedule them all over the place, we retain hot caches (for data and locks), which leads to better CPU usage - although that only matters in the region on the right, after we have already hit the minimum around 16 threads. Using 80 threads is twice as slow as using 16. Maybe there just isn't enough work to stay efficient past 16 threads? Are there knobs to partition the work at a coarser granularity per thread?
for i in $(seq 1 80); do
    num=$((i - 1))
    time=$(perf stat taskset -c 0-$num ./main -l chinese -d 73300 -bo 5 \
        -m models/ggml-model-whisper-base-q5_1.bin -f Tencent-chinese.wav -t $i \
        |& grep 'seconds time' | awk '{print $1}')
    echo "$i,$time" >> stats.csv
done