llama.cpp
OpenBLAS on Linux: more threads = less performance
Hi,
My config with Intel Core i5 13600kf:
- OS: Linux Mint 21.3, 64-bit, kernel 6.5.0
- CPU: Intel Core i5 13600kf
  - P-cores: 6 (hyper-threaded)
  - E-cores: 8
- RAM: 64 GB
- SSD: NVMe
- GPU: Nvidia RTX 4070 Ti Super (but not involved in the issue)
- libopenblas-dev version 0.3.20 (edit: also tested with libopenblas64-dev)
llama.cpp version: today's freshly compiled source code.
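For reference, a minimal sketch of how the two binaries compared below might be built; the exact build flags have changed between llama.cpp revisions, so treat the names here as assumptions rather than the commands actually used:

```bash
# Vanilla CPU-only build
make clean && make -j

# OpenBLAS build, assuming the classic Makefile flag; recent trees expose the
# same option through CMake instead, e.g.
#   cmake -B build -DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS
make clean && make -j LLAMA_OPENBLAS=1
```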
The issue when compiled with OpenBLAS:
While benchmarking with both ./example/benchmark and ./example/main, I found that when llama.cpp is compiled with OpenBLAS, more threads = less performance (and more power consumption, measured with a watt-meter).
./llama.cpp.openblas/benchmark -t %
Random guess: is it possible that OpenBLAS is already multi-threaded and that llama.cpp tries to multi-thread it again?
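One way to test that guess (a sketch, assuming the same benchmark binary as above) is to cap OpenBLAS's internal thread pool with its standard environment variable and see whether scaling with -t becomes normal again:

```bash
# Limit OpenBLAS itself to a single internal thread, so only llama.cpp's
# own -t threads remain. (If OpenBLAS was built against OpenMP,
# OMP_NUM_THREADS=1 may be needed instead, but note that this can also
# affect llama.cpp's own threading.)
export OPENBLAS_NUM_THREADS=1

for t in 1 2 4 6 8 12; do
    ./llama.cpp.openblas/benchmark -t $t
done
```

If the OpenBLAS build then scales like the vanilla build, the two thread pools stepping on each other would be the likely culprit.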
Comparison with the vanilla version:
For comparison, here is what I get with the vanilla (CPU-only) build:
./llama.cpp.CPU/benchmark -t %
Note that the decrease in performance after -t 6 is due to the fact that the CPU, which has only 6 P-cores, starts using its E-cores, which are way slower.
Comparison with cuBLAS version (Nvidia RTX 4070 Ti Super):
Note that, here again, the decrease in performance after -t 6 is due to the fact that the CPU, which has only 6 P-cores, starts using its E-cores, which are way slower.
Note: I've redone the tests with all the E-cores disabled, and the issue remains the same: more threads = less performance.
EDIT: during prompt eval, OpenBLAS uses as many threads as there are cores reported by the OS, even with -t 1 (so OpenBLAS is multi-threaded by default).
I've redone the test using OpenBLAS-64 (instead of the 32-bit version), and the issue remains the same.
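As a side note, the E-cores-disabled test above could probably also be reproduced without touching the BIOS by pinning the process to the P-cores; a rough sketch, assuming the 6 hyper-threaded P-cores of this CPU enumerate as CPUs 0-11 (worth verifying first):

```bash
# Check which logical CPUs map to P-cores vs E-cores.
lscpu --extended

# Run the benchmark restricted to the (assumed) P-cores only.
taskset -c 0-11 ./llama.cpp.openblas/benchmark -t 6
```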
Below is another test, made using my laptop CPU:
My config with AMD Ryzen 5 5500u:
- OS: Linux Mint 21.2, 64-bit, kernel 6.5.0
- CPU: AMD Ryzen 5 5500u (laptop CPU)
  - Cores: 6 (hyper-threaded)
- RAM: 64 GB
- SSD: NVMe
- GPU: embedded AMD Radeon iGPU (not involved in this issue)
- libopenblas64-dev version 0.3.20
llama.cpp version: today's freshly compiled source code.
The issue when compiled with OpenBLAS:
Same issue: more threads = less performance.
./llama.cpp.openblas/benchmark -t %
Comparison with the vanilla version:
./llama.cpp.CPU/benchmark -t %
Note that the decrease in performance after -t 6 is due to the fact that the CPU only has 6 cores and starts using hyper-threading for the additional threads.
My config with Intel Core i3-12100f:
- OS: Linux Mint 21.2, 64-bit, kernel 6.5.0
- CPU: Intel Core i3-12100f
  - P-cores: 4 (hyper-threaded)
  - E-cores: 0
- RAM: 16 GB
- Storage: HDD
- GPU: Nvidia RTX 4060 (but not involved in the issue)
- libopenblas-dev version 0.3.20 (edit: also tested with libopenblas64-dev)
llama.cpp version: today's freshly compiled source code.
The issue when compiled with OpenBLAS:
When llama.cpp is compiled with OpenBLAS: more threads = less performance.
./llama.cpp.openblas/benchmark -t %
Comparison with the vanilla version:
./llama.cpp.CPU/benchmark -t %
Note that the decrease in performance after -t 4 is due to the fact that the CPU, which has only 4 cores, starts using hyper-threading.
Comparison with cuBLAS version (Nvidia RTX 4060):
This issue seems to greatly affect imatrix computations as well, although I didn't run the full computations and just went by the ETA. These numbers are from an EC2 c5ad.8xlarge instance, which is allocated 16 cores / 32 threads of an AMD EPYC 7R32. The programs were linked with libopenblas-dev from Ubuntu Server 22.04 (jammy).
openblas, --threads 32
compute_imatrix: tokenizing the input ..
compute_imatrix: tokenization took 10149.1 ms
compute_imatrix: computing over 50 chunks with batch_size 128
compute_imatrix: 31.88 seconds per pass - ETA 26.55 minutes
openblas, --threads 1
compute_imatrix: tokenizing the input ..
compute_imatrix: tokenization took 10191.4 ms
compute_imatrix: computing over 50 chunks with batch_size 128
compute_imatrix: 14.95 seconds per pass - ETA 12.45 minutes
no blas, --threads 32
compute_imatrix: tokenizing the input ..
compute_imatrix: tokenization took 9793.42 ms
compute_imatrix: computing over 50 chunks with batch_size 128
compute_imatrix: 5.62 seconds per pass - ETA 4.68 minutes
This issue was closed because it has been inactive for 14 days since being marked as stale.
@ggerganov Should this be reopened?
Yes, same issue with up-to-date source code (tested on my laptop's Ryzen 5 5500u CPU).
Wondering if #293 is relevant here? The CPU backend has better performance than the BLAS backend, as noted in PR #7077. This has been true on desktop, laptop, and mobile; the CPU backend consistently outperforms BLAS.
Thanks. I did not notice that all the BLAS back-ends were planned to be removed. I think that's a good idea if we're sure BLAS offers no advantage whatsoever.
However, I'm intrigued: if OpenBLAS is misused, as my various tests seem to show, how can we be sure that it is actually outperformed by the CPU implementation?
Without digging seriously into the code, my assumption was that OpenBLAS is multi-threaded by default (edit: I watched the CPU utilization graph in real time to reach this assumption). So, in theory, it should show peak performance when ./benchmark is run with only -t 1.
But how many threads does OpenBLAS actually use by default?
And how does this interact with the rest of the program, which will actually use a single thread because of -t 1?
So many questions and paradoxes :exploding_head:
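For what it's worth, OpenBLAS's default is normally one thread per detected core (up to a build-time limit) unless OPENBLAS_NUM_THREADS is set, which matches the CPU-utilization observation above. A rough way to check on a live run, with paths and timings as assumptions:

```bash
# Start the OpenBLAS build with a single llama.cpp thread in the background.
./llama.cpp.openblas/benchmark -t 1 &
sleep 5
# Count the threads (LWPs) of that process while it runs; a number close to
# the core count would confirm OpenBLAS's own thread pool is active.
ps -o nlwp= -p $!
wait

# Repeat with OpenBLAS capped at one thread for comparison.
OPENBLAS_NUM_THREADS=1 ./llama.cpp.openblas/benchmark -t 1 &
sleep 5
ps -o nlwp= -p $!
wait
```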
how can we be sure that it is actually outperformed by the CPU implementation?
Use llama-bench for comparisons. For details, reference Intel's oneMKL.
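As a rough sketch of such a comparison (the model path and thread counts below are placeholders), the same llama-bench command can be run against a CPU-only build and an OpenBLAS build:

```bash
# -t takes a comma-separated list, so one run covers all thread counts.
# pp (prompt processing) is where BLAS typically kicks in; single-token
# tg (text generation) generally does not use it.
./llama-bench -m models/7B/ggml-model-q4_0.gguf -t 1,2,4,6,8 -p 512 -n 128
```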