
OpenBLAS Linux : more threads = less performance

SuperUserNameMan opened this issue 1 year ago • 4 comments

Hi,

My config with Intel Core i5 13600kf:

  • OS: Linux Mint 21.3, 64-bit, kernel 6.5.0
  • CPU: Intel Core i5 13600kf
    • P-cores: 6 (hyper-threaded)
    • E-cores: 8
  • RAM: 64 GB
  • SSD: NVMe
  • GPU: Nvidia RTX 4070 Ti Super (but not involved in the issue)
  • libopenblas-dev version 0.3.20 (edit: also tested with libopenblas64-dev)

llama.cpp version:

Today's freshly compiled source code.

The issue when compiled with OpenBLAS:

While benchmarking with both ./example/benchmark and ./example/main, I found that when llama.cpp is compiled with OpenBLAS, more threads = less performance (and higher power consumption, measured with a watt-meter).

./llama.cpp.openblas/benchmark -t %

[benchmark results chart]

Random guess: is it possible that OpenBLAS is already multi-threaded and that llama.cpp tries to multi-thread it again?
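A quick way to test this guess, assuming the build links a threaded (pthread or OpenMP) OpenBLAS: cap the BLAS pool with OpenBLAS's documented environment variables and re-run. A sketch:

# Force OpenBLAS's internal pool down to 1 thread while llama.cpp still spawns 6.
# If oversubscription is the culprit, scaling should come back.
OPENBLAS_NUM_THREADS=1 ./llama.cpp.openblas/benchmark -t 6

# For an OpenMP build of OpenBLAS, this variable applies instead:
OMP_NUM_THREADS=1 ./llama.cpp.openblas/benchmark -t 6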

Comparison with vanilla version:

For comparison, here is what I get with the vanilla (CPU-only) build:

./llama.cpp.CPU/benchmark -t %

[benchmark results chart]

Note that the decrease in performance after -t 6 is because the CPU, which has only 6 P-cores, starts scheduling threads onto its E-cores, which are way slower.

Comparison with cuBLAS version (Nvidia RTX 4070 Ti Super):

[benchmark results chart]

Note that, here again, the decrease in performance after -t 6 is because the CPU, which has only 6 P-cores, starts scheduling threads onto its E-cores, which are way slower.
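As a quicker check than disabling the E-cores in the BIOS, taskset can pin the benchmark to the P-cores only. A sketch; the logical-CPU range below is an assumption (P-cores usually enumerate first on these hybrid chips, verify with lscpu -e):

# Assumed layout: logical CPUs 0-11 are the 6 hyper-threaded P-cores.
taskset -c 0-11 ./llama.cpp.openblas/benchmark -t 6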

SuperUserNameMan avatar Feb 16 '24 13:02 SuperUserNameMan

Note: I've redone the tests with all the E-cores disabled, and the issue remains the same: more threads = less performance.

EDIT: and during the prompt eval time, OpenBLAS makes use of as many threads as there are cores reported by the OS, even with -t 1 (so OpenBLAS is multi-threaded by default).
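One way to confirm this from another terminal while the prompt eval is running (a sketch, reusing the binary path from above):

# nlwp = number of kernel threads in the newest matching process.
ps -o nlwp= -p "$(pgrep -n -f llama.cpp.openblas/benchmark)"

# Or watch per-thread CPU usage live:
top -H -p "$(pgrep -n -f llama.cpp.openblas/benchmark)"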

SuperUserNameMan avatar Feb 16 '24 14:02 SuperUserNameMan

I've redone the test using OpenBLAS-64 (instead of the 32-bit version), and the issue remains the same.

Below is another test made using my laptop CPU:

My config with AMD Ryzen 5 5500u:

  • OS: Linux Mint 21.2, 64-bit, kernel 6.5.0
  • CPU: AMD Ryzen 5 5500u (laptop CPU)
    • Cores: 6 (hyper-threaded)
  • RAM: 64 GB
  • SSD: NVMe
  • GPU: embedded AMD Radeon iGPU (not involved in this issue)
  • libopenblas64-dev version 0.3.20

llama.cpp version:

Today's freshly compiled source code.

The issue when compiled with OpenBLAS:

Same issue: more threads = less performance.

./llama.cpp.openblas/benchmark -t %

[benchmark results chart]

Comparison with vanilla version:

./llama.cpp.CPU/benchmark -t %

[benchmark results chart]

Note that the decrease in performance after -t 6 is because the CPU only has 6 cores and starts relying on hyper-threading for the extra threads.

SuperUserNameMan avatar Feb 18 '24 14:02 SuperUserNameMan

My config with Intel Core i3-12100f:

  • OS: Linux Mint 21.2, 64-bit, kernel 6.5.0
  • CPU: Intel Core i3-12100f
    • P-cores: 4 (hyper-threaded)
    • E-cores: 0
  • RAM: 16 GB
  • Storage: HDD
  • GPU: Nvidia RTX 4060 (but not involved in the issue)
  • libopenblas-dev version 0.3.20 (edit: also tested with libopenblas64-dev)

llama.cpp version:

Today's freshly compiled source code.

The issue when compiled with OpenBLAS:

Same as above: when llama.cpp is compiled with OpenBLAS, more threads = less performance.

./llama.cpp.openblas/benchmark -t %

[benchmark results chart]

Comparison with vanilla version:

./llama.cpp.CPU/benchmark -t %

[benchmark results chart]

Note that the decrease in performance after -t 4 is because the CPU, which has only 4 cores, starts relying on hyper-threading.

Comparison with cuBLAS version (Nvidia RTX 4060):

[benchmark results chart]

SuperUserNameMan avatar Feb 18 '24 14:02 SuperUserNameMan

This issue seems to greatly affect imatrix computations as well — although I didn't run the full computations and just went by the ETA. These numbers are from an EC2 c5ad.8xlarge instance which is allocated 16 cores/32 threads of an AMD EPYC 7R32. The programs were linked with libopenblas-dev from Ubuntu Server 22.04 (jammy).

openblas, --threads 32

compute_imatrix: tokenizing the input ..
compute_imatrix: tokenization took 10149.1 ms
compute_imatrix: computing over 50 chunks with batch_size 128
compute_imatrix: 31.88 seconds per pass - ETA 26.55 minutes

openblas, --threads 1

compute_imatrix: tokenizing the input ..
compute_imatrix: tokenization took 10191.4 ms
compute_imatrix: computing over 50 chunks with batch_size 128
compute_imatrix: 14.95 seconds per pass - ETA 12.45 minutes

no blas, --threads 32

compute_imatrix: tokenizing the input ..
compute_imatrix: tokenization took 9793.42 ms
compute_imatrix: computing over 50 chunks with batch_size 128
compute_imatrix: 5.62 seconds per pass - ETA 4.68 minutes
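
An untested follow-up that would isolate the same oversubscription theory here: keep --threads 32 but force OpenBLAS to a single thread. A sketch only; the model and calibration paths are placeholders:

# Hypothetical re-run of the slow case with the BLAS pool capped at 1.
OPENBLAS_NUM_THREADS=1 ./imatrix -m model.gguf -f calibration.txt --threads 32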

notwa avatar Feb 19 '24 03:02 notwa

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions[bot] avatar Apr 04 '24 01:04 github-actions[bot]

@ggerganov Should this be reopened?

oldgithubman avatar May 24 '24 07:05 oldgithubman

> @ggerganov Should this be reopened?

Yes, same issue with up-to-date source code (tested on my laptop's Ryzen 5 5500u CPU).

SuperUserNameMan avatar May 24 '24 18:05 SuperUserNameMan

Wondering if #293 is relevant here? The CPU backend has better performance than the BLAS backend does, as noted in PR #7077. This has been true on desktop, laptop, and mobile. The CPU consistently outperforms BLAS.

teleprint-me avatar May 24 '24 20:05 teleprint-me

> Wondering if #293 is relevant here? The CPU backend has better performance than the BLAS backend does, as noted in PR #7077. This has been true on desktop, laptop, and mobile. The CPU consistently outperforms BLAS.

Thanks. I did not notice that all the BLAS back-ends were planned to be removed. I think that's a good idea if we're sure BLAS offers no advantage whatsoever.

However, I'm intrigued: if OpenBLAS is misused, as my various tests seem to show, how could we be sure that it is actually outperformed by the CPU implementation?

Without digging seriously into the code, my assumption was that OpenBLAS is multi-threaded by default (edit: I reached this assumption by watching the CPU utilization graph in real time). So, in theory, it should show peak performance when ./benchmark is run with only -t 1.

But how many threads does OpenBLAS actually utilize by default? And how does this interact with the rest of the program, which will actually use a single thread because of -t 1?
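One empirical way to answer both questions without reading OpenBLAS's source: hold -t 1 fixed and sweep only the BLAS pool size. A sketch, reusing the binary path from the earlier runs:

# Vary OpenBLAS's thread cap while llama.cpp itself stays single-threaded.
for n in 1 2 4 6 12; do
  echo "== OPENBLAS_NUM_THREADS=$n =="
  OPENBLAS_NUM_THREADS=$n ./llama.cpp.openblas/benchmark -t 1
done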

So many questions and paradoxes 🤯

SuperUserNameMan avatar May 25 '24 09:05 SuperUserNameMan

> how could we be sure that it is actually outperformed by the CPU implementation?

Use llama-bench for comparisons. For details, see Intel's oneMKL.
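For instance, a minimal sweep might look like this (the model path is a placeholder; -t accepts a comma-separated list of thread counts):

# Runs the default prompt-processing and text-generation tests at each thread count.
./llama-bench -m model.gguf -t 1,2,4,6,12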

teleprint-me avatar May 25 '24 20:05 teleprint-me