
The performance of the OpenBLAS library decreases when running in multi-threaded environments

Open fanjisheng520 opened this issue 3 months ago • 6 comments

Hello. My C++ program runs on Ubuntu 24.04 and uses the OpenBLAS library. The function execution time is under 1 ms on a single thread, but it increases several-fold with multiple threads, and system call time also increases. The OpenBLAS library was installed with apt, and the default build settings are: OpenBLAS 0.3.26 NO_LAPACKE DYNAMIC_ARCH NO_AFFINITY Haswell MAX_THREADS=64

With a single thread the time is under 1 ms; with 8 threads it ranges from about 1 ms up to a maximum of 10 ms. The thread function contains a loop traversal operation. If I delete the following line of code, multi-threaded performance is similar to single-threaded:

```cpp
arma::cx_mat pmusic = ss * nnn_md * ss.ht();
```

Here `nnn_md` is a 15 x 15 complex matrix and `ss` is a 1 x 15 complex matrix. How can I solve this multi-threaded performance degradation?

```cpp
int main() {
    openblas_set_num_threads(1);
    int current_threads = openblas_get_num_threads();
    printf("current thread num is %d\n", current_threads);
    for (int i = 0; i < 1; ++i) {
        std::thread t1(processfunction);
        t1.detach();
    }
    while (1) {
    }
}
```

fanjisheng520 avatar Sep 28 '25 09:09 fanjisheng520

Ubuntu 24 is installed in a virtual machine with an 8-core CPU and 4GB of memory

fanjisheng520 avatar Sep 28 '25 09:09 fanjisheng520

How are you measuring CPU time, and which BLAS function(s) does your code call? I have no idea what the "following line of code" does in this context, but with problem sizes around 15x15 it seems entirely possible that single-threaded execution is actually faster on any current desktop CPU than incurring the overhead of setting up multiple threads and the memory buffers for them. (All the more so if the code calling OpenBLAS is multithreaded itself.)
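One way to rule out OpenBLAS's own thread pool as the source of overhead, without touching the code, is to cap its thread count from the environment before the process starts. A minimal sketch (`./your_program` is a placeholder for the actual binary name):

```shell
# Force OpenBLAS to run every BLAS call single-threaded,
# overriding whatever the library auto-detects at startup.
OPENBLAS_NUM_THREADS=1 ./your_program
```

`OPENBLAS_NUM_THREADS` is read once at library initialization, so it affects all BLAS calls in the process; `openblas_set_num_threads()` can still change the value later at runtime.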

martin-frbg avatar Sep 28 '25 10:09 martin-frbg

> How are you measuring CPU time, and which BLAS function(s) does your code call? I have no idea what the "following line of code" does in this context, but with problem sizes around 15x15 it seems entirely possible that single-threaded execution is actually faster on any current desktop CPU than incurring the overhead of setting up multiple threads and the memory buffers for them. (All the more so if the code calling OpenBLAS is multithreaded itself.)

I get the current timestamp with the GetTimeStampUs function, and CalcuAoaIqToAngleArma is the main calculation function in my thread. I do not call OpenBLAS directly; it is called indirectly through the Armadillo library.

I use `openblas_set_num_threads(1)` to set OpenBLAS to single-threaded mode, while the application layer of my program is multi-threaded.

```cpp
auto t1 = GetTimeStampUs();
if (!CalcuAoaIqToAngleArma(vecCmxData, rlt)) {
    printf("calc iq to angle error");
}
auto t2 = GetTimeStampUs();
```

where:

```cpp
int64_t GetTimeStampUs() {
    auto now = std::chrono::system_clock::now();
    auto microseconds_since_epoch =
        std::chrono::duration_cast<std::chrono::microseconds>(now.time_since_epoch());
    return microseconds_since_epoch.count();
}
```

@martin-frbg

fanjisheng520 avatar Sep 28 '25 11:09 fanjisheng520

> How are you measuring CPU time, and which BLAS function(s) does your code call? I have no idea what the "following line of code" does in this context, but with problem sizes around 15x15 it seems entirely possible that single-threaded execution is actually faster on any current desktop CPU than incurring the overhead of setting up multiple threads and the memory buffers for them. (All the more so if the code calling OpenBLAS is multithreaded itself.)

During function execution I set OpenBLAS to single-threaded mode using `openblas_set_num_threads(1)`. Then I start 8 application-layer computation threads using the C++ thread library.

Why is matrix calculation faster on a single thread than across multiple threads? In theory each thread runs independently and there are very few CPU context switches, so when calling the OpenBLAS library, the per-call computation time in a multi-threaded program should match the single-threaded case. Is there something I haven't considered?

fanjisheng520 avatar Sep 28 '25 11:09 fanjisheng520

Maybe related: numpy/numpy#29884 where using nthreads == ncores slows down np.matmul on 100x100 double matrices. This in turn uses OpenBLAS syrk or gemm routines. Limiting nthreads to ncores - 2 on a 24 core Ryzen machine makes OpenBLAS fast again. Perhaps the threadpool usage could be limited to ceil(0.95 * ncores).

mattip avatar Oct 09 '25 08:10 mattip

> Maybe related: numpy/numpy#29884 where using nthreads == ncores slows down np.matmul on 100x100 double matrices. This in turn uses OpenBLAS syrk or gemm routines. Limiting nthreads to ncores - 2 on a 24 core Ryzen machine makes OpenBLAS fast again. Perhaps the threadpool usage could be limited to ceil(0.95 * ncores).

Thank you for your reply. I use the Armadillo library, which calls OpenBLAS indirectly. A possible factor is that the thread function contains a two-level loop with approximately 500 iterations, consisting mainly of matrix multiplication. Even going from 1 thread to 2, the time consumption increases.

```cpp
arma::cx_mat pmusic = ss * nnn_md * ss.ht();
```

Here `nnn_md` is a 15 x 15 complex matrix and `ss` is a 1 x 15 complex matrix.

fanjisheng520 avatar Oct 09 '25 09:10 fanjisheng520