The performance of the OpenBLAS library decreases when running in multi-threaded environments
Hello. My C++ program runs on Ubuntu 24.04 and uses the OpenBLAS library. The function execution time is under 1 ms on a single thread, but it increases several-fold with multiple threads, and system-call time also increases. OpenBLAS was installed with apt, and the default configuration is:

OpenBLAS 0.3.26 NO_LAPACKE DYNAMIC_ARCH NO_AFFINITY Haswell MAX_THREADS=64
The single-thread time is under 1 ms; with 8 threads the same call takes up to about 10 ms and at least about 1 ms. The thread function contains a loop. If I delete the following line, multi-threaded performance is similar to single-threaded:

```cpp
arma::cx_mat pmusic = ss * nnn_md * ss.ht();
```

where nnn_md is a 15x15 complex matrix and ss is a 1x15 complex matrix. How can I solve this multi-threaded performance degradation?

```cpp
int main() {
    openblas_set_num_threads(1);
    int current_threads = openblas_get_num_threads();
    printf("current thread num is %d\n", current_threads);
    for (int i = 0; i < 8; ++i) {  // 8 application-level threads, as described above
        std::thread t1(processfunction);
        t1.detach();
    }
    while (1) {  // keep main alive for the detached threads
    }
}
```
Ubuntu 24.04 runs in a virtual machine with an 8-core CPU and 4 GB of memory.
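For reference, a minimal self-contained reproducer of the setup described above might look like the sketch below. The matrix shapes, the ~500-iteration loop, and the 8 worker threads are taken from this report; everything else (the random fill values, joining the threads instead of detach plus busy-wait) is an assumption for illustration, and it presumes the OpenBLAS-provided cblas.h for openblas_set_num_threads.

```cpp
#include <armadillo>
#include <cblas.h>   // openblas_set_num_threads (OpenBLAS extension)
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

// One worker: time ~500 evaluations of the 1x15 * 15x15 * 15x1 product.
static void processfunction()
{
    arma::cx_mat nnn_md(15, 15, arma::fill::randu);
    arma::cx_mat ss(1, 15, arma::fill::randu);

    auto t1 = std::chrono::steady_clock::now();
    for (int i = 0; i < 500; ++i) {
        arma::cx_mat pmusic = ss * nnn_md * ss.ht();  // 1x1 result
        (void)pmusic;
    }
    auto t2 = std::chrono::steady_clock::now();
    auto us = std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count();
    printf("worker took %lld us\n", (long long)us);
}

int main()
{
    openblas_set_num_threads(1);  // keep OpenBLAS itself single-threaded

    std::vector<std::thread> workers;
    for (int i = 0; i < 8; ++i)   // 8 application-level threads
        workers.emplace_back(processfunction);
    for (auto& t : workers)
        t.join();
}
```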
How are you measuring CPU time, and which BLAS function(s) does your code call? I have no idea what the "following line of code" does in this context, but with problem sizes around 15x15 it seems entirely possible that single-threaded execution is actually faster on any current desktop CPU than incurring the overhead of setting up multiple threads and the memory buffers for them. (All the more so if the code calling OpenBLAS is itself multithreaded.)
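One way to answer the measurement question: sample the calling thread's CPU clock next to the wall clock. If wall time balloons under 8 threads while per-thread CPU time stays near the single-thread value, the threads are waiting (scheduler, locks) rather than doing more work. A sketch, assuming POSIX clock_gettime is available (it is on Ubuntu):

```cpp
#include <chrono>
#include <cstdio>
#include <time.h>   // clock_gettime, CLOCK_THREAD_CPUTIME_ID (POSIX)

// CPU time consumed by the calling thread, in microseconds.
static long long ThreadCpuUs()
{
    timespec ts;
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts);
    return (long long)ts.tv_sec * 1000000LL + ts.tv_nsec / 1000;
}

int main()
{
    auto wall0 = std::chrono::steady_clock::now();
    long long cpu0 = ThreadCpuUs();

    // ... call the function under test here ...

    long long cpuUs = ThreadCpuUs() - cpu0;
    long long wallUs = std::chrono::duration_cast<std::chrono::microseconds>(
                           std::chrono::steady_clock::now() - wall0)
                           .count();
    printf("wall %lld us, thread cpu %lld us\n", wallUs, cpuUs);
}
```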
I get the current timestamp with the GetTimeStampUs function, and CalcuAoaIqToAngleArma is the main calculation function in my thread. I do not call OpenBLAS directly; it is called indirectly through the Armadillo library.
I use openblas_set_num_threads(1) to keep OpenBLAS single-threaded, while the application layer of my program is multi-threaded.
```cpp
auto t1 = GetTimeStampUs();
if (!CalcuAoaIqToAngleArma(vecCmxData, rlt))
{
    printf("calc iq to angle error");
}
auto t2 = GetTimeStampUs();
```

where

```cpp
int64_t GetTimeStampUs()
{
    auto now = std::chrono::system_clock::now();
    auto microseconds_since_epoch =
        std::chrono::duration_cast<std::chrono::microseconds>(now.time_since_epoch());
    return microseconds_since_epoch.count();
}
```

@martin-frbg
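As an aside, std::chrono::system_clock can be stepped by NTP and is not guaranteed monotonic, so it is a risky base for interval timing; steady_clock is the usual choice. A drop-in variant of the helper above (the epoch is arbitrary, so only differences t2 - t1 are meaningful):

```cpp
#include <chrono>
#include <cstdint>

// Monotonic timestamp in microseconds; steady_clock never jumps backwards,
// so t2 - t1 is always a valid interval.
int64_t GetTimeStampUs()
{
    auto now = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(
               now.time_since_epoch())
        .count();
}
```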
During function execution, I set OpenBLAS to single-threaded internally with openblas_set_num_threads(1). I then start 8 application-layer computation threads with the C++ thread library.
Why is the matrix calculation faster on a single thread than with multiple threads? In theory, each thread runs independently and there are very few CPU context switches, so with OpenBLAS the per-call computation time under multiple threads should match the single-threaded time. Is there something I have not considered?
Maybe related: numpy/numpy#29884, where using nthreads == ncores slows down np.matmul on 100x100 double matrices. This in turn uses the OpenBLAS syrk or gemm routines. Limiting nthreads to ncores - 2 on a 24-core Ryzen machine makes OpenBLAS fast again. Perhaps the threadpool usage could be limited to ceil(0.95 * ncores).
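For completeness, capping the OpenBLAS pool below the core count at startup (the ncores - 2 workaround above) can be done from C++ as in the sketch below. This assumes the OpenBLAS-provided cblas.h, and note it addresses the oversubscription case from the numpy report rather than this issue's num_threads=1 setup.

```cpp
#include <cblas.h>   // openblas_set_num_threads (OpenBLAS extension)
#include <thread>

int main()
{
    // hardware_concurrency() may return 0 if unknown; fall back to 1.
    unsigned ncores = std::thread::hardware_concurrency();
    int cap = (ncores > 2) ? (int)(ncores - 2) : 1;  // leave headroom for the OS
    openblas_set_num_threads(cap);
}
```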
Thank you for your reply. I use the Armadillo library, which calls OpenBLAS indirectly. A possible factor is that the thread function contains a two-level nested loop with roughly 500 iterations.
The loop body is mostly matrix multiplication. Even going from 1 thread to 2 increases the time per call.
```cpp
arma::cx_mat pmusic = ss * nnn_md * ss.ht();
```
where nnn_md is a 15x15 complex matrix and ss is a 1x15 complex matrix.
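One diagnostic worth trying (a sketch under the assumption that the shapes are exactly as stated): since ss * nnn_md * ss.ht() is a 1x1 quadratic form, it can be computed with plain loops, bypassing BLAS entirely. If this version scales cleanly across 8 threads while the Armadillo/BLAS version does not, the contention is in the BLAS call path (locking, buffer setup, allocator) rather than in the arithmetic itself.

```cpp
#include <armadillo>
#include <complex>

// ss * nnn_md * ss.ht() for a 1x15 row vector ss and a 15x15 matrix nnn_md,
// written as an explicit double loop so no BLAS routine is invoked.
static std::complex<double> QuadraticForm(const arma::cx_mat& ss,
                                          const arma::cx_mat& nnn_md)
{
    const arma::uword n = ss.n_cols;  // 15
    std::complex<double> acc(0.0, 0.0);
    for (arma::uword i = 0; i < n; ++i)
        for (arma::uword j = 0; j < n; ++j)
            acc += ss(0, i) * nnn_md(i, j) * std::conj(ss(0, j));
    return acc;
}
```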