OpenBLAS

How to control the number of threads of BLAS functions inside OMP parallel regions

Open bachml opened this issue 7 years ago • 7 comments

I am working on something like the following code:

omp_set_num_threads(nth);
#pragma omp parallel for private(g) schedule(static)
for(g = 0; g < size; g++) {
    cblas_sgemm( ..., matrix_A + g * offset , matrix_B + g * offset, ...);
}

Because matrix_A and matrix_B are small, a multithreaded sgemm function does not gain much speedup. So I want to use all threads of the processor in this pragma, with each thread computing a single sgemm call independently, i.e., 1 thread per sgemm call.

But experiments show that the performance is no different from when I use the multithreaded sgemm function without the pragma. I'm wondering how I can control the number of threads BLAS uses inside OMP parallel regions.

Thanks in advance.

bachml avatar May 31 '17 06:05 bachml

There is openblas_set_num_threads() but offhand I am not sure if it will do what you need. Perhaps the trivial solution in your case is to compile OpenBLAS itself single-threaded as the threading appears to be managed by the code that calls it.
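A minimal sketch of the openblas_set_num_threads() idea, for illustration only. It assumes square n x n row-major matrices stored back to back, and the helper names (small_sgemm, batched_small_sgemm) and the USE_OPENBLAS build switch are hypothetical, not part of OpenBLAS. The setting is global, so it is changed before the parallel region and restored afterwards; whether this gives the desired behaviour depends on the OpenBLAS build and version.

```c
#include <assert.h>
#include <stddef.h>

#ifdef USE_OPENBLAS              /* hypothetical switch: build with -DUSE_OPENBLAS */
#include <cblas.h>               /* cblas_sgemm, openblas_{get,set}_num_threads */
#endif

/* C = A * B for one pair of n x n row-major matrices. */
static void small_sgemm(int n, const float *a, const float *b, float *c)
{
#ifdef USE_OPENBLAS
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0f, a, n, b, n, 0.0f, c, n);
#else
    /* Naive stand-in so the sketch also builds without OpenBLAS. */
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            float sum = 0.0f;
            for (int k = 0; k < n; k++) {
                sum += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = sum;
        }
    }
#endif
}

/* Runs `batch` independent small GEMMs, one per loop iteration, so that
 * OpenMP (not the BLAS) decides how the work is spread over threads. */
void batched_small_sgemm(int batch, int n,
                         const float *a, const float *b, float *c)
{
    assert(batch >= 0 && n > 0);
    const size_t offset = (size_t)n * (size_t)n;

#ifdef USE_OPENBLAS
    const int saved = openblas_get_num_threads();
    openblas_set_num_threads(1);      /* global setting: one thread per call */
#endif

    #pragma omp parallel for schedule(static)
    for (int g = 0; g < batch; g++) {
        small_sgemm(n, a + g * offset, b + g * offset, c + g * offset);
    }

#ifdef USE_OPENBLAS
    openblas_set_num_threads(saved);  /* restore for the multithreaded parts */
#endif
}
```

Built with -fopenmp, the loop iterations are shared among the OpenMP threads; without it, the pragma is ignored and the loop runs serially. Because the thread count is a process-wide setting, this pattern is only safe if no other thread calls the BLAS while it is set to 1.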

martin-frbg avatar May 31 '17 08:05 martin-frbg

@martin-frbg
The trivial solution does not work, because multithreaded BLAS operations are still needed in other parts of the program.

P.S. Actually, what I am working on is implementing a Caffe version of MobileNet [1704.04861], which is based on a parallel depth-wise convolution implementation with a BLAS backend. That's why I need single-threaded sgemm for depth-wise convolution and multithreaded sgemm for other parts.

bachml avatar May 31 '17 09:05 bachml

I see. At least in theory, the xGEMM functions should refrain from using multiple threads if the matrix size is small, but this is governed by compile-time constants that may not be appropriate for your case. (Or perhaps they are, and you are already comparing two cases that have essentially the same single-threaded sgemm behaviour, with the bottleneck elsewhere?) Which version of OpenBLAS are you using, by the way, and on what platform? Hopefully others here will be able to provide more informed comments.

martin-frbg avatar May 31 '17 09:05 martin-frbg

OpenBLAS calls omp_get_num_threads() just once per program lifetime and has no instrumentation for OpenMP nesting. Unless you use nesting properly, the thread count will square, OpenBLAS's working data will be evicted from cache, and performance will fall even further behind.

You could patch num_cpu_avail to set its value to omp_get_num_threads() under an #ifdef and try nested calls.

Also, there are thresholds in interface/*c that reduce small problems to one thread; these may need improvement, if you care to share your calibration results.
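To make the "thread count will square" point concrete, here is a small sketch in caller-side C. Both helpers are hypothetical illustrations, not OpenBLAS functions and not a description of OpenBLAS internals; omp_in_parallel() and omp_get_max_threads() are standard OpenMP runtime calls.

```c
#include <assert.h>

#ifdef _OPENMP
#include <omp.h>
#endif

/* Worst-case number of running threads when each of `outer` OpenMP threads
 * calls a BLAS routine that itself starts `inner` threads. */
int oversubscribed_threads(int outer, int inner)
{
    assert(outer > 0 && inner > 0);
    return outer * inner;
}

/* How many threads a BLAS call issued from the current context could be
 * allowed to use without oversubscribing the machine. */
int blas_threads_for_caller(void)
{
#ifdef _OPENMP
    if (omp_in_parallel()) {
        return 1;                  /* already inside a parallel region */
    }
    return omp_get_max_threads();  /* serial context: whole team is free */
#else
    return 1;                      /* built without OpenMP */
#endif
}
```

On an 8-core machine, 8 outer threads each calling an 8-thread BLAS would give oversubscribed_threads(8, 8) == 64 threads competing for 8 cores, which is the situation the helper above is meant to avoid.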

brada4 avatar May 31 '17 22:05 brada4

I need finer control of the # of threads that each call to the BLAS uses. My packages are themselves multithreaded. Each of my threads can make its own calls to the BLAS, in parallel. Some of those calls will be for small matrices, or for other cases where I know I need to use one thread. Other calls will want to use, say, 2 threads and no more (because I'm using threads elsewhere). Sometimes I want to use all the threads available. However, I cannot set a global setting, such as with openblas_set_num_threads, since that would affect all of my calls to the BLAS. My packages themselves are used in other parallel packages. So I need a thread-local way to set the # of threads that OpenBLAS uses, much like this function: https://software.intel.com/en-us/mkl-developer-reference-c-mkl-set-num-threads-local so that I can exactly control how many threads OpenBLAS uses, for each call to the BLAS.

OpenBLAS cannot always assume it has all the threads available to it, since there are other things going on.

See also this discussion: https://github.com/DrTimothyAldenDavis/SuiteSparse/issues/1

Is this possible with OpenBLAS? Is there an OpenBLAS equivalent to mkl_set_num_threads_local?

DrTimothyAldenDavis avatar Feb 07 '20 23:02 DrTimothyAldenDavis

Could you open a new issue? Your request cannot be satisfied by traveling 3 years back in time to when this issue was closed. The Intel documentation you refer to mentions unpredictable behaviour.

brada4 avatar Feb 08 '20 01:02 brada4

OK ... thanks. I've posted this as a new issue.

DrTimothyAldenDavis avatar Feb 08 '20 02:02 DrTimothyAldenDavis

Finally closing here, as we never got to learn the version number the OP was using nor the matrix size for the "small" case. OpenBLAS should already have been using only one thread in the OpenMP-parallel region by design. The topic is/was continued in #2392.

martin-frbg avatar Jan 13 '24 18:01 martin-frbg