OpenBLAS nested parallelism
Hi,
we are trying to run two instances of cblas_dgemm in parallel. If the total number of threads is 16, we would like each instance to run using 8 threads. Currently, we are using a structure like this:
#pragma omp parallel num_threads(2)
{
    if (omp_get_thread_num() == 0) {
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    bs1, bs2, bs3, alpha, pTmpA, bs3, pTmpB, bs2, beta, pTmpC, bs2);
    } else {
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    bs1, bs2, bs3, alpha, pTmpA2, bs3, pTmpB2, bs2, beta, pTmpC2, bs2);
    }
}
Here is the issue:
- At the top level, there are two OpenMP threads, each of which is active inside one of the if/else blocks. We expect those threads to call the cblas_dgemm functions in parallel, and inside those cblas_dgemm functions we expect new threads to be spawned.
- To set the number of threads internal to each cblas_dgemm, we set the corresponding environment variable: setenv OPENBLAS_NUM_THREADS 8. However, it doesn't seem to be working. If we measure the runtime of each of the parallel calls, the two runtimes are equal, but they equal the runtime of a single cblas_dgemm call when nested parallelism is not used and OPENBLAS_NUM_THREADS is set to 1.
What is going wrong, and how can we get the desired behavior? Is there any way we could know the number of threads used inside the cblas_dgemm function?
Thank you very much for your time and help
How big is your matrix? OpenBLAS will not use more than one thread if the product of the dimensions M, N and K is smaller than SMP_THRESHOLD_MIN*GEMM_MULTITHREAD_THRESHOLD (65535*4 = 256K by default).
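Roughly, the heuristic in interface/gemm.c looks like this (a simplified sketch, not the verbatim source):

/* simplified sketch of the single-thread fallback in interface/gemm.c */
double MNK = (double)args.m * (double)args.n * (double)args.k;
if (MNK <= (double)SMP_THRESHOLD_MIN * (double)GEMM_MULTITHREAD_THRESHOLD)
    args.nthreads = 1;                 /* too small to be worth threading */
else
    args.nthreads = num_cpu_avail(3);  /* otherwise use the available CPUs */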
Thank you very much @martin-frbg. The matrices are rather large (M = N = K = 1024 or above). So I don't think that is the issue.
I do not think there is a direct way to get the number of threads inside dgemm; you'd either need to look at your running program in a debugger, or instrument interface/gemm.c to print the args.nthreads it has decided to use. Which version of OpenBLAS, and what hardware and operating system are you using?
We are using OpenBLAS 0.3.5 on an AMD Opteron 6168, and the OS is Ubuntu 16.04 (Xenial). We have actually done the following: we modified cblas_dgemm.c inside the OpenBLAS directory to print out the number of threads at the very beginning of the function, using printf("%d\n", omp_get_num_threads()). Then we compiled the whole library and linked it to our code. We expected that calling cblas_dgemm would cause the number of its internal threads to be printed, but that didn't happen.
You can try the BLAS extension openblas_get_num_threads()
Thank you very much @martin-frbg. I made this change, but still nothing is printed out.
That is a bit suspicious - are you sure that your program actually loads OpenBLAS at runtime, and not something else (like the single-threaded reference BLAS from Netlib) through the "alternatives" mechanism of Ubuntu?
We explicitly provide the link to libopenblas.so. However, the source code we modified is from an OpenBLAS folder where the only cblas_dgemm.c is inside a folder called lapack-netlib, so that is suspicious, as you say. However, if we remove the nested parallelism structure and leave only one call to cblas_dgemm, and if we set the number of OpenBLAS threads to different values using the environment variable OPENBLAS_NUM_THREADS, then the resulting runtime is sensitive to the number of threads.
That's upstream (Netlib LAPACK) code that does not run in parallel. The cblas symbols are provided directly by OpenBLAS without an extra wrapper.
Try adding your printout in interface/gemm.c - this file gets compiled twice by the Makefile, once with -DCBLAS and once without, to give both cblas_dgemm and dgemm (as well as sgemm, cgemm, zgemm and their cblas counterparts, by (un)defining DOUBLE and COMPLEX as needed). The BLAS parts of lapack-netlib are not used in OpenBLAS; that directory is only included for LAPACK.
(Sorry for not spotting this last night)
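For instance, an illustrative printout near the end of CNAME in interface/gemm.c, after args.nthreads has been decided (the exact insertion point is an assumption on my part), could look like:

/* illustrative instrumentation only */
fprintf(stderr, "gemm: m=%ld n=%ld k=%ld nthreads=%ld\n",
        (long)args.m, (long)args.n, (long)args.k, (long)args.nthreads);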
Seeing OpenMP in your code - you need to build OpenBLAS with OpenMP support, and that "support" is quite rudimentary: it turns into single-threaded OpenBLAS computation inside your parallel sections.
Complementing what martin said - you can use ltrace to get a list of which functions got called from which libraries, or use perf record ./program ; perf report to find the ones using most of the CPU time.
A more pragmatic approach would be to build against the Netlib BLAS provided by Ubuntu, confirm that it works at all, then use alternatives to supplant that library with OpenBLAS.
Thank you very much @martin-frbg. I modified interface/gemm.c and put a print statement in each of the functions, but still nothing is printed out when I run my code. I suspect I may be doing the linking in a wrong way.
- We don't have root access to the system, so we cannot install the library after it is compiled
- In the parent folder (OpenBLAS-0.3.5), there is a file named libopenblas.so
- In the directory where our code resides, I make a new directory called newDIR and copy libopenblas.so into it.
- I compile our code using
gcc -O3 -fopenmp OURCODE.c -o OUTPUT.out -LnewDIR -lopenblas
Is there something wrong with how I am linking the library? I would really appreciate your help.
Thank you very much @brada4. I have a question: could you please explain a little more about this part?
you need to build OpenBLAS with OpenMP support, and that "support" is quite rudimentary: it turns into single-threaded OpenBLAS computation inside your parallel sections
Actually, I am compiling the code with the -fopenmp flag, and there are two threads at the outer level of the nested parallel section. Is that enough, or is there anything else I should do? I am asking because I read somewhere that OpenMP threads may conflict with OpenBLAS threads, and I suspect that is somehow related to the support you are talking about.
When you compile your code with "-lopenblas", this does not automatically ensure that exactly the same version of OpenBLAS will be loaded at runtime - there might be some other (and potentially older) version installed somewhere in the default library search paths of the system (like /lib, /usr/lib or /usr/local/lib).
Running ldd on your program should show which libopenblas gets loaded by default; setting the LD_LIBRARY_PATH environment variable to your directory should make it look there.
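For example (the paths are placeholders, and setenv matches the csh-style shell used above):

ldd ./OUTPUT.out | grep blas            # shows which BLAS library actually gets resolved
setenv LD_LIBRARY_PATH /path/to/newDIR  # make the dynamic linker look in your directory
./OUTPUT.out

Alternatively, the search path can be baked in at link time with an rpath, e.g. adding -Wl,-rpath,/path/to/newDIR to the gcc line.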
Namely, the following FAQ entries apply: https://github.com/xianyi/OpenBLAS/wiki/faq#debianlts https://github.com/xianyi/OpenBLAS/wiki/faq#wronglibrary
Thank you very much @martin-frbg. It worked, and now the number of threads is printed out. There is just one other issue: the first time I compiled and linked the library, there was a warning:
OpenBLAS Warning : Detect OpenMP Loop and this application may hang. Please rebuild the library with USE_OPENMP=1 option
However, the number of threads was then sensitive to the environment variable OPENBLAS_NUM_THREADS: by changing this variable, the number of threads that was printed out did vary.
After I recompiled the library using USE_OPENMP=1, there are no more warnings; but now, no matter how I set OPENBLAS_NUM_THREADS, the number of threads that is printed out is always 24 (the maximum number of threads in the system). Is there any way I can fix this problem? Thank you again.
Thank you very much @brada4
Probably thread safety has improved a lot since that warning was introduced, and nothing hangs these days. The detected thread number is not as important as the total runtime reduction.
Could be that it is always returning the value of OMP_NUM_THREADS now, unfortunately. You can try removing the "#ifndef USE_OPENMP" (and the matching #endif) around line 1952 of memory.c (this could be another bug related to my earlier mis-edit uncovered in #2002 - memory.c basically contains two versions of the thread setup code, so you will see two definitions of blas_get_cpu_number there). Despite the recent thread safety improvements, I do not think it is safe to mix OpenMP and non-OpenMP codes - the OpenMP management functions will not know anything about plain pthreads outside their control...
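From memory, the effect of that guard can be sketched like this (a simplified, standalone sketch with hypothetical helper names, not the actual memory.c code):

#include <stdio.h>
#include <stdlib.h>

/* sketch of why OPENBLAS_NUM_THREADS can end up ignored in an OpenMP build */
static int get_threads_sketch(void) {
    int n = 0;
#ifndef USE_OPENMP
    /* only the non-OpenMP build honors OPENBLAS_NUM_THREADS here */
    char *p = getenv("OPENBLAS_NUM_THREADS");
    if (p) n = atoi(p);
#endif
    if (n <= 0) {
        /* OpenMP build path: OMP_NUM_THREADS or the full CPU count wins */
        char *q = getenv("OMP_NUM_THREADS");
        n = q ? atoi(q) : 24; /* 24 standing in for "all detected CPUs" */
    }
    return n;
}

int main(void) { printf("%d\n", get_threads_sketch()); return 0; }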
Thank you very much @martin-frbg. I removed the "#ifndef USE_OPENMP" (and matching #endif) around line 1952 of memory.c, but it still doesn't work.
And there is another issue: inside interface/gemm.c, I have put two print statements:
- One at the beginning of the function CNAME. This one prints the number of threads returned by openblas_get_num_threads().
- Another one at the very end of the function CNAME. This one prints the value of args.nthreads.
If we remove the nested parallel structure and only call one instance of cblas_dgemm, both printed values are 24. However, if we use the nested parallel structure, the printf at the beginning of CNAME prints 24, but the one at the end of CNAME prints 1. What can be going wrong?
And here is our nested parallel structure (so that you don't have to go all the way up to the early posts):

#pragma omp parallel num_threads(2)
{
    if (omp_get_thread_num() == 0) {
        // First call, with first set of arguments
        cblas_dgemm();
    } else {
        // Second call, with second set of arguments
        cblas_dgemm();
    }
}
Thank you very much @brada4, but in our case we need to know the number of threads in each block. The other thing is that we are not getting any runtime improvement compared to the case where we call the two functions sequentially, which is really strange. So there may be something wrong with the thread distribution, and we need to figure that out.
You can count CPU usage with the "time" command - if user+system > total, then you are using multiple threads.
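For example:

time ./OUTPUT.out

If the reported user (plus sys) time is noticeably larger than the real (wall-clock) time, more than one thread was doing work.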
args.nthreads in interface/gemm.c should only become 1 when the product of the matrix dimensions is small; perhaps print args.m, args.n and args.k at that point as well, in case your code divides the workload unevenly between the two instances. (Print num_cpu_avail(3) too, just to be sure, though I do not think it could be 1.)
Thank you @martin-frbg. For our problem, args.m = args.n = args.k >= 512; that was verified after interface/gemm.c printed out these values.
However, the return value of num_cpu_avail(3) is printed out as 1. That is quite surprising, because there are 24 CPUs available in our system.
Thank you @brada4.
Following up on my previous comment: if we only call one instance of cblas_dgemm and remove the nested parallelism, then the output of num_cpu_avail(3) is 24. Therefore, the idea that the system might be in use by other programs cannot hold in this case.
Another thing that is somewhat surprising to me: if I use the following setting for CPU affinity:
setenv GOMP_CPU_AFFINITY "0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16"
then regardless of whether or not we are using nested parallelism, the return value of openblas_get_num_threads() at the beginning of CNAME, the value of args.nthreads at the end of CNAME, and the return value of num_cpu_avail(3) will all be 1. What can be the reason for all of this? Thank you again.
Are you certain you used the same OpenBLAS library for each test?
Yes, I am sure. There is only one OpenBLAS library, modified to print out the number of threads and the number of CPUs available, and I am using that one.
You can try omp_get_num_threads(); I think openblas_get_num_threads() just gets the number from there.
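A minimal sketch of that comparison (assuming an OpenBLAS built with USE_OPENMP=1 and a compile line like the gcc -fopenmp ... -lopenblas one above):

#include <stdio.h>
#include <omp.h>

extern int openblas_get_num_threads(void);  /* BLAS extension provided by OpenBLAS */

int main(void) {
    /* the OpenMP view of threads vs. OpenBLAS's view */
    printf("omp_get_max_threads()      = %d\n", omp_get_max_threads());
    printf("openblas_get_num_threads() = %d\n", openblas_get_num_threads());
    #pragma omp parallel num_threads(2)
    {
        /* inside a parallel region, omp_get_num_threads() reports the
           team size of this region (2), not the BLAS thread count */
        #pragma omp single
        printf("omp_get_num_threads() in region = %d\n", omp_get_num_threads());
    }
    return 0;
}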