OpenBLAS
Multithreaded DGBMV degrades performance on AMD
I have a driver cpp file that calls the cblas_dgbmv function with proper arguments. When I build OpenBLAS with "make", dgbmv automatically runs with 8 threads (the multithreaded dgbmv is invoked in the gbmv.c interface, and I assume this is the default behaviour). On the other hand, when I set OPENBLAS_NUM_THREADS=1 after this build, the sequential version runs and everything goes well. All good so far.
The problem is that I would like to assess the performance of the multithreaded cblas_dgbmv at different thread counts, using a loop that calls this function 1000 times serially and measuring the total time. My driver is sequential. However, even 2-threaded dgbmv degrades the performance (execution time), and the same is true for a single multithreaded call without the loop.
I have read up on multithreaded use of OpenBLAS and made sure everything conforms to the specifications. There is no thread spawning and there are no pragma directives in my driver (it runs only a master thread, just to measure wall-clock time). In other words, I call DGBMV from a sequential region, so it does not conflict with OpenBLAS's threads. However, I suspect that excessive threads are running and therefore execution slows down, although I have already set all environment variables related to thread counts, except OPENBLAS_NUM_THREADS, to 1.
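As a side note, I understand the thread count can also be pinned from inside the driver instead of only via the environment; a minimal sketch, assuming the openblas_set_num_threads helper declared in OpenBLAS's cblas.h:

#include <cblas.h>   // OpenBLAS header; declares openblas_set_num_threads

int main() {
    openblas_set_num_threads(1);  // request a single BLAS thread at runtime
    // ... set up the band matrix and call cblas_dgbmv here ...
    return 0;
}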
I use the OpenMP wall-clock timer and measure the execution time with code surrounding only this 1000-iteration caller loop, so that part should be fine as well:
double seconds, timing = 0.0;
//for(int i = 0; i < 10000; i++){
seconds = omp_get_wtime();
cblas_dgbmv(CblasColMajor, CblasNoTrans, n, n, kl, ku, alpha, B, lda, X, incx, beta, Y, incy);
timing += omp_get_wtime() - seconds;
//}
I run my driver with the proper environment variable set at runtime (OPENBLAS_NUM_THREADS=4 ./myBinary args...). Here is my Makefile, which compiles both the library and the application:
myBinary: myBinary.cpp
	cd ./xianyi-OpenBLAS-0b678b1 && make USE_THREAD=1 USE_OPENMP=0 NUM_THREADS=4 && make PREFIX=/home/selin/HPC-Research/xianyi-OpenBLAS-0b678b1 install
	g++ myBinary.cpp -o myBinary -I/home/selin/HPC-Research/xianyi-OpenBLAS-0b678b1/include/ -L/home/selin/HPC-Research/xianyi-OpenBLAS-0b678b1/lib -Wl,-rpath,/home/selin/HPC-Research/xianyi-OpenBLAS-0b678b1/lib -lopenblas -fopenmp -lstdc++fs -std=c++17
Architecture: 64-core shared-memory machine with AMD Opteron processors.
I would be more than happy if anyone could explain what goes wrong with the multithreaded version of dgbmv.
Not sure I understand you correctly - by "only two threads degrades the performance for a single multithreaded run" do you mean performance of other threads is impacted, or just that in your benchmark, DGBMV with 2 threads is already slower than DGBMV in a single thread ? (Also how big is your matrix for the test ?)
Thanks for your attention! I actually meant the latter: DGBMV with 2 threads is already slower than DGBMV on a single thread (in other words, the sequential run). Also, the input matrix I pass to DGBMV is a sparse matrix of dimension 17 million x 17 million. It might somehow be related to my architecture with NUMA nodes and AMD processors.
NUMA could certainly be a problem. Perhaps it would also make sense to try a simple driver without OpenMP to exclude any other overhead or contention ? I notice that the benchmark directory does not contain sample code for ?gbmv, and the gbmv implementation is from the original GotoBLAS code with a few later bugfixes so unsure how well optimized this is.
You need to set some threshold so that threads are not spun up for inputs smaller than it; see e.g. gemm.c in the interface directory for an extensive example. The point is to weigh the cost of scheduling a thread against the actual computation done in it. It is wrong to spin up threads for microscopic samples.
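Roughly, the gating looks like the sketch below (a paraphrase with an illustrative function name and threshold constant, not the exact OpenBLAS source):

#include <algorithm>

// Paraphrased sketch of size-based thread gating in the spirit of interface/gemm.c;
// the threshold value and names here are illustrative only.
int choose_nthreads(long long m, long long n, int max_threads) {
    const double kMultithreadThreshold = 1 << 16;        // illustrative tuning constant
    double work = static_cast<double>(m) * static_cast<double>(n);
    if (work <= kMultithreadThreshold)
        return 1;                      // too little work: thread startup would dominate
    return std::max(1, max_threads);   // enough work to amortize the scheduling overhead
}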
Perhaps it would also make sense to try a simple driver without OpenMP to exclude any other overhead or contention ?
I have adjusted my driver so that it does not include or link OpenMP at all, but in that case I had to use the C++ clock to measure execution time instead of wall-clock time. Apart from that, even without measuring time at all, a single call to the multithreaded dgbmv did not yield any scalability either; it clearly takes longer than the sequential run.
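For reference, wall-clock timing is still possible without OpenMP; a minimal sketch using std::chrono::steady_clock (note that std::clock reports CPU time summed over all threads, which inflates the numbers for a multithreaded library):

#include <chrono>
#include <cstdio>

int main() {
    auto t0 = std::chrono::steady_clock::now();
    // cblas_dgbmv(...);   // the call being measured goes here
    auto t1 = std::chrono::steady_clock::now();
    double seconds = std::chrono::duration<double>(t1 - t0).count();
    std::printf("elapsed wall time: %f s\n", seconds);
    return 0;
}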
You need to set some threshold so that threads are not spun up for inputs smaller than it; see e.g. gemm.c in the interface directory for an extensive example. The point is to weigh the cost of scheduling a thread against the actual computation done in it. It is wrong to spin up threads for microscopic samples.
Pardon me, I'm confused about what you mean by small inputs. My test matrices have dimensions like 17 million x 17 million, so they are not small at all. In that case, is "spinning" threads still something I need to worry about? Besides, how can I display running or waiting threads at run time, as you suggested? The most relevant function I could find in an OpenBLAS header is one that reports the maximum thread count set by the environment variable.
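For completeness, the reporting helpers I could find are the ones below; a minimal sketch assuming the openblas_get_config, openblas_get_num_procs and openblas_get_num_threads entry points from cblas.h (they report configuration, not live thread states):

#include <cstdio>
#include <cblas.h>   // OpenBLAS header; declares the openblas_* query helpers

int main() {
    // These print OpenBLAS's build configuration, the detected core count,
    // and the configured thread count; they do not show which threads are
    // currently running or sleeping.
    std::printf("config : %s\n", openblas_get_config());
    std::printf("procs  : %d\n", openblas_get_num_procs());
    std::printf("threads: %d\n", openblas_get_num_threads());
    return 0;
}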
There is some small latency communicating initial tasks between threads, certainly not affecting your experiment.
I am trying to recreate test case from your description.
Since your N values are close to the int32 limit, another path to investigate is whether something overflows somewhere inside OpenBLAS; disregard this if you build with INTERFACE64=1.
There is some small latency communicating initial tasks between threads, certainly not affecting your experiment.
I am trying to recreate test case from your description.
Since your N values are close to the int32 limit, another path to investigate is whether something overflows somewhere inside OpenBLAS; disregard this if you build with INTERFACE64=1.
Yes, I do build with INTERFACE64=1.
If you had the chance to reproduce the scenario, were there any problems with multithreaded OpenBLAS?
You say a 2 PB matrix is in use. :-S What is N in your sample code? The content of the matrices is not important; it is a fixed-time, unconditional computation of N^2 complexity. I have tried sizes up to 16 GB and saw no fault there.
On my system, the multithreaded OpenBLAS threads keep alternating between sleeping and running (whereas in my own OpenMP parallel programs all threads are always in the running state), and their CPU utilization is low, around 60%. I wonder why this happens; it is unusual, and maybe this is the bottleneck.
In my sample code, the input matrix is square with N = 500,000 and about 17 million non-zeros. However, DGBMV already takes a LAPACK-style band-storage array of roughly 2*bandwidth x N, in my case 2549 x 500,000 = 1,274,500,000 doubles, which is about 10 GB.
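Just to make the arithmetic explicit, here is a small sketch of where the ~10 GB figure comes from (assuming kl = ku = 1274, so that the band storage needs kl + ku + 1 = 2549 rows):

#include <cstdio>

int main() {
    // Reconstruction of the storage size quoted above; kl and ku are assumed values.
    long long n   = 500000;            // matrix dimension
    long long kl  = 1274, ku = 1274;   // assumed lower/upper bandwidths
    long long lda = kl + ku + 1;       // 2549 rows in band storage
    long long elems = lda * n;                     // 1,274,500,000 doubles
    double gigabytes = elems * 8.0 / 1e9;          // about 10.2 GB
    std::printf("%lld doubles, about %.1f GB\n", elems, gigabytes);
    return 0;
}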
What is your architecture again? Do you observe any speedup with multithreaded calls?
Does running your sample code under perf record (and analyzing the result with perf report) show anything suspicious among the top CPU-hogging functions? I've run a quick check based on the dblat2 test (reducing dblat2.dat to just DGBMV and increasing the matrix size) and saw no appreciable speedup from multithreading, but perf shows everything dominated by the time spent setting up the banded matrix.
E.g. run perf record/report in different directories, then compare the results to see whether, in the multithreaded case, something unrelated to the actual compute, such as a mutex wait, takes an anomalous surplus of time. Locks are only supposed to protect small atomic updates, while the real compute threads should have mutually independent outputs that do not require any locking during the heavy lifting by the CPU. Could you also post the first 10 to 20 lines of each "perf report"? More eyes/crystal balls are always better.