OpenBLAS
                                
Functions are not guarded against early threading
Not an exhaustive list, just those used in practice; this sort of measurement takes some time.
- Discovered in #1882 - dgetrf, zherk, dsyrk
- From #1883 - axpy
- From #1897 (mxnet) - cblas_?syrk, _potri_
- From https://github.com/JuliaRobotics/RigidBodyDynamics.jl/issues/500 - trsm
Probably worth turning this into a task list, sorted by absent guard vs. suspected wrong guard.
Based on preliminary benchmarks it may make sense to double the multithreading thresholds in interface/trsm.c for the two COMPLEX cases, and to raise them to four or five times the current limit (of 2*GEMM_MULTITHREAD_THRESHOLD) for float and double.
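Purely as a hedged illustration for anyone not familiar with the interface files, the guards in question have roughly this shape; the macro value and the exact comparison below are placeholders, not a copy of interface/trsm.c:

```c
/* Rough sketch only; identifiers mirror the discussion above, but this is
 * not the actual OpenBLAS source. */
#ifndef GEMM_MULTITHREAD_THRESHOLD
#define GEMM_MULTITHREAD_THRESHOLD 4   /* placeholder; a build-time tunable in OpenBLAS */
#endif

static int trsm_thread_count(long m, long n, int max_threads)
{
    /* Current limit discussed above: stay on one core below roughly
     * 2*GEMM_MULTITHREAD_THRESHOLD; the proposal is to about double this
     * for the two COMPLEX cases and raise it 4-5x for float/double. */
    if ((double)m * (double)n < 2.0 * (double)GEMM_MULTITHREAD_THRESHOLD)
        return 1;
    return max_threads;                /* otherwise fan out to all available CPUs */
}
```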
https://github.com/xianyi/OpenBLAS/issues/1989#issuecomment-460371387 has a test case for dsymv via LAPACK; there is no performance regression, but no thread threshold either.
#1115 is relevant for dsyrk (and suggests that level3 unrolling is an important factor besides the multithreading thresholds or lack thereof in the interface)
Any progress on this? I am trying to get a grip on https://github.com/DrTimothyAldenDavis/SuiteSparse/issues/1 to decide whether it is worth packaging Octave with sparse matrix computations via SuiteSparse and OpenBLAS in pkgsrc. It's quite a bummer to have the prominent notice not to use OpenBLAS for SuiteSparse.
For now I do not even know how real the threat of abysmal performance is (factor 1000?!). Uncertainty, doubt. It's an unfortunate situation that doesn't seem to see progress.
I'm just now experimenting with a few things (#3252, which is a bit broken on arm64 right now).
@drhpc the initial idea is fairly simple: draw a line in the sand where computation switches from one CPU to all CPUs. It might be slightly dated now that CPUs come with more and more cores.
cholmod manages threads via (G|I)OMP, and calls these:
    liblapack.so.3 => zpotrf_,dpotrf_
    libopenblas_pthreads.so.0 => dtrsm_,ztrsm_,dgemm_,dtrsv_,ztrsv_,zgemv_,zherk_,dgemv_,dsyrk_,zgemm_
So you actually need some formula or table to use 1, 2, 4, … $many cores depending on data size? Lots of heuristics have to go into that :-/ Right now we still have 16 cores per node in the cluster, but soon that will probably be 128. Some steps in between seem sensible.
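Purely as a hypothetical illustration of the kind of size-to-thread-count table being asked about (the breakpoints are invented, and this is not something OpenBLAS currently implements):

```c
/* Hypothetical ladder: thread count grows with problem size instead of
 * jumping from 1 straight to all cores.  Breakpoints are invented for
 * illustration and would have to come from per-platform measurement. */
static int threads_for_size(unsigned long work_elements, int max_threads)
{
    int t;
    if      (work_elements <    10000UL) t = 1;
    else if (work_elements <   200000UL) t = 2;
    else if (work_elements <  2000000UL) t = 4;
    else                                 t = max_threads;
    return (t < max_threads) ? t : max_threads;
}
```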
My issue right now is that I'd like to see an actual small test that checks how badly we are really affected by this. I am no user of cholmod myself. Those (vague) reports about deadlock-like situations with a factor-1000 slowdown are frightening.
And frankly, I have a colleague who has deep trouble understanding why you would call an optimized/parallelized dense BLAS on bits of sparse matrices anyway. But apparently people do that. We are both outside our realms of experience there.
I have moved to another discipline of computing. If you have time, at least give it a try:
- In interface/, pick a routine that does not have a thread threshold yet, so you cannot make it worse than it is...
- Run the matching case from benchmark/ single-threaded and all-threaded, overlay the graphs, and find roughly at what task size all-threading starts to pay off (see the sketch after this list).
- Note that Level 1 BLAS has one dimension and is easy to do, Level 2 has two, and for Level 3 you cannot even draw a 4D picture; you need to model the optimum and then track back to the input parameters.
- Most helpful is Linux perf record ; perf report: excess threading shows up as something other than the BLAS function consuming most of the CPU time.
- Another area is the work splitters right after the threading decision: should they split into cache-line, memory-page, or NUMA-stride chunks? (That would limit threading for super-small samples by a different constraint.)
- Another: if the L2 caches communicate with each other more slowly than at their own speed (mostly they do), it may be worth stepping up one core per full cache. The worst case is pumping a page over the system bus between cores, eating bandwidth away from RAM and PCI; this happens visibly on low-power SoCs.
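The sketch referred to in the second item above: a small driver along these lines could time one of the unguarded routines (dgemv here) single-threaded versus all-threaded over a range of sizes, to locate the crossover point. It assumes an OpenBLAS build whose cblas.h exposes openblas_set_num_threads() and openblas_get_num_procs().

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <cblas.h>

/* wall-clock time in seconds */
static double wall(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

/* average time of one n-by-n dgemv with the currently configured thread count */
static double bench(int n, int reps)
{
    double *A = malloc((size_t)n * n * sizeof *A);
    double *x = malloc((size_t)n * sizeof *x);
    double *y = malloc((size_t)n * sizeof *y);
    for (size_t i = 0; i < (size_t)n * n; i++) A[i] = 1.0;
    for (int i = 0; i < n; i++) x[i] = y[i] = 1.0;

    double t0 = wall();
    for (int r = 0; r < reps; r++)
        cblas_dgemv(CblasColMajor, CblasNoTrans, n, n,
                    1.0, A, n, x, 1, 0.0, y, 1);
    double t = wall() - t0;
    free(A); free(x); free(y);
    return t / reps;
}

int main(void)
{
    for (int n = 64; n <= 4096; n *= 2) {
        openblas_set_num_threads(1);                       /* single-threaded run */
        double t1 = bench(n, 20);
        openblas_set_num_threads(openblas_get_num_procs()); /* all-threaded run */
        double tn = bench(n, 20);
        printf("n=%5d  1-thread %.3e s  all-threads %.3e s  speedup %.2f\n",
               n, t1, tn, t1 / tn);
    }
    return 0;
}
```

Overlaying the two columns over n gives a rough picture of where the all-threaded curve finally drops below the single-threaded one.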
IMO SuiteSparse used the pthreads version, so each OpenMP thread actually became nproc more threads, and most of the time was consumed task-switching and pulling data between RAM and cache, but I might be completely wrong.
I am not sure if a heuristic for ramping up the number of threads is what is needed; experiments so far seem to indicate that the added overhead, and the need to get it right across a wide range of platforms, make it not worthwhile. In addition to the general overhead of premature multithreading for small cases, I have recently identified a historical oversight where a memory buffer is allocated although it is never used - this is what the PR is trying to address. At least in pthreads-based builds, the speed difference can reach a factor of ten, and some of these functions (mostly level 2 BLAS) appear to be on the code path of cholmod. Old versions of OpenBLAS until about three years ago did not employ any locking around buffer allocations, so they were much faster but wildly thread-unsafe. This may be the origin of the advice in the suitesparse code.
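To illustrate the kind of oversight described (this is a schematic with hypothetical helper names, not the actual code the PR touches): allocating the threading buffer before the single-thread early exit means small calls pay for a locked allocation they never use.

```c
#include <stdlib.h>

#define SMALL_THRESHOLD 1024            /* placeholder bound */

/* stand-ins for the locked buffer allocator and the compute paths */
static void *locked_buffer_alloc(void)   { return malloc(1 << 20); }
static void  locked_buffer_free(void *p) { free(p); }
static void  run_single_threaded(size_t n)            { (void)n; }
static void  run_multi_threaded(size_t n, void *buf)  { (void)n; (void)buf; }

/* before: allocate first, decide later; small calls pay for memory never used */
void routine_before_fix(size_t n)
{
    void *buffer = locked_buffer_alloc();
    if (n < SMALL_THRESHOLD)
        run_single_threaded(n);         /* buffer never touched */
    else
        run_multi_threaded(n, buffer);
    locked_buffer_free(buffer);
}

/* after: take the cheap single-threaded path without touching the allocator */
void routine_after_fix(size_t n)
{
    if (n < SMALL_THRESHOLD) {
        run_single_threaded(n);
        return;
    }
    void *buffer = locked_buffer_alloc();
    run_multi_threaded(n, buffer);
    locked_buffer_free(buffer);
}
```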
My initial tests with the cholmod "demo" files suggest that current MKL (plus Intel compiler optimizations, Intel libc and whatnot) is up to ten percent faster than current, gcc-compiled OpenBLAS on Haswell-generation desktop hardware. AVX512 may be a slightly different story, as that is less well supported in OpenBLAS, but I doubt the apocryphal "100 times slower" claim and am not inclined to go looking for it without at least some attempt at a reproducer.
I am trying to find out how relevant this issue is for us. It might well only hit for high numbers of threads/cores.
Can you give me the exact test code/compiler commands you used? I'll run it on our machines with up to 40 cores and compare.
CHOLMOD/Demo/Readme.txt explains that.
OpenBLAS must be built by typing make.
I confirm Martin's result on an early pre-AVX512 Skylake with 6 cores + HT; on AMD, though, the tables turn the other way around.
I built OpenBLAS with default options (both with and without USE_OPENMP=1), and built SuiteSparse with either make CC=icc CXX=icc or make BLAS="/opt/OpenBLAS/lib -lopenblas" LAPACK="/opt/OpenBLAS/lib -lopenblas". Then I mainly ran cholmod_demo <Matrix/bcsstk01.tri from its CHOLMOD/Demo subdirectory. (I did not see much variation in results for OpenBLAS versions ranging from 0.2.17 to today, but no dramatic improvements from my proposed patches either.)
Have not tried anything bigger than the old hexacore i7-8700K in my office; I may find time to run it on a Ryzen 3900X or package the whole thing for CI. (On AMD you need to set an environment variable to make MKL skip its cpu checks, else you'll end up in a crawling "compatibility mode" when it sees a non-Intel cpu.)
Davis states he cannot remember when or which version, and also expects the provided matrix files in the cholmod demo to be much too small to reproduce the problem. I now see about a 20 percent advantage for MKL with bigger files from his sparse matrix collection (e.g. the 18000x18000 nd6k.mtx that is referenced in the "gpu.sh" example, or the 46000x46000 bcsstk39.mtx); the main culprit appears to be SYRK (which is not affected by my experimental patches as of yet).
Sorry for being silent here … but I figured that my tests with these examples probably wouldn't change the picture much. A 20 % difference is definitely not what he is talking about; we're looking for a 100x slowdown with multiple threads :-/
Actually …
> IMO SuiteSparse used the pthreads version, so each OpenMP thread actually became nproc more threads, and most of the time was consumed task-switching and pulling data between RAM and cache, but I might be completely wrong.
… is that the whole issue? Was someone using an OpenMP program that multiplied its threads by a recursively parallel OpenBLAS?
Quite often; at least the stated performance impact looks that spectacular. A performance profile of the damage would be very welcome here; even a screenshot showing some non-productive function eating lots of CPU time would help, as there are cases where we could reconstruct the source of a regression from that.
n^3 overcommitment can be achieved too: run n processes, each automatically starting n OpenMP threads, each of which primes n pthreads. That would be characterized as "grinding for days without any sensible result while a one-cpu docker finishes in X seconds..."
FWIW, I got our drone.io CI to build and run SuiteSparse with current develop branch on 24-core/48-thread AMD Epyc, and while I do not have any MKL numbers for comparison, at least the OpenBLAS version does not lock up or spend extraordinary time on the 18k-by-18k "nd6k" cholmod test. I could repeat the same test on a 96-core ARM64 server there if desired.
And we got Tim Davis digging for that original test case … which will be part of the CI soon, I hope.
One thought: Could the CI on the 24 core EPYC easily do a run with past OpenBLAS versions to test the hypothesis that the introduction of extra locking around 3 years ago influenced things, and maybe the behaviour got fixed in the meantime?
I could probably massage the yml file into downloading some older release instead of the current head. (And I now believe the claim of the extreme slowdown comes from #2265: late 2019, but only referring to unspecified prepackaged Ubuntu and RedHat builds of OpenBLAS, which could have been anything from a year earlier, or even a bug in libgomp or whatever.)
Unfortunately the performance of the CI host turns out to be far too variable to use for benchmarking past releases (more than a factor of ten between individual runs of cholmod <nd6k.mtx with the exact same build)
A pointer to a test case was just added to the suitesparse issue ticket mentioned above (CHOLMOD of a 60000x60000 matrix, multithreaded run seen spending an inordinate amount of "system" time probably spinning on a lock somewhere)
Normally the approach is to run with 1, 2 and all NUM_THREADS and compare the top 10 entries from perf record -p <pid> ; perf report.
It used to be sched_yield quite often, but with that now gone I don't know what it could be.
PS almost forgot about this.
Seems I can only reproduce that when I combine (OpenMP-using) SuiteSparse with a non-OpenMP build of OpenBLAS; when both use OpenMP it is a simple case of finding the optimum number of threads for the problem size. In the broken ("mixed") case, most time is spent in gomp_barrier_wait_end; not sure if there is anything new to learn from this.
Came across https://github.com/mlpack/mlpack/issues/1858 and the suggestion to set OMP_WAIT_POLICY=passive solves the problem for the "mixed" run (though in that case single-threading remains 40 percent faster than even running with 2 threads, and using all 6 is 3.4 times slower than single).
Set the threshold to something like samplesize < pagesize * ncpu, without any measurement? I think that's faster, and it obviously avoids the damage of dragging a page between CPU caches....
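A minimal sketch of that rule of thumb, assuming the working-set size in bytes is known at the call site (the sysconf queries are standard POSIX; everything else here is made up for illustration):

```c
#include <unistd.h>
#include <stddef.h>

/* Suggested rule of thumb: stay single-threaded whenever the working set
 * is smaller than one page per CPU, so threads never end up fighting over
 * the same few pages. */
static int should_thread(size_t bytes_of_work)
{
    long page = sysconf(_SC_PAGESIZE);
    long ncpu = sysconf(_SC_NPROCESSORS_ONLN);
    if (page <= 0 || ncpu <= 0)
        return 0;                       /* be conservative if unknown */
    return bytes_of_work >= (size_t)page * (size_t)ncpu;
}
```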
Closing here, as the original topic has been resolved by adding lower bounds and in some cases shortcuts in the 0.3.16 to 0.3.21 timeframe. The suitesparse issue is also tracked in #2265; I intend to perform another test run with current versions locally and/or in the GCC compile farm, but preliminary results underline that the old claim of a 100x slowdown can only have been based on one unfortunate mix of unspecified versions and build options.
Thanks, haven't seen this class of issues in a while and frankly forgot this was open.