
Functions are not guarded against early threading

Open brada4 opened this issue 6 years ago • 28 comments

Not an exhaustive list, just the functions hit in practice; this sort of measurement takes some time.

  • Discovered in #1882: dgetrf, zherk, dsyrk
  • From #1883: axpy
  • From #1897 (mxnet): cblas_?syrk, _potri_
  • From https://github.com/JuliaRobotics/RigidBodyDynamics.jl/issues/500: trsm

brada4 avatar Nov 26 '18 16:11 brada4

Probably worth turning this into a task list, sorted into missing guards and suspected wrong guards.

brada4 avatar Nov 26 '18 17:11 brada4

Based on preliminary benchmarks it may make sense to double the multithreading threshold in interface/trsm.c for the two COMPLEX cases and raise it to four or five times the current limit (of 2*GEMM_MULTITHREAD_THRESHOLD) for float and double.
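For illustration, a minimal sketch of what such a guard boils down to - this is not the actual interface/trsm.c code, and the threshold value, the helper name and the exact factors are placeholders for the proposal above:

    /* Sketch only: stay single-threaded below a size limit, with the limit
       doubled for the complex cases and raised 4x for the real ones. */
    #include <stddef.h>
    #include <stdio.h>

    #define GEMM_MULTITHREAD_THRESHOLD 4   /* placeholder; set at build time in OpenBLAS */

    static int trsm_nthreads(size_t m, size_t n, int is_complex, int max_threads)
    {
        size_t limit = 2 * GEMM_MULTITHREAD_THRESHOLD;   /* current limit */
        limit *= is_complex ? 2 : 4;                     /* proposed raise */
        return (m * n < limit) ? 1 : max_threads;        /* small problems stay on one thread */
    }

    int main(void)
    {
        printf("4x4   real -> %d thread(s)\n", trsm_nthreads(4, 4, 0, 8));
        printf("64x64 real -> %d thread(s)\n", trsm_nthreads(64, 64, 0, 8));
        return 0;
    }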

martin-frbg avatar Jan 13 '19 21:01 martin-frbg

https://github.com/xianyi/OpenBLAS/issues/1989#issuecomment-460371387 has a test case exercising dsymv via LAPACK; there is no performance regression, but no thread threshold either.

brada4 avatar Feb 05 '19 20:02 brada4

#1115 is relevant for dsyrk (and suggests that level3 unrolling is an important factor besides the multithreading thresholds or lack thereof in the interface)

martin-frbg avatar Feb 06 '19 08:02 martin-frbg

Any progress on this? I am trying to get a grip on https://github.com/DrTimothyAldenDavis/SuiteSparse/issues/1 to decide whether it is worth packaging Octave with sparse matrix computations via SuiteSparse and OpenBLAS in pkgsrc. It's quite a bummer to have the prominent notice not to use OpenBLAS for SuiteSparse.

For now I do not even know how real the threat of abysmal performance is (factor 1000?!). Uncertainty, doubt. It's an unfortunate situation that doesn't seem to see progress.

drhpc avatar May 29 '21 09:05 drhpc

I'm just now experimenting with a few things (#3252, which is a bit broken on arm64 right now).

martin-frbg avatar May 29 '21 11:05 martin-frbg

@drhpc the initial idea is fairly simple: draw a line in the sand where computation switches from one CPU to all CPUs. It may be slightly dated now that CPUs come with more and more cores.

cholmod manages threads via (G|I)OMP, and calls these:

    liblapack.so.3 => zpotrf_,dpotrf_
    libopenblas_pthreads.so.0 => dtrsm_,ztrsm_,dgemm_,dtrsv_,ztrsv_,zgemv_,zherk_,dgemv_,dsyrk_,zgemm_

brada4 avatar May 29 '21 12:05 brada4

So you actually need some formula or table to use 1, 2, 4, … $many cores depending on data size? Lots of heuristics have to go into that :-/ Right now we still have 16 cores per node in the cluster, but soon that will probably be 128. Some steps in between seem sensible.

drhpc avatar May 29 '21 12:05 drhpc

My issue right now is that I'd like to see an actual little test that checks how badly we are actually affected by this. I am no user of cholmod myself. Those (vague) reports about deadlock-like situations with a factor-1000 slowdown are frightening.

And frankly, I got a colleague who has deep trouble understanding why you would call an optimized/parallelized dense BLAS on bits of sparse matrices anyway. But apparently people do that. We both are outside our realms of experience there.

drhpc avatar May 29 '21 12:05 drhpc

I have moved to another discipline of computing. If you have time, at least give it a try:

  • In interface/, pick a function that does not have a thread threshold, so you cannot make it worse than it is...
  • Run the corresponding benchmark/ program single-threaded and all-threaded, overlay the graphs, and find roughly the task size at which all-threading starts to pay off (see the sketch after this list).
  • Note that Level 1 BLAS has one dimension and is easy to do, Level 2 has two, and for Level 3 you cannot even draw a 4D picture; you need to model the optimum and then track back to the input parameters.
  • Most helpful is Linux perf record ; perf report - excess threading shows up as something other than the BLAS function using most of the CPU time.
  • Another question is the work splitters right after the threading decision: should they split into cache-line, memory-page or NUMA-stride chunks? (That would limit threading for very small samples by a different constraint.)
  • Another: if the L2 caches communicate with each other more slowly than at their own speed (which is usually the case), it might be worth stepping up one core per full cache. The worst case is pumping a page over the system bus between cores, eating bandwidth away from RAM and PCI; this happens visibly on low-power SoCs.
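A rough timing harness for the overlay step might look like the sketch below (just a sketch, not a tuned benchmark: it uses dtrsm purely as an example, assumes the OpenBLAS extensions openblas_set_num_threads() and openblas_get_num_procs() declared in cblas.h, and would be built with something like gcc bench.c -lopenblas):

    /* Time one dtrsm call per size, once with 1 thread and once with all
       cores, and print both columns so the curves can be overlaid. */
    #include <cblas.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    static double now(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + 1e-9 * ts.tv_nsec;
    }

    static double time_dtrsm(int n, int threads)
    {
        double *a = calloc((size_t)n * n, sizeof(double));
        double *b = malloc((size_t)n * n * sizeof(double));
        for (int i = 0; i < n * n; i++) b[i] = 1.0;
        for (int i = 0; i < n; i++) a[i * n + i] = 2.0;   /* simple triangular A */

        openblas_set_num_threads(threads);
        double t0 = now();
        cblas_dtrsm(CblasRowMajor, CblasLeft, CblasLower, CblasNoTrans,
                    CblasNonUnit, n, n, 1.0, a, n, b, n);
        double t = now() - t0;
        free(a); free(b);
        return t;
    }

    int main(void)
    {
        int all = openblas_get_num_procs();
        printf("    n   1 thread   %d threads\n", all);
        for (int n = 16; n <= 2048; n *= 2)
            printf("%5d  %9.6f  %9.6f\n", n, time_dtrsm(n, 1), time_dtrsm(n, all));
        return 0;
    }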

IMO SuiteSparse used the pthread version, so each OpenMP thread actually became nproc more threads, and most of the time was consumed by task switching and pulling data between RAM and cache, but I might be completely wrong.

brada4 avatar May 29 '21 13:05 brada4

I am not sure that heuristics for ramping up the number of threads are what is needed; experiments so far seem to indicate that the added overhead, and the need to get it right across a wide range of platforms, make it not worthwhile. In addition to the general overhead of premature multithreading for small cases, I have recently identified a historical oversight where a memory buffer is allocated although it is never used - this is what the PR is trying to address. At least in pthreads-based builds, the speed difference can reach a factor of ten, and some of these functions (mostly Level 2 BLAS) appear to be on the code path of cholmod. Old versions of OpenBLAS, until about three years ago, did not employ any locking around buffer allocations, so they were much faster but wildly thread-unsafe. This may be the origin of the advice in the SuiteSparse code.

martin-frbg avatar May 29 '21 17:05 martin-frbg

My initial tests with the cholmod "demo" files suggest that current MKL (plus Intel compiler optimizations, Intel libc and whatnot) is up to ten percent faster than current, gcc-compiled OpenBLAS on Haswell-generation desktop hardware. AVX512 may be a slightly different story, as that is less well supported in OpenBLAS, but I doubt the apocryphal "100 times slower" claim and I am not inclined to go looking for it without at least some attempt at a reproducer.

martin-frbg avatar Jun 02 '21 06:06 martin-frbg

My initial tests with the cholmod "demo" files suggest that current MKL (plus Intel compiler optimizations, Intel libc and whatnot) is up to ten percent faster than current, gcc-compiled OpenBLAS on Haswell-generation desktop hardware.

I am trying to find out how relevant this issue is for us. It might well only hit for high numbers of threads/cores.

Can you give me the exact test code/compiler commands you used? I'll run it on our machines with up to 40 cores and compare.

drhpc avatar Jun 02 '21 07:06 drhpc

CHOLMOD/Demo/Readme.txt explains that. OpenBLAS must be built by typing make. I can confirm Martin's result on an early pre-AVX512 Skylake with 6 cores + HT; on AMD, though, the tables turn the other way around.

brada4 avatar Jun 02 '21 11:06 brada4

I built OpenBLAS with default options (both with and without USE_OPENMP=1), and built SuiteSparse with either make CC=icc CXX=icc or make BLAS="/opt/OpenBLAS/lib -lopenblas" LAPACK="/opt/OpenBLAS/lib -lopenblas". Then I mainly ran cholmod_demo <Matrix/bcsstk01.tri from its CHOLMOD/Demo subdirectory. (Did not see much variation in results for OpenBLAS versions ranging from 0.2.17 to today, but no dramatic improvements from my proposed patches either.) Have not tried anything bigger than the old hexacore i7-8700K in my office; may find time to run it on a Ryzen 3900X or package the whole thing for CI. (On AMD you need to set an environment variable to make MKL skip its cpu checks, else you'll end up in a crawling "compatibility mode" when it sees a non-Intel cpu.)

martin-frbg avatar Jun 02 '21 12:06 martin-frbg

Davis states he cannot remember when or which version it was, and also expects the provided matrix files in the cholmod demo to be much too small to reproduce the problem. I now see about a 20 percent advantage for MKL with bigger files from his sparse matrix collection (e.g. the 18000x18000 nd6k.mtx that is referenced in the "gpu.sh" example, or the 46000x46000 bcsstk39.mtx); the main culprit appears to be SYRK (which is not affected by my experimental patches as of yet).

martin-frbg avatar Jun 07 '21 20:06 martin-frbg

Sorry for being silent here … but I figured that my tests with these examples probably wouldn't change the picture much. A 20% difference is definitely not what he is talking about; we're looking for a 100x slowdown with multiple threads :-/

drhpc avatar Jun 07 '21 21:06 drhpc

Actually …

IMO SuiteSparse used the pthread version, so each OpenMP thread actually became nproc more threads, and most of the time was consumed by task switching and pulling data between RAM and cache, but I might be completely wrong.

… is that the whole issue? Was someone using an OpenMP program that multiplied its threads by recursively parallel OpenBLAS?

drhpc avatar Jun 07 '21 23:06 drhpc

… is that the whole issue? Was someone using an OpenMP program that multiplied its threads by recursively parallel OpenBLAS?

Quite often; at least the stated performance impact looks that spectacular. A performance profile from the damaged case would be very welcome here - even a screenshot showing some non-productive function eating lots of CPU time; there are cases where we could reconstruct the source of the regression from that.

n^3 overcommitment can be achieved too: run n processes, each automatically starting n OpenMP threads, each of which in turn spawns n pthreads. That gets characterized as "grinding for days without any sensible result while a one-cpu docker container finishes in X seconds..."

brada4 avatar Jun 08 '21 15:06 brada4

FWIW, I got our drone.io CI to build and run SuiteSparse with current develop branch on 24-core/48-thread AMD Epyc, and while I do not have any MKL numbers for comparison, at least the OpenBLAS version does not lock up or spend extraordinary time on the 18k-by-18k "nd6k" cholmod test. I could repeat the same test on a 96-core ARM64 server there if desired.

martin-frbg avatar Jun 09 '21 16:06 martin-frbg

And we got Tim Davis digging for that original test case … which will be part of the CI soon, I hope.

One thought: Could the CI on the 24 core EPYC easily do a run with past OpenBLAS versions to test the hypothesis that the introduction of extra locking around 3 years ago influenced things, and maybe the behaviour got fixed in the meantime?

drhpc avatar Jun 09 '21 18:06 drhpc

I could probably massage the yml file into downloading some older release instead of the current head. (And I now believe the claim of the extreme slowdown comes from #2265 - late 2019, but referring only to unspecified prepackaged Ubuntu and RedHat builds of OpenBLAS, which could have been anything from a year earlier, or even a bug in libgomp or whatever.)

martin-frbg avatar Jun 09 '21 18:06 martin-frbg

Unfortunately the performance of the CI host turns out to be far too variable to use for benchmarking past releases (more than a factor of ten between individual runs of cholmod <nd6k.mtx with the exact same build)

martin-frbg avatar Jun 13 '21 12:06 martin-frbg

A pointer to a test case was just added to the suitesparse issue ticket mentioned above (CHOLMOD of a 60000x60000 matrix, with the multithreaded run seen spending an inordinate amount of "system" time, probably spinning on a lock somewhere).

martin-frbg avatar Nov 09 '21 21:11 martin-frbg

Normally the approach is to run with 1, 2 and all NUM_THREADS and compare the top 10 in perf record -p <pid> ; perf report. It used to be sched_yield quite often, but with that now out I don't know what it could be. PS: I had almost forgotten about this.

brada4 avatar Nov 10 '21 17:11 brada4

It seems I can only reproduce that when I combine (OpenMP-using) SuiteSparse with a non-OpenMP build of OpenBLAS; when both use OpenMP it is a simple case of finding the optimum number of threads for the problem size. In the broken ("mixed") case, most time is spent in gomp_barrier_wait_end - not sure if there is anything new to learn from this.

martin-frbg avatar Nov 10 '21 20:11 martin-frbg

Came across https://github.com/mlpack/mlpack/issues/1858, and the suggestion there to set OMP_WAIT_POLICY=passive solves the problem for the "mixed" run (though in that case single-threading remains 40 percent faster than even running with 2 threads, and using all 6 is 3.4 times slower than single-threaded).

martin-frbg avatar Nov 10 '21 22:11 martin-frbg

Set the threshold to something like samplesize < pagesize * ncpu, without measurement? I think that's faster, and it obviously avoids the damage of dragging pages between CPU caches....
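For concreteness, a minimal sketch of that rule of thumb (the function name, the element-size argument and the cutoff itself are illustrative only, not anything agreed on for OpenBLAS):

    /* Skip threading outright when the working set is smaller than one page
       per CPU - a sketch of the heuristic floated above, nothing more. */
    #include <stddef.h>
    #include <stdio.h>
    #include <unistd.h>

    static int use_all_threads(size_t m, size_t n, size_t elem_size, int ncpu)
    {
        size_t sample_bytes = m * n * elem_size;
        size_t page = (size_t)sysconf(_SC_PAGESIZE);
        return sample_bytes >= page * (size_t)ncpu;   /* below this, stay on one CPU */
    }

    int main(void)
    {
        /* e.g. a 100x100 double matrix on an 8-cpu machine */
        printf("%s\n", use_all_threads(100, 100, sizeof(double), 8)
                           ? "multithread" : "single thread");
        return 0;
    }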

brada4 avatar Nov 11 '21 13:11 brada4

Closing here, as the original topic has been resolved by adding lower bounds and, in some cases, shortcuts in the 0.3.16 to 0.3.21 timeframe. The SuiteSparse issue is also tracked in #2265 - I intend to perform another test run with current versions locally and/or in the GCC compile farm, but preliminary results underline that the old claim of a 100x slowdown can only have been based on one unfortunate mix of unspecified versions and build options.

martin-frbg avatar Jan 14 '24 21:01 martin-frbg

Thanks, I haven't seen this class of issues in a while and frankly forgot this was open.

brada4 avatar Jan 14 '24 23:01 brada4