
QUERY: Reduced performance on certain architectures, only regained with `OPENBLAS_NUM_THREADS=1`

Open ilayn opened this issue 5 months ago • 14 comments

Hello,

This is probably not a real issue for OpenBLAS but basically a request for information. Over at SciPy, we have been receiving sporadic reports that otherwise identical C translations of the old Fortran77 code run substantially slower when the thread count is not limited to 1.

https://github.com/scipy/scipy/issues/22438 https://github.com/scipy/scipy/issues/23161 https://github.com/scipy/scipy/issues/23191

The code in question is here (not sure it matters but for reference)

https://github.com/scipy/scipy/blob/main/scipy/optimize/__lbfgsb.c

and the only BLAS/LAPACK calls made in this code are

DAXPY
DSCAL
DCOPY
DNRM2
DDOT

DPOTRF
DTRTRS

I am trying to understand which call might be affected, since I don't quite understand why OPENBLAS_NUM_THREADS=1 recovers the performance. If this is needed at all times, we should probably include some sort of guard on the SciPy side, since users won't even know this setting is needed for comparable performance. And since we use these functions in other parts of SciPy, it would be nice to know when we are triggering such behavior.
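
For illustration, such a guard could look roughly like the sketch below (it assumes threadpoolctl, which SciPy does not depend on at runtime as far as I know, so this is only to show the idea):

```python
# Rough sketch of a guard: limit all BLAS thread pools to one thread
# around the L-BFGS-B call (assumes threadpoolctl is installed).
from threadpoolctl import threadpool_limits
from scipy.optimize import minimize, rosen

with threadpool_limits(limits=1, user_api="blas"):
    # BLAS calls made by the L-BFGS-B internals run single-threaded here
    res = minimize(rosen, x0=[1.3, 0.7, 0.8, 1.9, 1.2], method="L-BFGS-B")
print(res.x)
```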

ilayn avatar Jul 16 '25 06:07 ilayn

This is a bit too nebulous to answer - I assume the original Fortran code you translated must have been making these same BLAS calls before, without this lack of performance? If forcing OpenBLAS to use a single thread helps, it either means that some of its functions employ multithreading prematurely for the problem size (and architecture) involved, or OpenBLAS' multithreading is interfering with the threads already created by your SciPy environment, in particular if you are calling into OpenBLAS from multiple SciPy threads in parallel. Do you have any data on whether setting OPENBLAS_NUM_THREADS to a small value like 2 or 4 gives better or worse performance?
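
Something along these lines would collect that data, since OPENBLAS_NUM_THREADS has to be set before the library is loaded (sketch only; `lbfgsb_repro.py` stands in for whatever self-contained reproducer you have):

```python
# Time a reproducer script under different OPENBLAS_NUM_THREADS settings.
import os
import subprocess
import sys
import time

for setting in ("1", "2", "4", None):  # None = leave the variable unset
    env = os.environ.copy()
    env.pop("OPENBLAS_NUM_THREADS", None)
    if setting is not None:
        env["OPENBLAS_NUM_THREADS"] = setting
    t0 = time.perf_counter()
    subprocess.run([sys.executable, "lbfgsb_repro.py"], env=env, check=True)
    print(setting or "unset", f"{time.perf_counter() - t0:.3f}s")
```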

martin-frbg avatar Jul 16 '25 07:07 martin-frbg

Indeed, I am also a bit confused about this. Sorry for the vagueness. Let me ask around for 2 and 4 and report back.

ilayn avatar Jul 16 '25 07:07 ilayn

I assume the original Fortran code you translated must have been making these same BLAS calls before, without this lack of performance?

No, it is a very old F77 codebase, so the BLAS/LAPACK sources were copy-pasted directly from the reference implementation and everything was compiled together.

ilayn avatar Jul 16 '25 07:07 ilayn

No, it is a very old F77 codebase, so the BLAS/LAPACK sources were copy-pasted directly from the reference implementation and everything was compiled together.

So the "old" code was already running single-threaded in all cases. This could be something like #5328 (SCAL employing multithreading too early on at least some modern hardware), aggravated by the fact that most OpenBLAS functions will use either one thread or as many as there are cpu cores (unless constrained by OPENBLAS_NUM_THREADS).

martin-frbg avatar Jul 16 '25 07:07 martin-frbg

This is almost certainly a bad interaction between multiple BLAS libraries loaded into the same process - the reproducer loads MKL, LP64 OpenBLAS and ILP64 OpenBLAS. It's showing up as a performance regression because the lbfgsb code got rewritten from not using a threaded BLAS to calling OpenBLAS, but it's a generic problem that has come up many times before. I'll try to help get to the bottom of it in https://github.com/scipy/scipy/issues/23191#issuecomment-3077339976.
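
For reference, threadpoolctl makes it easy to see which BLAS libraries actually end up loaded in a given process (sketch; the scipy import below just stands in for whatever pulls in the wheels under test):

```python
# List every BLAS/OpenMP thread pool loaded into the current process.
import pprint

import scipy.optimize  # noqa: F401  (stand-in import that loads the vendored BLAS)
from threadpoolctl import threadpool_info

# Each entry reports the shared-library path, the internal API
# (openblas / mkl / blis), its version, and the current thread count.
pprint.pprint(threadpool_info())
```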

rgommers avatar Jul 16 '25 07:07 rgommers

Here are some results they provided:

OPENBLAS_NUM_THREADS    Runtime (s)
1                        5.275556
2                        5.761794
3                        6.064222
4                       20.362522
5                       25.854847
None (unset)            28.160928

ilayn avatar Jul 16 '25 07:07 ilayn

@rgommers This does not sound healthy at all. I'm happy to help with any remaining performance issues once you've resolved this, but I can afford to spend only limited time and mental energy on OpenBLAS right now and do not have time to read the linked SciPy issues. @ilayn thanks, that does not look as if intermediate thread counts would help in that particular case.

martin-frbg avatar Jul 16 '25 07:07 martin-frbg

@martin-frbg Thank you regardless. I know it can be quite taxing reading some detective work. So please take your time and prioritize your well-being. I just wanted to know if you could recognize it directly or not. Otherwise we'll figure something out :)

ilayn avatar Jul 16 '25 07:07 ilayn

Just in case, here is the original issue in our library:

  • https://github.com/optuna/optuna/pull/6191

nabenabe0928 avatar Jul 16 '25 08:07 nabenabe0928

Another finding from my side (summary: when the incoming floating-point array input has 1 << 31 (2**31 in Python notation) or larger, I saw a significant slowdown on my Ubuntu machine):

  • https://github.com/optuna/optuna/pull/6191#issuecomment-3061733090

Please note that the array is created via NumPy and the array data type is float (numpy.float64).

nabenabe0928 avatar Jul 16 '25 08:07 nabenabe0928

@rgommers This does not sound healthy at all. I'm happy to help with any remaining performance issues once you've resolved this, but I can afford to spend only limited time and mental energy on OpenBLAS right now and do not have time to read the linked SciPy issues.

100% agreed, not healthy at all. It's caused by the unhealthy vendoring process that's specific to Python's binary wheels distributed on PyPI. In addition, there may be some relevant change in one of the libraries involved, but that's unclear at this point.

Please don't worry about this issue, I'll assign myself here and won't ping you unless I'm sure it's an issue in OpenBLAS itself (and if so, hopefully with a clear diagnosis).

rgommers avatar Jul 16 '25 08:07 rgommers

Thanks. I guess with run times in the low seconds, just having several competing BLAS libraries all racing to create a bunch of idling (or even busy-waiting) threads would be noticeable. But as mentioned, there is an open issue pertaining to the multithreading threshold in SCAL, and there is also another where parallel POTRF seems to show unexpected serialization.

martin-frbg avatar Jul 16 '25 09:07 martin-frbg

@rgommers If you manage to get a local repro mechanism, let me know so I can write dscal as a native C loop in a PR and we can test Martin's suspicion about ?SCAL.
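
As a rough user-level analogue of that test (not the C-level change itself), comparing OpenBLAS's dscal against a plain in-place multiply, which NumPy performs without BLAS, would already hint at whether dscal is the culprit; the vector size below is only a guess at a representative problem size:

```python
# Compare OpenBLAS dscal with a BLAS-free in-place multiply of the same vector.
import time

import numpy as np
from scipy.linalg import blas

n = 1_000_000  # hypothetical size; adjust to match the failing workload
x = np.random.default_rng(0).standard_normal(n)

t0 = time.perf_counter()
for _ in range(200):
    blas.dscal(0.999, x)   # OpenBLAS dscal, may take the multithreaded path
t1 = time.perf_counter()
for _ in range(200):
    x *= 0.999             # plain in-place multiply, no BLAS involved
t2 = time.perf_counter()

print(f"dscal: {t1 - t0:.3f}s   in-place multiply: {t2 - t1:.3f}s")
```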

ilayn avatar Jul 16 '25 11:07 ilayn

If you can get it down to something mostly self-contained, it should also be trivial to adjust the threshold value in interface/scal.c (or the interface of any other BLAS function implicated) - actually, just knowing the element count submitted to any BLAS call would clarify whether it is going to take the multithreaded code paths.
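
To make the last point concrete, a small sweep over vector lengths, timed with and without a single-thread limit, should show the size at which the multithreaded path kicks in (sketch; it applies the limit at runtime via threadpoolctl rather than through the environment variable):

```python
# Time dscal over a range of vector lengths, with the default thread count
# and with the BLAS pool limited to a single thread.
import time

import numpy as np
from scipy.linalg import blas
from threadpoolctl import threadpool_limits

def bench_dscal(n, repeats=100):
    x = np.ones(n)
    t0 = time.perf_counter()
    for _ in range(repeats):
        blas.dscal(1.0000001, x)
    return (time.perf_counter() - t0) / repeats

for n in (1_000, 10_000, 100_000, 1_000_000, 10_000_000):
    t_default = bench_dscal(n)
    with threadpool_limits(limits=1, user_api="blas"):
        t_single = bench_dscal(n)
    print(f"n={n:>10}: default {t_default:.2e} s   single-thread {t_single:.2e} s")
```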

martin-frbg avatar Jul 16 '25 12:07 martin-frbg