QUERY: Reduced performance on certain architectures, only regained with `OPENBLAS_NUM_THREADS=1`
Hello,
This is probably not a real issue for OpenBLAS but basically a request for information. Over at SciPy, we have been receiving sporadic reports that otherwise identical C translations of the old Fortran77 code run substantially slower when the thread count is not limited to 1.
- https://github.com/scipy/scipy/issues/22438
- https://github.com/scipy/scipy/issues/23161
- https://github.com/scipy/scipy/issues/23191
The code in question is here (not sure it matters, but for reference):
https://github.com/scipy/scipy/blob/main/scipy/optimize/__lbfgsb.c
and the only BLAS/LAPACK calls made in this code are
DAXPY
DSCAL
DCOPY
DNRM2
DDOT
DPOTRF
DTRTRS
I am trying to understand which call might be affected, since I don't quite understand why OPENBLAS_NUM_THREADS=1 recovers the performance. If this is needed at all times, we should probably include some sort of guard on the SciPy side, since users won't even know this setting is needed for comparable performance. And since we are using these functions in other parts of SciPy, it would be nice to know when we are running into such behavior.
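For illustration only, here is the rough shape of the guard I have in mind on our side. This is just a sketch: the helper name `lbfgsb_limit_blas_threads` is made up, and it relies on looking up the OpenBLAS-specific `openblas_set_num_threads` extension at runtime so that the guard stays a no-op with any other BLAS library.

```c
/* Illustrative sketch only: look up the OpenBLAS-specific thread-count
   setter at runtime so the guard is a no-op with any other BLAS library.
   RTLD_DEFAULT needs _GNU_SOURCE on glibc; link with -ldl on older glibc. */
#define _GNU_SOURCE
#include <dlfcn.h>

static void lbfgsb_limit_blas_threads(int nthreads)
{
    void (*set_threads)(int) =
        (void (*)(int))dlsym(RTLD_DEFAULT, "openblas_set_num_threads");

    if (set_threads)
        set_threads(nthreads);
    /* if the symbol is absent, we are not running on OpenBLAS: do nothing */
}
```

We would only wire in something like `lbfgsb_limit_blas_threads(1)` if a cap turns out to be genuinely needed; I would rather understand the root cause first.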
This is a bit too nebulous to answer - I assume the original Fortran code you translated must have been making these same BLAS calls before, without this lack of performance? If forcing OpenBLAS to use a single thread helps, it either means that some of its functions employ multithreading prematurely for the problem size (and architecture) involved, or that OpenBLAS' multithreading is interfering with threads already created by your SciPy environment, in particular if you are calling into OpenBLAS from multiple SciPy threads in parallel. Do you have any data on whether setting OPENBLAS_NUM_THREADS to a small value like 2 or 4 gives better or worse performance?
Indeed, I am also a bit confused about this. Sorry for the vagueness. Let me ask around for 2 and 4 and report back.
> I assume the original Fortran code you translated must have been making these same BLAS calls before, without this lack of performance?
No, it is a very old F77 codebase, hence the BLAS/LAPACK sources were copy-pasted directly from the reference implementation and everything was compiled together.
> No, it is a very old F77 codebase, hence the BLAS/LAPACK sources were copy-pasted directly from the reference implementation and everything was compiled together.
So the "old" code was already running single-threaded in all cases. This could be something like #5328 (SCAL employing multithreading too early on at least some modern hardware), aggravated by the fact that most OpenBLAS functions will use either one thread or as many threads as there are CPU cores (unless constrained by OPENBLAS_NUM_THREADS).
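A quick way to check whether that is what happens on the affected hardware is to time DSCAL at a few vector lengths with one thread versus all cores. The following is only a sketch that assumes a direct link against OpenBLAS (e.g. `gcc bench.c -lopenblas`); the sizes and repetition count are arbitrary, and `openblas_set_num_threads` / `openblas_get_num_procs` are OpenBLAS extensions, not standard CBLAS.

```c
/* Sketch: time cblas_dscal with 1 thread vs. all cores at a few sizes. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <cblas.h>

/* OpenBLAS extensions, not part of standard CBLAS */
extern void openblas_set_num_threads(int num_threads);
extern int  openblas_get_num_procs(void);

static double time_dscal_ms(int n, double *x, int reps)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int r = 0; r < reps; r++)
        cblas_dscal(n, 1.0000001, x, 1);   /* scale the whole vector */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
}

int main(void)
{
    const int sizes[] = {1000, 10000, 100000, 1000000};
    const int reps = 1000;

    for (size_t i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
        int n = sizes[i];
        double *x = malloc((size_t)n * sizeof(double));
        for (int j = 0; j < n; j++)
            x[j] = 1.0;

        openblas_set_num_threads(1);
        double single = time_dscal_ms(n, x, reps);

        openblas_set_num_threads(openblas_get_num_procs());
        double all_cores = time_dscal_ms(n, x, reps);

        printf("n=%8d   1 thread: %8.2f ms   all cores: %8.2f ms\n",
               n, single, all_cores);
        free(x);
    }
    return 0;
}
```

If the all-cores column becomes noticeably slower than the single-thread one at one of the smaller sizes, a too-low multithreading threshold is the likely culprit.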
This is almost certainly a bad interaction between multiple BLAS libraries loaded into the same process - the reproducer loads MKL, LP64 OpenBLAS and ILP64 OpenBLAS. It's showing up as a performance regression because the lbfgsb code got rewritten from not using a threaded BLAS to calling OpenBLAS, but it's a generic problem that has come up many times before. I'll try to help get to the bottom of it in https://github.com/scipy/scipy/issues/23191#issuecomment-3077339976.
Here are some results they provided:
| OPENBLAS_NUM_THREADS | Runtime (s) |
|---|---|
| 1 | 5.275556 |
| 2 | 5.761794 |
| 3 | 6.064222 |
| 4 | 20.362522 |
| 5 | 25.854847 |
| unset | 28.160928 |
@rgommers This does not sound healthy at all. I'm happy to help with any remaining performance issues once you've resolved this, but I can afford to spend only limited time and mental energy on OpenBLAS right now and do not have time to read the linked SciPy issues. @ilayn Thanks, that does not look as if intermediate thread counts would help in that particular case.
@martin-frbg Thank you regardless. I know reading through this kind of detective work can be quite taxing, so please take your time and prioritize your well-being. I just wanted to know whether you could recognize the problem right away; otherwise we'll figure something out :)
Just in case, here is the original issue in our library:
- https://github.com/optuna/optuna/pull/6191
Another finding from my side (summary: when the incoming floating-point array input is 1 << 31 (2**31 in Python notation) or larger, I saw a significant slowdown on my Ubuntu machine):
- https://github.com/optuna/optuna/pull/6191#issuecomment-3061733090
Please note that the array is created via NumPy and the array data type is float (numpy.float64).
> @rgommers This does not sound healthy at all. I'm happy to help with any remaining performance issues once you've resolved this, but I can afford to spend only limited time and mental energy on OpenBLAS right now and do not have time to read the linked SciPy issues.
100% agreed, not healthy at all. It's caused by the unhealthy vendoring process that's specific to Python's binary wheels for distributing on PyPI. In addition there may be some relevant change in one of the libraries involved, but that's unclear at this point.
Please don't worry about this issue, I'll assign myself here and won't ping you unless I'm sure it's an issue in OpenBLAS itself (and if so, hopefully with a clear diagnosis).
Thanks. I guess with run times in the low seconds, just having several competing BLAS libraries all race to create a bunch of idling (or even busy-waiting) threads would be noticeable. But as mentioned, there is an open issue pertaining to the multithreading threshold in SCAL, and there is also another where parallel POTRF seems to show unexpected serialization.
@rgommers If you manage to put together a local repro mechanism, let me know so I can write dscal as a native C loop in a PR and we can test Martin's suspicion about ?SCAL.
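Concretely, all I mean by a native loop is an untested sketch of the textbook version (the name `dscal_native` is just a placeholder; it mirrors the reference BLAS convention of returning early for non-positive `n` or `incx`):

```c
/* Plain single-threaded replacement for dscal: x[0:n:incx] *= alpha. */
static void dscal_native(const int n, const double alpha,
                         double *x, const int incx)
{
    if (n <= 0 || incx <= 0)
        return;

    if (incx == 1) {
        for (int i = 0; i < n; i++)
            x[i] *= alpha;
    } else {
        for (int i = 0; i < n * incx; i += incx)
            x[i] *= alpha;
    }
}
```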
If you can get it down to something mostly self-contained, it should also be trivial to adjust the threshold value in interface/scal.c (or the interface of any other BLAS function implicated). Actually, just knowing the element count submitted to any BLAS call would clarify whether it is going to take multithreaded code paths.
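If gathering those counts is easier than building a self-contained reproducer, one low-effort option is an LD_PRELOAD shim that interposes the Fortran `dscal_` symbol and logs `n` before forwarding the call. This is only a sketch: it assumes the LP64 (32-bit integer) interface, a glibc-style dynamic linker, and that the translated lbfgsb code calls the Fortran-style symbol rather than the CBLAS one.

```c
/* Sketch of an LD_PRELOAD shim that logs the vector length passed to dscal_.
   Build: gcc -shared -fPIC -o dscal_shim.so dscal_shim.c -ldl
   Run:   LD_PRELOAD=./dscal_shim.so python your_script.py
   Assumes the LP64 (32-bit integer) BLAS interface. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>

typedef void (*dscal_fn)(const int *n, const double *alpha,
                         double *x, const int *incx);

void dscal_(const int *n, const double *alpha, double *x, const int *incx)
{
    static dscal_fn real_dscal = NULL;

    /* resolve the real dscal_ from the BLAS library loaded after us */
    if (!real_dscal)
        real_dscal = (dscal_fn)dlsym(RTLD_NEXT, "dscal_");

    fprintf(stderr, "dscal_ called with n = %d\n", *n);
    real_dscal(n, alpha, x, incx);
}
```

The same pattern would work for `daxpy_`, `ddot_` and the other calls listed above if SCAL turns out not to be the one crossing the threshold.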