
perf report shows most cycles spent in blas_thread_server

Open nickdesaulniers opened this issue 10 months ago • 9 comments

Not sure this is necessarily an issue with OpenBLAS vs users of OpenBLAS (numpy, pytorch).

I'm seeing slow Python imports of pytorch; literally `import torch` takes multiple seconds on my system.

When I record the Python interpreter with Linux `perf record`, `perf report` shows most cycles are spent in blas_thread_server via BOTH liblapack.so.3 and libcblas.so.3, i.e.

Overhead  Command  Shared Object             Symbol
  40.31%  python   liblapack.so.3            [.] blas_thread_server
  36.85%  python   libcblas.so.3             [.] blas_thread_server

If I annotate either, most samples land near a read of the time stamp counter:

  0.31 │3c:┌─→mov   (%r15),%rax                                                                                      ▒
       │   │  cmp   $0x1,%rax                                                                                        ▒
       │   │↓ ja    b0                                                                                               ▒
       │   │  nop                                                                                                    ▒
       │   │  nop                                                                                                    ▒
       │   │  nop                                                                                                    ▒
       │   │  nop                                                                                                    ▒
       │   │  nop                                                                                                    ▒
       │   │  nop                                                                                                    ▒
  5.29 │   │  nop                                                                                                    ▒
       │   │  nop                                                                                                    ▒
       │   │  rdtsc                                                                                                  ◆
 91.82 │   │  sub   %ecx,%eax                                                                                        ▒
       │   │  cmp   %eax,thread_timeout                                                                              ▒
  2.59 │   └──jae   3c
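Reading the assembly above, it looks like a bounded busy-wait: poll a status word, read the TSC, and keep spinning until the elapsed cycle count exceeds thread_timeout. A rough Python analogue of that pattern (purely illustrative; `perf_counter_ns` stands in for rdtsc, and the function and parameter names are mine, not OpenBLAS's):

```python
import time

def bounded_spin(has_work, timeout_ns):
    """Illustrative analogue of blas_thread_server's spin loop:
    poll for work, and give up once the elapsed time exceeds the
    timeout (OpenBLAS then falls back to blocking instead)."""
    start = time.perf_counter_ns()  # stand-in for rdtsc
    while not has_work():
        if time.perf_counter_ns() - start > timeout_ns:
            return False  # timed out while polling
    return True
```

The hot rdtsc/`sub`/`cmp` triple in the annotation is the equivalent of the elapsed-time check inside this loop, which is why nearly all samples land there when the threads have nothing to do.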

I'm guessing that corresponds to code around here.

https://github.com/numpy/numpy/issues/24639 seems like someone else hit this, too, but...https://xkcd.com/979/.

How do I even go about debugging this further? Is it an issue in pytorch? numpy? openblas? PEBKAC?

Importing numpy alone doesn't seem problematic, though I suspect it's part of the dependency chain here. Perhaps it's related to how pytorch is (mis)using numpy?

nickdesaulniers avatar Mar 05 '25 23:03 nickdesaulniers

That reads like you have threads idling while there is nothing for them to do. What hardware is this, which version of OpenBLAS, and do you (or pytorch) do anything to constrain the number of threads OpenBLAS will use? (By default, it will start up as many as there are CPU cores.)
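One quick way to answer the thread-count question at runtime is the threadpoolctl package (which comes up later in this thread); a small sketch, guarded in case it isn't installed:

```python
# Sketch: list the native thread pools (BLAS/OpenMP) loaded in this
# process. Assumes the third-party threadpoolctl package is available.
try:
    from threadpoolctl import threadpool_info
    pools = threadpool_info()
except ImportError:
    pools = []  # threadpoolctl not installed

for pool in pools:
    # each entry reports the library, its version, and its thread count
    print(pool.get("internal_api"), pool.get("version"),
          pool.get("num_threads"))
```

Run after `import numpy` (or `import torch`), this shows which BLAS each loaded library is using and how many threads it started.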

martin-frbg avatar Mar 06 '25 08:03 martin-frbg

It'd be great to get an exact reproducer for this @nickdesaulniers. Without knowing how you installed numpy, pytorch and whatever else is in your environment (venv or conda env?), this isn't reproducible.

rgommers avatar Mar 06 '25 12:03 rgommers

what hardware is this

$ uname -m
x86_64

which version of OpenBLAS

Hard to say, exactly. Numpy can print some info about how it was configured, but I don't see any version info in its output.

https://numpy.org/doc/1.24/reference/generated/numpy.show_config.html

I am using numpy '1.23.5' from an environment managed by conda.

Given the .so files, is it possible to check the version from that?
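Yes, in principle: OpenBLAS exports openblas_get_config(), which returns the same configuration string mentioned below. A hedged sketch using ctypes (the library name is an assumption based on the perf output in this thread, and the helper name is mine):

```python
import ctypes

def openblas_version(libname="libcblas.so.3"):
    """Try to read the OpenBLAS config string via openblas_get_config().

    Returns None if the library can't be loaded or isn't OpenBLAS.
    """
    try:
        lib = ctypes.CDLL(libname)
        lib.openblas_get_config.restype = ctypes.c_char_p
        return lib.openblas_get_config().decode()
    except (OSError, AttributeError):
        return None

print(openblas_version() or "OpenBLAS not found")
```

On a setup like the one described here, this should print something like the `OpenBLAS 0.3.x ...` string that `strings` later turns up.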

do you (or pytorch) do anything to constrain the number of threads OpenBLAS will use ?

Unsure. What are some symbols I can grep for to check (in numpy and pytorch sources)?

It'd be great to get an exact reproducer for this @nickdesaulniers.

Yeah, sorry. I know it's not the greatest bug report ever.

Without knowing how you installed numpy, pytorch and whatever else is in your environment (venv or conda env?)

We're definitely using conda to manage our python packages.

Right, I'm not even sure myself if this is an issue with my env, conda's packaging/configuration of numpy or openblas itself. Filing something in case this has been brought up before and is familiar, or for future travelers to find a similar thread.

nickdesaulniers avatar Mar 06 '25 18:03 nickdesaulniers

what hardware is this

$ uname -m
x86_64

thanks - that's a bit more specific than "a computer", but it would help to know whether it has something like 4 or 400 cores?

Hard to say, exactly. Numpy can print some info about how it was configured, but I don't see any version info in its output.

Strange, I'd think Numpy's show_config displays the version information, unless you are using a really ancient version of OpenBLAS that does not yet include the version number in its response to openblas_get_config() - this was added in 0.3.4 a bit over 6 years ago...

Unsure. What are some symbols I can grep for to check (in numpy and pytorch sources)?

See if you can find any calls to openblas_set_num_threads() - or anything that sets the environment variable OPENBLAS_NUM_THREADS.

You could also try to constrain it yourself by setting OPENBLAS_NUM_THREADS to a small(ish) value before starting your numpy/pytorch environment. Depending on the problem (matrix) sizes you are working with, a large number of threads may not be beneficial: each needs to allocate memory on startup, and most will end up idling, lying in ambush for the occasional 4x4 matrix to stumble by.
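For reference, the same constraint can be applied from inside Python, as long as it happens before the BLAS library is first loaded; a sketch (the value 8 is an arbitrary example, not a recommendation):

```python
import os

# OpenBLAS reads these environment variables when the shared library is
# loaded, so they must be set before the first `import numpy` or
# `import torch` in the process.
os.environ["OPENBLAS_NUM_THREADS"] = "8"  # arbitrary example value
os.environ["OMP_NUM_THREADS"] = "8"       # also covers OpenMP-built BLAS

# import numpy as np  # would now initialize with at most 8 BLAS threads
```

Setting them after the import has no effect on an already-initialized pool, which is a common pitfall with this approach.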

martin-frbg avatar Mar 07 '25 15:03 martin-frbg

thanks - that's a bit more specific than "a computer", but it might help to know if it has like 4 or 400 cores ?

$ lscpu
...
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         48 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  256
  On-line CPU(s) list:   0-255
Vendor ID:               AuthenticAMD
  Model name:            AMD EPYC 7763 64-Core Processor
    CPU family:          25
    Model:               1
    Thread(s) per core:  2
    Core(s) per socket:  64
    Socket(s):           2
...

I've definitely seen strange performance from multithreaded code on dual-socket NUMA machines before; I wonder if that's related, or just a red herring. IIRC, limiting processes to a single socket with numactl (man 8 numactl) produced speedups. Maybe I should play with that again here and see if it makes a difference.

Strange, I'd think Numpy's show_config displays the version information, unless you are using a really ancient version of OpenBLAS that does not yet include the version number in its response to openblas_get_config() - this was added in 0.3.4 a bit over 6 years ago...

$ python3
>>> import numpy as np
>>> np.show_config()
blas_info:
    libraries = ['cblas', 'blas', 'cblas', 'blas']
    library_dirs = ['<redacted>/conda/envs/<redacted>/lib']
    include_dirs = ['<redacted>/conda/envs/<redacted>/include']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
blas_opt_info:
    define_macros = [('NO_ATLAS_INFO', 1), ('HAVE_CBLAS', None)]
    libraries = ['cblas', 'blas', 'cblas', 'blas']
    library_dirs = ['<redacted>/conda/envs/<redacted>/lib']
    include_dirs = ['<redacted>/conda/envs/<redacted>/include']
    language = c
lapack_info:
    libraries = ['lapack', 'blas', 'lapack', 'blas']
    library_dirs = ['<redacted>/conda/envs/<redacted>/lib']
    language = f77
lapack_opt_info:
    libraries = ['lapack', 'blas', 'lapack', 'blas', 'cblas', 'blas', 'cblas', 'blas']
    library_dirs = ['<redacted>/conda/envs/<redacted>/lib']
    language = c
    define_macros = [('NO_ATLAS_INFO', 1), ('HAVE_CBLAS', None)]
    include_dirs = ['<redacted>/conda/envs/<redacted>/include']
Supported SIMD extensions in this NumPy install:
    baseline = SSE,SSE2,SSE3
    found = SSSE3,SSE41,POPCNT,SSE42,AVX,F16C,FMA3,AVX2
    not found = AVX512F,AVX512CD,AVX512_KNL,AVX512_KNM,AVX512_SKX,AVX512_CLX,AVX512_CNL,AVX512_ICL

>>> print (np.__file__)
<redacted>/conda/envs/<redacted>/lib/python3.11/site-packages/numpy/__init__.py

$ strings <redacted>/sw/conda/envs/<redacted>/lib/python3.11/site-packages/numpy/core/libcblas.so.3 | grep -i 0\\.
OpenBLAS 0.3.29 DYNAMIC_ARCH NO_CH NO_AFFINITY
...

so I guess I'm running 0.3.29, which at the moment looks like your latest release.

Looks like CNAME is the function that returns that configuration string. A quick grep for CNAME turned up no hits in numpy's sources, so I don't think numpy exposes the exact OpenBLAS version string to users in Python. But I don't know much about Python-to-C bindings, so I may have missed it with my simple grep.

Oh, they do have this but it doesn't seem to work.

See if you can find any calls to openblas_set_num_threads() - or anything that sets the environment variable OPENBLAS_NUM_THREADS.

No hits for openblas_set_num_threads or OPENBLAS_NUM_THREADS in numpy sources.

You could also try to constrain it yourself, by setting OPENBLAS_NUM_THREADS to a small(ish) value before starting your numpy/pytorch environment.

$ for i in $(seq 1 30); do python -c "import time; z=time.time(); import torch; print(time.time() - z)" >> /tmp/allthreads.txt; done
$ awk '{s+=$1} END {print "ave:", s/NR}' /tmp/allthreads.txt
ave: 1.71255

$ for i in $(seq 1 30); do OPENBLAS_NUM_THREADS=1 python -c "import time; z=time.time(); import torch; print(time.time() - z)" >> /tmp/onethread.txt; done
$ awk '{s+=$1} END {print "ave:", s/NR}' /tmp/onethread.txt
ave: 1.97009

Didn't seem to make a difference, but there's a lot of variance.
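The same comparison can be collected without shell loops; a sketch (time_import is my name, and each run spawns a fresh interpreter so every import is cold):

```python
import os
import statistics
import subprocess
import sys
import time

def time_import(module="torch", runs=5, env=None):
    """Time `import <module>` in fresh interpreter processes.

    `env`, when given, fully replaces the child environment, so extend
    os.environ rather than passing a bare dict.
    """
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        subprocess.run([sys.executable, "-c", f"import {module}"],
                       env=env, check=True)
        samples.append(time.perf_counter() - t0)
    return statistics.mean(samples)

# Example comparison (torch assumed installed, as in this thread):
# baseline = time_import()
# limited = time_import(env={**os.environ, "OPENBLAS_NUM_THREADS": "1"})
```

This also makes it easy to report the spread across runs (e.g. with statistics.stdev), which matters given the variance noted above.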

nickdesaulniers avatar Mar 07 '25 17:03 nickdesaulniers

Ok, thanks. No idea why the show_config() would not be more specific - besides the version number, it should tell us the maximum number of threads the library was built for (which may well be smaller than 256). The symbol name CNAME gets filled in by the build system, so that is the openblas_get_config() function I mentioned.

NUMA effects might play a role, though probably not already during initialization; I'd think if anything it would increase latency if threads started wandering between nodes, or data was shared between threads on different nodes under load. But startup being actually slower with just a single thread would seem to rule out NUMA (as well as anything else about allocating memory buffers per thread, etc.).

Maybe the problem is not in OpenBLAS itself, and the idling that is so prominent in your perf record happens only after whatever causes the initial slowness. BTW what happens when you try an intermediate number, say OPENBLAS_NUM_THREADS=32 ?

martin-frbg avatar Mar 07 '25 18:03 martin-frbg

I've seen similar issues. In a brief exploration it seemed all the threads basically busy-wait for something like 100ms after they get any work. So in numpy, if you invert a single 4x4 matrix every 100ms, you end up with something that uses about 3 full cores of CPU unless you limit the max BLAS threads to 1.

Here's some example code:

import numpy as np
import time

# uncomment me to "fix" cpu usage
#from threadpoolctl import threadpool_limits
#threadpool_limits(limits=1, user_api='blas')

mat = np.eye(4)

while True:
    mat = np.linalg.inv(mat)
    time.sleep(0.1)

I will note that this issue appears to go away in numpy >= 2

matthew-lidar avatar Apr 08 '25 23:04 matthew-lidar

fwiw, while debugging a similar issue in our environment, I discovered you can tune the busy-loop behavior using OPENBLAS_THREAD_TIMEOUT. If I modify your script to loop for 50 iterations, I see this result:

$ time python busyloop.py
python busyloop.py  2.79s user 18.97s system 416% cpu 5.222 total
$ time OPENBLAS_THREAD_TIMEOUT=1 python busyloop.py
OPENBLAS_THREAD_TIMEOUT=1 python busyloop.py  0.14s user 0.07s system 4% cpu 5.207 total
$ time OMP_NUM_THREADS=1 python busyloop.py
OMP_NUM_THREADS=1 python busyloop.py  0.11s user 0.06s system 3% cpu 5.173 total

nelhage avatar May 15 '25 18:05 nelhage

OPENBLAS_THREAD_TIMEOUT is the log2 of the number of rdtsc cycles to spin, and is clamped to be in [4, 30]: https://github.com/OpenMathLib/OpenBLAS/blob/8afddc1a819f81cce2982bd51ddb7d2403929648/driver/others/blas_server.c#L572-L577
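In other words, the value is an exponent, not a cycle count. A small Python sketch of that clamping (the function name is mine, and the default exponent is a parameter here because the build-time default isn't shown in the linked snippet):

```python
def spin_cycles(env_value, default_exp=26):
    """Sketch of the OPENBLAS_THREAD_TIMEOUT handling linked above:
    the env var is an exponent, clamped to [4, 30], and the server
    spins for 2**exp rdtsc cycles before going to sleep."""
    exp = default_exp if env_value is None else int(env_value)
    exp = max(4, min(30, exp))
    return 2 ** exp

# OPENBLAS_THREAD_TIMEOUT=1 therefore gives the minimum spin of
# 2**4 cycles, which explains the dramatic drop in CPU time above.
```

So going from the default to `OPENBLAS_THREAD_TIMEOUT=1` shrinks the busy-wait window by many orders of magnitude, matching the `416% cpu` vs `4% cpu` results in the previous comment.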

nelhage avatar May 15 '25 18:05 nelhage