?getrf performance degrades after some cache limit passed
I am using an OpenBLAS copy built on Windows with MinGW64, which I then used to build NumPy and SciPy. The build script I used is:
march="x86-64"
extra="-fno-asynchronous-unwind-tables"
vc_arch="X64"
cflags="-O2 -march=$march -mtune=generic $extra"
fflags="$cflags -frecursive -ffpe-summary=invalid,zero"
# Build name for output library from gcc version and OpenBLAS commit.
GCC_TAG="gcc_$(gcc -dumpversion | tr .- _)"
OPENBLAS_VERSION=$(git describe --tags)
# Build OpenBLAS
# Variable used in creating output libraries
export LIBNAMESUFFIX=${OPENBLAS_VERSION}-${GCC_TAG}
make BINARY=$BUILD_BITS DYNAMIC_ARCH=1 USE_THREAD=1 USE_OPENMP=0 NO_WARMUP=1 BUILD_LAPACK_DEPRECATED=0 COMMON_OPT="$cflags" FCOMMON_OPT="$fflags"
make install PREFIX=$OPENBLAS_ROOT/$BUILD_BITS
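To double-check which kernel set the resulting DYNAMIC_ARCH library actually picks at runtime, one option is to query it directly through ctypes; this is only a sketch, and the DLL path is a placeholder for wherever the build was installed:
import ctypes

# Placeholder path; point this at the libopenblas DLL produced by the build above.
lib = ctypes.CDLL(r"C:\path\to\libopenblas.dll")
lib.openblas_get_corename.restype = ctypes.c_char_p
lib.openblas_get_config.restype = ctypes.c_char_p

print(lib.openblas_get_corename().decode())  # e.g. "Haswell" or "SkylakeX"
print(lib.openblas_get_config().decode())    # build options, version, max thread count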
I have a Cythonized code that provides quite nice speedups up to a point, but past that point, which I presumed to be a cache limit, the performance difference disappears and both versions end up tied. After optimizing everything else, I started line-tracing, and indeed there is a critical point at some machine-dependent threshold, which I narrowed down to ?getrf. Just by increasing the problem size I can consistently replicate the issue. Typical results I keep seeing are these sudden performance jumps.
I am not sure if this is due to wrong branching (hence falling through to ?getrf2) or some build issue. Here are some funny-looking stats:
import numpy as np
import scipy.linalg as la
for n in range(50, 120, 5):
    zzz = np.random.rand(n, n)
    %timeit la.lu_factor(zzz)
33.4 µs ± 889 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
39.1 µs ± 1.17 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
42.4 µs ± 1.01 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
51.2 µs ± 3.63 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
50.9 µs ± 1.07 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
55.7 µs ± 724 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
58.7 µs ± 743 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
69.6 µs ± 1.42 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
74.6 µs ± 981 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
85.2 µs ± 1.03 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
1.8 ms ± 56.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) # <--- Here is the jump
1.86 ms ± 80.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
2.04 ms ± 259 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.09 ms ± 14.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
This is visible on both of my machines (both built at the 0.3.18 tag): a Dell XPS 15 from 6 years ago and a very new XPS 15. Both give the following.
Old machine:
AddressWidth=64
Architecture=9
Availability=3
Caption=Intel64 Family 6 Model 94 Stepping 3
ConfigManagerErrorCode=
ConfigManagerUserConfig=
CpuStatus=1
CreationClassName=Win32_Processor
CurrentClockSpeed=2601
CurrentVoltage=9
DataWidth=64
Description=Intel64 Family 6 Model 94 Stepping 3
DeviceID=CPU0
ErrorCleared=
ErrorDescription=
ExtClock=100
Family=198
InstallDate=
L2CacheSize=1024
L2CacheSpeed=
LastErrorCode=
Level=6
LoadPercentage=5
Manufacturer=GenuineIntel
MaxClockSpeed=2601
Name=Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
OtherFamilyDescription=
PNPDeviceID=
PowerManagementCapabilities=
PowerManagementSupported=FALSE
ProcessorId=BFEBFBFF000506E3
ProcessorType=3
Revision=24067
Role=CPU
SocketDesignation=U3E1
Status=OK
StatusInfo=3
Stepping=
SystemCreationClassName=Win32_ComputerSystem
SystemName=XXXXXX
UniqueId=
UpgradeMethod=1
Version=
VoltageCaps=
And the new machine:
AddressWidth=64
Architecture=9
Availability=3
Caption=Intel64 Family 6 Model 141 Stepping 1
ConfigManagerErrorCode=
ConfigManagerUserConfig=
CpuStatus=1
CreationClassName=Win32_Processor
CurrentClockSpeed=2304
CurrentVoltage=7
DataWidth=64
Description=Intel64 Family 6 Model 141 Stepping 1
DeviceID=CPU0
ErrorCleared=
ErrorDescription=
ExtClock=100
Family=198
InstallDate=
L2CacheSize=10240
L2CacheSpeed=
LastErrorCode=
Level=6
LoadPercentage=31
Manufacturer=GenuineIntel
MaxClockSpeed=2304
Name=11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz
OtherFamilyDescription=
PNPDeviceID=
PowerManagementCapabilities=
PowerManagementSupported=FALSE
ProcessorId=BFEBFBFF000806D1
ProcessorType=3
Revision=
Role=CPU
SocketDesignation=U3E1
Status=OK
StatusInfo=3
Stepping=
SystemCreationClassName=Win32_ComputerSystem
SystemName=XXXXX
UniqueId=
UpgradeMethod=1
Version=
VoltageCaps=
I am not sure how this can come about, since I think OpenBLAS has rolled its own multithreaded ?getrf and, as far as I know, it hasn't caused any issues so far. This led me to the conclusion that maybe the target is not correct in my build, or, rather unlikely, that I'm hitting an obscure bug somewhere.
Old machine NumPy config:
blas_mkl_info:
NOT AVAILABLE
blis_info:
NOT AVAILABLE
openblas_info:
library_dirs = ['C:\\Users\\Ilhan Polat\\Documents\\GitHub\\numpy\\build\\openblas_info']
libraries = ['openblas_info']
language = f77
define_macros = [('HAVE_CBLAS', None)]
blas_opt_info:
library_dirs = ['C:\\Users\\Ilhan Polat\\Documents\\GitHub\\numpy\\build\\openblas_info']
libraries = ['openblas_info']
language = f77
define_macros = [('HAVE_CBLAS', None)]
lapack_mkl_info:
NOT AVAILABLE
openblas_lapack_info:
library_dirs = ['C:\\Users\\Ilhan Polat\\Documents\\GitHub\\numpy\\build\\openblas_lapack_info']
libraries = ['openblas_lapack_info']
language = f77
define_macros = [('HAVE_CBLAS', None)]
lapack_opt_info:
library_dirs = ['C:\\Users\\Ilhan Polat\\Documents\\GitHub\\numpy\\build\\openblas_lapack_info']
libraries = ['openblas_lapack_info']
language = f77
define_macros = [('HAVE_CBLAS', None)]
Supported SIMD extensions in this NumPy install:
baseline = SSE,SSE2,SSE3
found = SSSE3,SSE41,POPCNT,SSE42,AVX,F16C,FMA3,AVX2
not found = AVX512F,AVX512CD,AVX512_SKX,AVX512_CLX,AVX512_CNL,AVX512_ICL
New machine NumPy config:
blas_mkl_info:
NOT AVAILABLE
blis_info:
NOT AVAILABLE
openblas_info:
library_dirs = ['C:\\Users\\ilhan\\Documents\\GitHub\\numpy\\build\\openblas_info']
libraries = ['openblas_info']
language = f77
define_macros = [('HAVE_CBLAS', None)]
blas_opt_info:
library_dirs = ['C:\\Users\\ilhan\\Documents\\GitHub\\numpy\\build\\openblas_info']
libraries = ['openblas_info']
language = f77
define_macros = [('HAVE_CBLAS', None)]
lapack_mkl_info:
NOT AVAILABLE
openblas_lapack_info:
library_dirs = ['C:\\Users\\ilhan\\Documents\\GitHub\\numpy\\build\\openblas_lapack_info']
libraries = ['openblas_lapack_info']
language = f77
define_macros = [('HAVE_CBLAS', None)]
lapack_opt_info:
library_dirs = ['C:\\Users\\ilhan\\Documents\\GitHub\\numpy\\build\\openblas_lapack_info']
libraries = ['openblas_lapack_info']
language = f77
define_macros = [('HAVE_CBLAS', None)]
Supported SIMD extensions in this NumPy install:
baseline = SSE,SSE2,SSE3
found = SSSE3,SSE41,POPCNT,SSE42,AVX,F16C,FMA3,AVX2,AVX512F,AVX512CD,AVX512_SKX,AVX512_CLX,AVX512_CNL,AVX512_ICL
not found =
I am not sure if this is sufficient information; I don't know how to pull more intricate details, but I'd be happy to provide more if you can give me some pointers.
Have not looked at this in detail yet, but dgetrf switches between single- and multithreading at m*n = 10000, irrespective of cache size or CPU model (interface/lapack/getrf.c).
Within getrf, forwarding to getf2 happens only when the lesser of m and n is smaller than 10 for any submatrix (if I read the formula for "blocking" in lapack/getrf/getrf_(single|parallel).c correctly).
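If that crossover is what the benchmark is hitting, the jump for square matrices should sit right around n = 100 (100*100 = 10000). A minimal sketch to check this, assuming SciPy is linked against the OpenBLAS build in question:
import timeit
import numpy as np
import scipy.linalg as la

# interface/lapack/getrf.c reportedly goes multithreaded once m*n exceeds
# 10000, so square matrices straddle the threshold around n = 100.
for n in (95, 98, 99, 100, 101, 102, 105):
    a = np.random.rand(n, n)
    t = min(timeit.repeat(lambda: la.lu_factor(a), number=1000, repeat=5)) / 1000
    print(f"n = {n:3d}: {t * 1e6:8.1f} µs per call")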
Yes indeed, and there we start seeing what seems to be a lot of cache misses.
The switch between single- and multithreading was introduced by me in 0.3.16; maybe I did not benchmark carefully enough. The conditional for forwarding to getf2 seems to be basically unchanged from GotoBLAS' time.
Hmm, maybe I should test it with previous versions then.
I don't think the switching itself is the problem, but after switching, the blocking or recursion parameters maybe don't match the architecture's cache sizes, or something like that. I can't imagine that this went unnoticed worldwide for such a central subroutine.
Switching happens before any blocking or recursion factors are calculated, so all I can think of at the moment is that the crossover point (in terms of matrix size) may be misplaced a bit - or it may have been only some artefact of my hardware or benchmarks that made the change look beneficial. (And I can never be sure who actually uses - and benchmarks - the latest release rather than what came with their Linux distribution of choice or was conveniently included in some other software package that only gets updated every other year. And of those who do, how many will raise the alarm rather than quietly roll back to their previous version...)
Ah :) I know exactly how you feel when I see SciPy versions from 2016.
So my presumption was wrong, since I was reading the Netlib implementation (line 164) and thought that there were some runtime decisions (typically via ilaenv in the reference implementations), but apparently not.
I'll try to dig a bit deeper and in the meantime I've asked other users to provide some input in the linked SciPy issue. Thanks Martin, always helpful.
Can you make another graph setting OPENBLAS_NUM_THREADS=1? Maybe we will see the breaking point where the two graphs happen to cross.
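For reference, a minimal way to do that single-threaded rerun is to pin the thread count before NumPy/SciPy load the OpenBLAS DLL; the loop is just the benchmark from above:
import os
os.environ["OPENBLAS_NUM_THREADS"] = "1"  # must be set before the BLAS library is loaded

import timeit
import numpy as np
import scipy.linalg as la

for n in range(50, 120, 5):
    a = np.random.rand(n, n)
    t = min(timeit.repeat(lambda: la.lu_factor(a), number=1000, repeat=5)) / 1000
    print(f"n = {n:3d}: {t * 1e6:8.1f} µs per call")
Alternatively, threadpoolctl's threadpool_limits(limits=1, user_api="blas") can toggle this at runtime without restarting the interpreter.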
I have made some interesting progress. I used MKL just to check, and all the jumps disappeared on both machines. Then I returned to OpenBLAS and checked ?copy and ?scal behavior with increasing sizes, and they showed milder versions of such jumps around 150 and 300+.
This, and also the recent issue https://github.com/scipy/scipy/issues/14886, makes me believe that there might be an inaccurate architecture selection somewhere. But this is all speculation, of course. If you have any experiments that I can perform to narrow this down further, please let me know.
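For completeness, the ?copy/?scal sweep was along these lines; this is a rough sketch using the low-level wrappers SciPy exposes, and the sizes and repeat counts here are not the exact ones used:
import timeit
import numpy as np
from scipy.linalg.blas import dcopy, dscal

# Sweep vector lengths and time the raw level-1 BLAS calls;
# milder jumps were observed around n ~ 150 and 300+.
for n in range(50, 500, 25):
    x = np.random.rand(n)
    y = np.empty_like(x)
    t_copy = min(timeit.repeat(lambda: dcopy(x, y), number=10000, repeat=5)) / 10000
    t_scal = min(timeit.repeat(lambda: dscal(1.0001, x), number=10000, repeat=5)) / 10000
    print(f"n = {n:3d}: dcopy {t_copy * 1e9:7.1f} ns   dscal {t_scal * 1e9:7.1f} ns")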
Just stealing what I have learned from the sklearn folks, here is something I have found out about the OpenBLAS versions in play. I'll try to fix that first.
python -m threadpoolctl -i sklearn
Core: Haswell
Core: Haswell
[
{
"user_api": "openmp",
"internal_api": "openmp",
"prefix": "vcomp",
"filepath": "C:\\Users\\Ilhan Polat\\AppData\\Local\\Programs\\Python\\Python39\\Lib\\site-packages\\sklearn\\.libs\\vcomp140.dll",
"version": null,
"num_threads": 8
},
{
"user_api": "blas",
"internal_api": "openblas",
"prefix": "libopenblas",
"filepath": "C:\\Users\\Ilhan Polat\\AppData\\Local\\Programs\\Python\\Python39\\Lib\\site-packages\\numpy-1.22.0.dev0+1754.ge7f773cf4-py3.9-win-amd64.egg\\numpy\\.libs\\libopenblas.J5JINWQ3YOZOQH4CCKPXHDQVK5OI6HXB.gfortran-win_amd64.dll",
"version": "0.3.18",
"threading_layer": "pthreads",
"architecture": "Haswell",
"num_threads": 8
},
{
"user_api": "blas",
"internal_api": "openblas",
"prefix": "libopenblas",
"filepath": "C:\\Users\\Ilhan Polat\\AppData\\Local\\Programs\\Python\\Python39\\Lib\\site-packages\\scipy-1.8.0.dev0+2047.048bac7-py3.9-win-amd64.egg\\scipy\\.libs\\libopenblas.RCXR3FYXQ7KLTLGA2TVNYAUS6DAQNZ6T.gfortran-win_amd64.dll",
"version": "0.3.17",
"threading_layer": "pthreads",
"architecture": "Haswell",
"num_threads": 8
}
]
This has something to do with my compilation on Windows 10. I have a Tiger Lake CPU, and when I compile with MinGW UCRT64 it selects the SkylakeX target, which later leads to strange segfaults that I don't have sufficient debug-fu to understand. Changing to a "generic"-compiled version solves many of my issues, so I'll just close this to reduce clutter.
You could try the Haswell target; that's close enough to SkylakeX and is 5 years older.
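If the build in question was made with DYNAMIC_ARCH=1, the kernel set can also be overridden at runtime through the OPENBLAS_CORETYPE environment variable, which makes it easy to compare targets without rebuilding (a static build would instead need TARGET=HASWELL at compile time). A minimal sketch:
import os
# Only honored by DYNAMIC_ARCH builds, and must be set before the DLL is loaded.
os.environ["OPENBLAS_CORETYPE"] = "Haswell"

import numpy as np
import scipy.linalg as la

a = np.random.rand(200, 200)
la.lu_factor(a)  # should now run on the Haswell kernels instead of SkylakeX
The selected architecture can then be verified with threadpoolctl, as in the output shown above.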