?getrf performance degrades after some cache limit passed
I am using an OpenBLAS copy built on Windows with MinGW64, which I then used to build NumPy and SciPy. The build script I used is:
march="x86-64"
extra="-fno-asynchronous-unwind-tables"
vc_arch="X64"
cflags="-O2 -march=$march -mtune=generic $extra"
fflags="$cflags -frecursive -ffpe-summary=invalid,zero"
# Build name for output library from gcc version and OpenBLAS commit.
GCC_TAG="gcc_$(gcc -dumpversion | tr .- _)"
OPENBLAS_VERSION=$(git describe --tags)
# Build OpenBLAS
# Variable used in creating output libraries
export LIBNAMESUFFIX=${OPENBLAS_VERSION}-${GCC_TAG}
make BINARY=$BUILD_BITS DYNAMIC_ARCH=1 USE_THREAD=1 USE_OPENMP=0 NO_WARMUP=1 BUILD_LAPACK_DEPRECATED=0 COMMON_OPT="$cflags" FCOMMON_OPT="$fflags"
make install PREFIX=$OPENBLAS_ROOT/$BUILD_BITS
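To double-check which kernel set the resulting DYNAMIC_ARCH library actually picks at runtime, one option is to query it directly through ctypes; this is only a sketch, and the DLL path is a placeholder for wherever the build was installed:
import ctypes

# Placeholder path; point this at the libopenblas DLL produced by the build above.
lib = ctypes.CDLL(r"C:\path\to\libopenblas.dll")
lib.openblas_get_corename.restype = ctypes.c_char_p
lib.openblas_get_config.restype = ctypes.c_char_p

print(lib.openblas_get_corename().decode())  # e.g. "Haswell" or "SkylakeX"
print(lib.openblas_get_config().decode())    # build options, version, max thread count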
I have a Cythonized code that provides quite nice speedups up to a point, but past that point, which I presumed to be a cache limit, the performance difference disappears and both versions end up tied. After optimizing everything else, I started line-tracing, and indeed there is a critical point at some machine-dependent threshold, which I narrowed down to ?getrf. Just by increasing the problem size I can consistently replicate the issue. Typical results I keep seeing are these sudden performance jumps.
I am not sure if this is due to wrong branching (hence falling through to ?getrf2) or some build issue. Here are some funny-looking stats:
import numpy as np
import scipy.linalg as la
for n in range(50, 120, 5):
    zzz = np.random.rand(n, n)
    %timeit la.lu_factor(zzz)
33.4 µs ± 889 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
39.1 µs ± 1.17 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
42.4 µs ± 1.01 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
51.2 µs ± 3.63 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
50.9 µs ± 1.07 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
55.7 µs ± 724 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
58.7 µs ± 743 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
69.6 µs ± 1.42 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
74.6 µs ± 981 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
85.2 µs ± 1.03 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
1.8 ms ± 56.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) # <--- Here is the jump
1.86 ms ± 80.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
2.04 ms ± 259 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.09 ms ± 14.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
This is visible on both of my machines (both built at the 0.3.18 tag): a Dell XPS 15 from 6 years ago and a very new XPS 15. Both give the following.
Old machine:
AddressWidth=64
Architecture=9
Availability=3
Caption=Intel64 Family 6 Model 94 Stepping 3
ConfigManagerErrorCode=
ConfigManagerUserConfig=
CpuStatus=1
CreationClassName=Win32_Processor
CurrentClockSpeed=2601
CurrentVoltage=9
DataWidth=64
Description=Intel64 Family 6 Model 94 Stepping 3
DeviceID=CPU0
ErrorCleared=
ErrorDescription=
ExtClock=100
Family=198
InstallDate=
L2CacheSize=1024
L2CacheSpeed=
LastErrorCode=
Level=6
LoadPercentage=5
Manufacturer=GenuineIntel
MaxClockSpeed=2601
Name=Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz
OtherFamilyDescription=
PNPDeviceID=
PowerManagementCapabilities=
PowerManagementSupported=FALSE
ProcessorId=BFEBFBFF000506E3
ProcessorType=3
Revision=24067
Role=CPU
SocketDesignation=U3E1
Status=OK
StatusInfo=3
Stepping=
SystemCreationClassName=Win32_ComputerSystem
SystemName=XXXXXX
UniqueId=
UpgradeMethod=1
Version=
VoltageCaps=
And the new machine:
AddressWidth=64
Architecture=9
Availability=3
Caption=Intel64 Family 6 Model 141 Stepping 1
ConfigManagerErrorCode=
ConfigManagerUserConfig=
CpuStatus=1
CreationClassName=Win32_Processor
CurrentClockSpeed=2304
CurrentVoltage=7
DataWidth=64
Description=Intel64 Family 6 Model 141 Stepping 1
DeviceID=CPU0
ErrorCleared=
ErrorDescription=
ExtClock=100
Family=198
InstallDate=
L2CacheSize=10240
L2CacheSpeed=
LastErrorCode=
Level=6
LoadPercentage=31
Manufacturer=GenuineIntel
MaxClockSpeed=2304
Name=11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz
OtherFamilyDescription=
PNPDeviceID=
PowerManagementCapabilities=
PowerManagementSupported=FALSE
ProcessorId=BFEBFBFF000806D1
ProcessorType=3
Revision=
Role=CPU
SocketDesignation=U3E1
Status=OK
StatusInfo=3
Stepping=
SystemCreationClassName=Win32_ComputerSystem
SystemName=XXXXX
UniqueId=
UpgradeMethod=1
Version=
VoltageCaps=
I am not sure how this can come about, since I think OpenBLAS has rolled its own multithreaded ?getrf and, as far as I know, it hasn't caused any issues so far. This led me to the conclusion that maybe the target is not correct in my build, or, rather unlikely, that I'm hitting an obscure bug somewhere.
Old machine NumPy config:
blas_mkl_info:
NOT AVAILABLE
blis_info:
NOT AVAILABLE
openblas_info:
library_dirs = ['C:\\Users\\Ilhan Polat\\Documents\\GitHub\\numpy\\build\\openblas_info']
libraries = ['openblas_info']
language = f77
define_macros = [('HAVE_CBLAS', None)]
blas_opt_info:
library_dirs = ['C:\\Users\\Ilhan Polat\\Documents\\GitHub\\numpy\\build\\openblas_info']
libraries = ['openblas_info']
language = f77
define_macros = [('HAVE_CBLAS', None)]
lapack_mkl_info:
NOT AVAILABLE
openblas_lapack_info:
library_dirs = ['C:\\Users\\Ilhan Polat\\Documents\\GitHub\\numpy\\build\\openblas_lapack_info']
libraries = ['openblas_lapack_info']
language = f77
define_macros = [('HAVE_CBLAS', None)]
lapack_opt_info:
library_dirs = ['C:\\Users\\Ilhan Polat\\Documents\\GitHub\\numpy\\build\\openblas_lapack_info']
libraries = ['openblas_lapack_info']
language = f77
define_macros = [('HAVE_CBLAS', None)]
Supported SIMD extensions in this NumPy install:
baseline = SSE,SSE2,SSE3
found = SSSE3,SSE41,POPCNT,SSE42,AVX,F16C,FMA3,AVX2
not found = AVX512F,AVX512CD,AVX512_SKX,AVX512_CLX,AVX512_CNL,AVX512_ICL
New machine NumPy config:
blas_mkl_info:
NOT AVAILABLE
blis_info:
NOT AVAILABLE
openblas_info:
library_dirs = ['C:\\Users\\ilhan\\Documents\\GitHub\\numpy\\build\\openblas_info']
libraries = ['openblas_info']
language = f77
define_macros = [('HAVE_CBLAS', None)]
blas_opt_info:
library_dirs = ['C:\\Users\\ilhan\\Documents\\GitHub\\numpy\\build\\openblas_info']
libraries = ['openblas_info']
language = f77
define_macros = [('HAVE_CBLAS', None)]
lapack_mkl_info:
NOT AVAILABLE
openblas_lapack_info:
library_dirs = ['C:\\Users\\ilhan\\Documents\\GitHub\\numpy\\build\\openblas_lapack_info']
libraries = ['openblas_lapack_info']
language = f77
define_macros = [('HAVE_CBLAS', None)]
lapack_opt_info:
library_dirs = ['C:\\Users\\ilhan\\Documents\\GitHub\\numpy\\build\\openblas_lapack_info']
libraries = ['openblas_lapack_info']
language = f77
define_macros = [('HAVE_CBLAS', None)]
Supported SIMD extensions in this NumPy install:
baseline = SSE,SSE2,SSE3
found = SSSE3,SSE41,POPCNT,SSE42,AVX,F16C,FMA3,AVX2,AVX512F,AVX512CD,AVX512_SKX,AVX512_CLX,AVX512_CNL,AVX512_ICL
not found =
I am not sure if this is sufficient information; I don't know how to pull more intricate details, but I'd be happy to provide more if you can give me some pointers.
Have not looked at this in detail yet, but dgetrf switches between single- and multithreading at m*n = 10000, irrespective of cache size or CPU model (interface/lapack/getrf.c).
Within getrf, forwarding to getf2 happens only when the lesser of m and n is smaller than 10 for any submatrix (if I read the formula for "blocking" in lapack/getrf/getrf_(single|parallel).c correctly).
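If that crossover is what the benchmark is hitting, the jump for square matrices should sit right around n = 100 (100*100 = 10000). A minimal sketch to check this, assuming SciPy is linked against the OpenBLAS build in question:
import timeit
import numpy as np
import scipy.linalg as la

# interface/lapack/getrf.c reportedly goes multithreaded once m*n exceeds
# 10000, so square matrices straddle the threshold around n = 100.
for n in (95, 98, 99, 100, 101, 102, 105):
    a = np.random.rand(n, n)
    t = min(timeit.repeat(lambda: la.lu_factor(a), number=1000, repeat=5)) / 1000
    print(f"n = {n:3d}: {t * 1e6:8.1f} µs per call")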
Yes indeed, and there we start seeing what seems to be a lot of cache misses.
The switch between single- and multithreading was introduced by me in 0.3.16; maybe I did not benchmark carefully enough. The conditional for forwarding to getf2 seems to be basically unchanged from GotoBLAS' time.
Hmm, maybe I should test it with previous versions then.
I don't think the switching itself is the problem, but after switching, the blocking or recursion parameters maybe don't match the architecture's cache sizes, or something like that. I can't imagine that this went unnoticed worldwide for such a central subroutine.
Switching happens before any blocking or recursion factors are calculated, so all I can think of at the moment is that the crossover point (in terms of matrix size) may be misplaced a bit - or it may have been only some artefact of my hardware or benchmarks that made the change look beneficial. (And I can never be sure who actually uses - and benchmarks - the latest release rather than what came with their Linux distribution of choice or was conveniently included in some other software package that only gets updated every other year. And of those who do, how many will raise the alarm rather than quietly roll back to their previous version...)
Ah :) I know exactly how you feel when I see SciPy versions from 2016.
So my presumption was wrong, since I was reading the Netlib implementation (line 164) and thought that there were some runtime decisions (typically via ilaenv in the reference implementations), but apparently not.
I'll try to dig a bit deeper and in the meantime I've asked other users to provide some input in the linked SciPy issue. Thanks Martin, always helpful.
Can you make another graph setting OPENBLAS_NUM_THREADS=1? Maybe we will see the breaking point where the two graphs happen to cross.
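For reference, a minimal way to do that single-threaded rerun is to pin the thread count before NumPy/SciPy load the OpenBLAS DLL; the loop is just the benchmark from above:
import os
os.environ["OPENBLAS_NUM_THREADS"] = "1"  # must be set before the BLAS library is loaded

import timeit
import numpy as np
import scipy.linalg as la

for n in range(50, 120, 5):
    a = np.random.rand(n, n)
    t = min(timeit.repeat(lambda: la.lu_factor(a), number=1000, repeat=5)) / 1000
    print(f"n = {n:3d}: {t * 1e6:8.1f} µs per call")
Alternatively, threadpoolctl's threadpool_limits(limits=1, user_api="blas") can toggle this at runtime without restarting the interpreter.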
I have made some interesting progress. I used MKL just to check, and all the jumps disappeared on both machines. Then I returned to OpenBLAS and checked ?copy and ?scal behavior with increasing sizes, and they showed milder versions of such jumps around 150 and 300+.
This, and also the recent issue https://github.com/scipy/scipy/issues/14886, makes me believe that there might be an inaccurate architecture selection somewhere. But this is all speculation, of course. If you have any experiments that I can perform to narrow this down further, please let me know.
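For completeness, the ?copy/?scal sweep was along these lines; this is a rough sketch using the low-level wrappers SciPy exposes, and the sizes and repeat counts here are not the exact ones used:
import timeit
import numpy as np
from scipy.linalg.blas import dcopy, dscal

# Sweep vector lengths and time the raw level-1 BLAS calls;
# milder jumps were observed around n ~ 150 and 300+.
for n in range(50, 500, 25):
    x = np.random.rand(n)
    y = np.empty_like(x)
    t_copy = min(timeit.repeat(lambda: dcopy(x, y), number=10000, repeat=5)) / 10000
    t_scal = min(timeit.repeat(lambda: dscal(1.0001, x), number=10000, repeat=5)) / 10000
    print(f"n = {n:3d}: dcopy {t_copy * 1e9:7.1f} ns   dscal {t_scal * 1e9:7.1f} ns")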
Just stealing what I have learned from the sklearn folks, here is something I have found out about the OpenBLAS versions in play. I'll try to fix that first.
python -m threadpoolctl -i sklearn
Core: Haswell
Core: Haswell
[
{
"user_api": "openmp",
"internal_api": "openmp",
"prefix": "vcomp",
"filepath": "C:\\Users\\Ilhan Polat\\AppData\\Local\\Programs\\Python\\Python39\\Lib\\site-packages\\sklearn\\.libs\\vcomp140.dll",
"version": null,
"num_threads": 8
},
{
"user_api": "blas",
"internal_api": "openblas",
"prefix": "libopenblas",
"filepath": "C:\\Users\\Ilhan Polat\\AppData\\Local\\Programs\\Python\\Python39\\Lib\\site-packages\\numpy-1.22.0.dev0+1754.ge7f773cf4-py3.9-win-amd64.egg\\numpy\\.libs\\libopenblas.J5JINWQ3YOZOQH4CCKPXHDQVK5OI6HXB.gfortran-win_amd64.dll",
"version": "0.3.18",
"threading_layer": "pthreads",
"architecture": "Haswell",
"num_threads": 8
},
{
"user_api": "blas",
"internal_api": "openblas",
"prefix": "libopenblas",
"filepath": "C:\\Users\\Ilhan Polat\\AppData\\Local\\Programs\\Python\\Python39\\Lib\\site-packages\\scipy-1.8.0.dev0+2047.048bac7-py3.9-win-amd64.egg\\scipy\\.libs\\libopenblas.RCXR3FYXQ7KLTLGA2TVNYAUS6DAQNZ6T.gfortran-win_amd64.dll",
"version": "0.3.17",
"threading_layer": "pthreads",
"architecture": "Haswell",
"num_threads": 8
}
]
This has something to do with my compilation on Windows 10. I have a Tiger Lake CPU, and when I compile with MinGW UCRT64 it selects the SkylakeX target, which later leads to strange segfaults that I don't have sufficient debug-fu to understand. Changing to a "generic"-compiled version solves many of my issues, so I'll just close this to reduce clutter.
You could try the Haswell target; that's close enough to SkylakeX and is 5 years older.
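If the build in question was made with DYNAMIC_ARCH=1, the kernel set can also be overridden at runtime through the OPENBLAS_CORETYPE environment variable, which makes it easy to compare targets without rebuilding (a static build would instead need TARGET=HASWELL at compile time). A minimal sketch:
import os
# Only honored by DYNAMIC_ARCH builds, and must be set before the DLL is loaded.
os.environ["OPENBLAS_CORETYPE"] = "Haswell"

import numpy as np
import scipy.linalg as la

a = np.random.rand(200, 200)
la.lu_factor(a)  # should now run on the Haswell kernels instead of SkylakeX
The selected architecture can then be verified with threadpoolctl, as in the output shown above.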