LinearSolve.jl
Benchmarking Scripts to tune the default algorithm choices
We should put together a benchmark script and have a bunch of people run it. It should just run LUFactorization, RFLUFactorization, and FastLUFactorization (and MKLFactorization when that exists).
It would be nice for this to have an option for what kind of matrix is generated as a function of some N, so for example it can be used to generate the matrices from the Brusselator equation for testing the sparse factorizations.
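A minimal sketch of what such a script could look like, using LinearSolve.jl's `LinearProblem`/`solve` API and BenchmarkTools. The `matgen` function, the size grid, and the sample counts are placeholders; the idea is that `matgen` can be swapped for e.g. a Brusselator Jacobian to exercise the sparse factorizations:

```julia
using LinearSolve, BenchmarkTools
using Random: MersenneTwister

# Placeholder matrix generator: dense random by default. Swap this out for
# e.g. a Brusselator discretization to benchmark sparse factorizations.
matgen(N) = rand(MersenneTwister(123), N, N)

function bench(Ns, algs)
    # One timing vector per algorithm, keyed by the algorithm's type name.
    times = Dict(nameof(typeof(alg)) => Float64[] for alg in algs)
    for N in Ns
        A = matgen(N)
        b = rand(MersenneTwister(321), N)
        prob = LinearProblem(A, b)
        for alg in algs
            t = @belapsed solve($prob, $alg) samples=5 evals=1
            push!(times[nameof(typeof(alg))], t)
        end
    end
    return times
end

# Add FastLUFactorization, MKL-backed algorithms, etc. here as they exist.
times = bench([4, 16, 64], [LUFactorization(), RFLUFactorization()])
```

Collecting `times` into a DataFrame and plotting with VegaLite, as perf/lu.jl already does, would make the crowd-sourced results easy to compare.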
This is a continuation of the benchmarking I presented in #159. There, I presented results from running the perf/lu.jl script of the RecursiveFactorization package on a Linux desktop machine. I repeat that exercise here, after correcting a bug in the script (see this PR). The results below are for a Windows desktop machine with the following configuration:
(RecursiveFactorization) pkg> status
Project RecursiveFactorization v0.2.11
Status `D:\peter\Documents\julia\dev\RecursiveFactorization\Project.toml`
[a93c6f00] DataFrames v1.3.4
[bdcacae8] LoopVectorization v0.12.120
[33e6dc65] MKL v0.5.0
[f517fe37] Polyester v0.6.13
[7792a7ef] StrideArraysCore v0.3.15
[d5829a12] TriangularSolve v0.1.12
[3d5dd08c] VectorizationBase v0.21.42
[112f6efa] VegaLite v2.6.0
[37e2e46d] LinearAlgebra
julia> versioninfo(verbose=true)
Julia Version 1.7.3
Commit 742b9abb4d (2022-05-06 12:58 UTC)
Platform Info:
OS: Windows (x86_64-w64-mingw32)
Microsoft Windows [Version 10.0.22000.795]
CPU: Intel(R) Core(TM) i7-9700 CPU @ 3.00GHz:
speed user nice sys idle irq
#1 3000 MHz 2431453 0 19001781 516223109 13460718 ticks
#2 3000 MHz 5596203 0 3129484 528930375 187468 ticks
#3 3000 MHz 3831859 0 3086890 530737312 40078 ticks
#4 3000 MHz 3733187 0 1884296 532038578 28156 ticks
#5 3000 MHz 2444562 0 2263937 532947562 36484 ticks
#6 3000 MHz 2220062 0 1327734 534108265 28578 ticks
#7 3000 MHz 2176875 0 1391343 534087843 28593 ticks
#8 3000 MHz 2917562 0 1682937 533055546 53328 ticks
Memory: 31.85821533203125 GB (22317.58984375 MB free)
Uptime: 537656.0 sec
Load Avg: 0.0 0.0 0.0
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-12.0.1 (ORCJIT, skylake)
Environment:
JULIA_EDITOR = runemacs.exe
CHOCOLATEYLASTPATHUPDATE = 132198172845121191
HOME = D:\peter\Documents
HOMEDRIVE = C:
HOMEPATH = \Users\peter
MIC_LD_LIBRARY_PATH = C:\Program Files (x86)\Common Files\Intel\Shared Libraries\compiler\lib\intel64_win_mic
PATH = C:\Program Files\ImageMagick-7.1.0-Q16-HDRI;… (remainder of very long PATH omitted)
PATHEXT = .COM;.EXE;.BAT;.CMD;.VBS;.VBE;.JS;.JSE;.WSF;.WSH;.MSC;.JL;.CPL
PSMODULEPATH = D:\peter\Documents\WindowsPowerShell\Modules;C:\Program Files\WindowsPowerShell\Modules;C:\WINDOWS\system32\WindowsPowerShell\v1.0\Modules
First, the result of running the script after starting Julia with -t 8 and with the default OpenBLAS BLAS:
OpenBLAS performance seems incredibly bad!
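For reference, switching Julia's BLAS backend to MKL is just a package load (routed through libblastrampoline on Julia ≥ 1.7), and `BLAS.get_config()` shows which backend is active before benchmarking. A small sketch, assuming MKL.jl has been added to the environment:

```julia
using LinearAlgebra

# After `]add MKL`, uncommenting the next line forwards every BLAS/LAPACK
# call to MKL through libblastrampoline for the rest of the session:
# using MKL

# Confirm which BLAS/LAPACK libraries are actually loaded.
cfg = BLAS.get_config()
println(cfg.loaded_libs)
```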
Next, the 8-core result after using MKL:
[benchmark plot: MKL, 8 Julia threads]
Next, the OpenBLAS result after starting Julia with -t 1:
[benchmark plot: OpenBLAS, 1 Julia thread]
And finally, the MKL result after starting Julia with -t 1:
[benchmark plot: MKL, 1 Julia thread]
Out of curiosity, I commented out the line `BLAS.set_num_threads(nc)` in the script
and restarted Julia with -t 8 using OpenBLAS. The result is:
[benchmark plot: OpenBLAS, 8 Julia threads, BLAS thread count left at default]
That didn't seem to help.
Ah, regarding what I said earlier about using RF with 1 vs. multiple threads: it doesn't benefit much from threading, but unlike OpenBLAS, Polyester won't weave a noose to hang itself with given enough threads.
OpenBLAS is dramatically faster with BLAS.set_num_threads(1) over this size range on most recent computers with many cores.
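That claim is easy to check on any machine. A hedged micro-benchmark (the sizes, repetition count, and use of `@elapsed` are arbitrary choices, not part of the perf/lu.jl script):

```julia
using LinearAlgebra

# Average wall time of lu() over a few repetitions at the current
# BLAS thread count.
function lu_time(N)
    A = rand(N, N)
    lu(A)  # warm up / compile
    t = @elapsed begin
        for _ in 1:20
            lu(A)
        end
    end
    return t / 20
end

# Compare many BLAS threads vs. one over the small-matrix range where
# OpenBLAS's threading heuristics tend to backfire.
for N in (50, 100, 200)
    BLAS.set_num_threads(Sys.CPU_THREADS)
    t_many = lu_time(N)
    BLAS.set_num_threads(1)
    t_one = lu_time(N)
    println("N=$N  threads=$(Sys.CPU_THREADS): $(t_many)s   threads=1: $(t_one)s")
end
```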
These days, should we ever go back to OpenBLAS? There's so many cases where it's just... bad.
What do you get on your AMD machine? I'm guessing MKL is still much better there for lu?
For Apple silicon, we could use Accelerate. It does quite well with a single core:
[benchmark plot: Accelerate, single core]
But that single core gets to use Apple's single matrix-multiply unit.
With 4 cores:
OpenBLAS actually wins.
I should double check if Accelerate can benefit from multiple cores on the M1; maybe I just didn't set it. While it has only a single matrix multiplier, there's probably a lot of other things that can be done on the cores.
This was with 4 cores on a mac mini. I'm guessing OpenBLAS would hang itself again on an M1-max/ultra, given 8 threads.
RF wins below 100x100, at least.
@YingboMa and I should look into an algorithm better suited to threading.
How do you make use of Accelerate?
https://github.com/chriselrod/AppleAccelerateLinAlgWrapper.jl
I gave it a deliberately bad name to avoid stealing a good one, hoping someone would create a nicer package that can be used in some way other than ccall.
But we probably want to just ccall anyway without otherwise changing BLAS's behavior here.
I also didn't bother wrapping anything other than what I wanted to test (matmul, lu, ldiv, and rdiv).
We might as well add it to LinearSolve.jl if we can do it in a way that doesn't hit LibBLASTrampoline.
Currently, a problem with it is that despite hitting LibBLASTrampoline, it doesn't actually replace any existing methods, so we still need to use ccall manually to test anything.
But, yeah, we should probably just call Accelerate directly to reduce the risk of something going wrong. The trampoline may also introduce a small amount of overhead, i.e. an extra indirect call?
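A sketch of what a direct ccall into Accelerate could look like, modeled on the dgetrf route that AppleAccelerateLinAlgWrapper takes. This is macOS-only; the framework path is an assumption, and note that Accelerate's LP64 symbols take 32-bit integers, unlike Julia's default ILP64 BLAS:

```julia
using LinearAlgebra

# Path to Accelerate's LAPACK; only meaningful on macOS.
const libacc = "/System/Library/Frameworks/Accelerate.framework/Accelerate"

# In-place LU with partial pivoting via Accelerate's dgetrf_, bypassing
# libblastrampoline entirely. Int32 arguments match Accelerate's LP64 ABI.
function accelerate_getrf!(A::Matrix{Float64})
    m, n = size(A)
    ipiv = Vector{Int32}(undef, min(m, n))
    info = Ref{Int32}(0)
    ccall((:dgetrf_, libacc), Cvoid,
          (Ref{Int32}, Ref{Int32}, Ptr{Float64}, Ref{Int32},
           Ptr{Int32}, Ref{Int32}),
          m, n, A, max(1, m), ipiv, info)
    info[] == 0 || error("dgetrf_ failed with info = $(info[])")
    return A, ipiv
end

Sys.isapple() && accelerate_getrf!(rand(4, 4))
```

Wrapping this in a LinearSolve algorithm would sidestep the trampoline question entirely, at the cost of maintaining the ccall signatures by hand.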
Why does OpenBLAS perform so much worse on Windows than on Linux?