
Benchmarking Scripts to tune the default algorithm choices

Open · ChrisRackauckas opened this issue 3 years ago · 10 comments

We should put together a benchmark script and have a bunch of people run it. It should just run LUFactorization, RFLUFactorization, and FastLUFactorization (and MKLFactorization when that exists).

It would be nice for this to have an option for what kind of matrix is generated as a function of some N, so for example it can be used to generate the matrices from the Brusselator equation for testing the sparse factorizations.
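A minimal sketch of what such a harness might look like: the matrix generator is passed in as a function of N, so dense and structured (e.g. Brusselator-like banded) cases can share one timing loop. All names here are illustrative placeholders, not the actual LinearSolve.jl benchmark script or its API.

```julia
using LinearAlgebra, SparseArrays

# Dense, well-conditioned test matrix of size N×N.
dense_matrix(N) = rand(N, N) + N * I

# Banded sparse matrix resembling a 1D discretization stencil, standing in
# for the sparsity structure of something like the Brusselator Jacobian.
function stencil_matrix(N)
    spdiagm(-1 => fill(-1.0, N - 1),
             0 => fill(2.0 + N, N),
             1 => fill(-1.0, N - 1))
end

# Time one factorization routine over a range of sizes.
function bench(factorize, genmat, Ns)
    times = Float64[]
    for N in Ns
        A = genmat(N)
        factorize(copy(A))                      # warm up / compile
        push!(times, @elapsed factorize(copy(A)))
    end
    times
end

ts = bench(lu, dense_matrix, [16, 64, 256])
```

Swapping `lu` for each candidate (RF's `lu!`, a BLAS-backed LU, etc.) and `dense_matrix` for `stencil_matrix` would cover the cases described above.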

ChrisRackauckas avatar Jul 19 '22 00:07 ChrisRackauckas

This is a continuation of the benchmarking I presented in #159. There, I presented the results of running the perf/lu.jl script of the RecursiveFactorization package for a Linux desktop machine. I repeat that exercise here, after correcting a bug in the script (see this PR). The results below are for a Windows desktop machine with the following configuration:

(RecursiveFactorization) pkg> status
     Project RecursiveFactorization v0.2.11
      Status `D:\peter\Documents\julia\dev\RecursiveFactorization\Project.toml`
  [a93c6f00] DataFrames v1.3.4
  [bdcacae8] LoopVectorization v0.12.120
  [33e6dc65] MKL v0.5.0
  [f517fe37] Polyester v0.6.13
  [7792a7ef] StrideArraysCore v0.3.15
  [d5829a12] TriangularSolve v0.1.12
  [3d5dd08c] VectorizationBase v0.21.42
  [112f6efa] VegaLite v2.6.0
  [37e2e46d] LinearAlgebra

julia> versioninfo(verbose=true)
Julia Version 1.7.3
Commit 742b9abb4d (2022-05-06 12:58 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
      Microsoft Windows [Version 10.0.22000.795]
  CPU: Intel(R) Core(TM) i7-9700 CPU @ 3.00GHz:
              speed         user         nice          sys         idle          irq
       #1  3000 MHz    2431453            0     19001781    516223109     13460718  ticks
       #2  3000 MHz    5596203            0      3129484    528930375       187468  ticks
       #3  3000 MHz    3831859            0      3086890    530737312        40078  ticks
       #4  3000 MHz    3733187            0      1884296    532038578        28156  ticks
       #5  3000 MHz    2444562            0      2263937    532947562        36484  ticks
       #6  3000 MHz    2220062            0      1327734    534108265        28578  ticks
       #7  3000 MHz    2176875            0      1391343    534087843        28593  ticks
       #8  3000 MHz    2917562            0      1682937    533055546        53328  ticks

  Memory: 31.85821533203125 GB (22317.58984375 MB free)
  Uptime: 537656.0 sec
  Load Avg:  0.0  0.0  0.0
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, skylake)
Environment:
  JULIA_EDITOR = runemacs.exe
  CHOCOLATEYLASTPATHUPDATE = 132198172845121191
  HOME = D:\peter\Documents
  HOMEDRIVE = C:
  HOMEPATH = \Users\peter
  MIC_LD_LIBRARY_PATH = C:\Program Files (x86)\Common Files\Intel\Shared Libraries\compiler\lib\intel64_win_mic
  PATH = C:\Program Files\ImageMagick-7.1.0-Q16-HDRI;C:\Program Files (x86)\IntelSWTools\compilers_and_libraries_2020.1.216\windows\mpi\intel64\bin;C:\windows\system32;C:\windows;C:\windows\System32\Wbem;C:\windows\System32\WindowsPowerShell\v1.0\;C:\windows\System32\OpenSSH\;C:\Program Files (x86)\NVIDIA Corporation\PhysX\Common;C:\Program Files\NVIDIA Corporation\NVIDIA NvDLISR;C:\Program Files (x86)\Common Files\Oracle\Java\javapath;C:\Program Files (x86)\Common Files\Intel\Shared Libraries\redist\intel64_win\mpirt;C:\Program Files (x86)\Common Files\Intel\Shared Libraries\redist\intel64_win\compiler;C:\Program Files (x86)\Common Files\Intel\Shared Libraries\redist\ia32_win\mpirt;C:\Program Files (x86)\Common Files\Intel\Shared Libraries\redist\ia32_win\compiler;C:\ProgramData\Oracle\Java\javapath;C:\Program Files (x86)\Common Files\Intel\Shared Libraries\redist\intel64\mpirt;C:\Program Files (x86)\Common Files\Intel\Shared Libraries\redist\intel64\compiler;C:\Program Files (x86)\Common Files\Intel\Shared Libraries\redist\ia32\mpirt;C:\Program Files (x86)\Common Files\Intel\Shared Libraries\redist\ia32\compiler;C:\Program Files (x86)\Common Files\Microsoft Shared\VSA\10.0\VsaEnv;C:\Program Files\Common Files\Microsoft Shared\Windows Live;C:\Program Files (x86)\Common Files\Microsoft Shared\Windows Live;C:\Program Files\MiKTeX 2.9\miktex\bin\x64;C:\Windows\twain_32\MP830;C:\WINDOWS\system32;C:\WINDOWS;C:\WINDOWS\System32\Wbem;c:\Program Files (x86)\ATI Technologies\ATI.ACE\Core-Static;C:\Program Files (x86)\Windows Live\Shared;C:\Program Files\gs\gs8.64\bin;C:\WINDOWS\System32\WindowsPowerShell\v1.0\;C:\Program Files\Calibre2\;C:\Program Files\Microsoft\Web Platform Installer\;C:\Program Files (x86)\Microsoft ASP.NET\ASP.NET Web Pages\v1.0\;C:\Program Files\Microsoft SQL Server\110\Tools\Binn\;C:\Program Files (x86)\Git\cmd;C:\Program Files (x86)\Git\bin;C:\Program Files\TortoiseGit\bin;C:\ProgramData\chocolatey\bin;C:\Program Files\MATLAB\R2022a\bin;C:\Program 
Files (x86)\Calibre2\;C:\Program Files\Microsoft VS Code\bin;C:\WINDOWS\System32\WindowsPowerShell\v1.0\;C:\WINDOWS\System32\OpenSSH\;C:\Program Files\Git\cmd;C:\Program Files\nodejs\;C:\Users\peter\AppData\Local\Microsoft\WindowsApps;c:\usr\local\bin;C:\Users\peter\AppData\Local\Programs\MiKTeX 2.9\miktex\bin\x64\;C:\Users\peter\AppData\Local\GitHubDesktop\bin;C:\Users\peter\AppData\Local\Pandoc\;C:\cygwin64\usr\i686-w64-mingw32\sys-root\mingw\lib;C:\Program Files (x86)\Aspell\bin;C:\Users\peter\AppData\Local\gitkraken\bin;C:\Users\peter\AppData\Roaming\npm;C:\Users\peter\AppData\Local\Microsoft\WindowsApps
  PATHEXT = .COM;.EXE;.BAT;.CMD;.VBS;.VBE;.JS;.JSE;.WSF;.WSH;.MSC;.JL;.CPL
  PSMODULEPATH = D:\peter\Documents\WindowsPowerShell\Modules;C:\Program Files\WindowsPowerShell\Modules;C:\WINDOWS\system32\WindowsPowerShell\v1.0\Modules

First, the result of running the script after starting Julia with -t 8 and with the default OpenBLAS BLAS:

[plot: lu_float64_1.7.3_skylake_8cores_OpenBLAS]

OpenBLAS performance seems incredibly bad!

Next, the 8-core result after using MKL:

[plot: lu_float64_1.7.3_skylake_8cores_MKL]

Next, the OpenBLAS result after starting Julia with -t 1:

[plot: lu_float64_1.7.3_skylake_1cores_OpenBLAS]

And finally, the MKL result after starting Julia with -t 1:

[plot: lu_float64_1.7.3_skylake_1cores_MKL]

simonp0420 avatar Jul 19 '22 03:07 simonp0420

Out of curiosity I commented out the line in the script

#BLAS.set_num_threads(nc)

and restarted Julia with -t 8 using OpenBLAS. The result is:

[plot: new_lu_float64_1.7.3_skylake_8cores_OpenBLAS]

That didn't seem to help.

simonp0420 avatar Jul 19 '22 03:07 simonp0420

Ah, re what I said earlier about using RF with 1 vs. multiple threads: it doesn't benefit much from threading, but unlike OpenBLAS, Polyester won't weave a noose to hang itself with, given enough threads.

OpenBLAS is dramatically faster with BLAS.set_num_threads(1) over this size range on most recent computers with many cores.
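The effect can be checked directly from the REPL; a minimal sketch (the actual timings will of course vary by machine and matrix size):

```julia
using LinearAlgebra

A = rand(500, 500)
lu(copy(A))                                  # warm up / compile

default_threads = BLAS.get_num_threads()
t_multi = @elapsed lu(copy(A))               # default BLAS thread count

BLAS.set_num_threads(1)
lu(copy(A))                                  # re-warm with the new setting
t_single = @elapsed lu(copy(A))              # single BLAS thread

BLAS.set_num_threads(default_threads)        # restore the default

println("multi-threaded: $t_multi s, single-threaded: $t_single s")
```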

chriselrod avatar Jul 19 '22 03:07 chriselrod

These days, should we ever go back to OpenBLAS? There's so many cases where it's just... bad.

ChrisRackauckas avatar Jul 19 '22 03:07 ChrisRackauckas

These days, should we ever go back to OpenBLAS? There's so many cases where it's just... bad.

What do you get on your AMD machine? I'm guessing MKL is still much better there for lu?

For Apple silicon, we could use Accelerate. It does quite well with a single core:

[plot: lu_float64_1.9.0-DEV.623_apple-m1_1cores_OpenBLAS]

But that single core gets to use their single matrix multiplier. With 4 cores:

[plot: lu_float64_1.9.0-DEV.623_apple-m1_4cores_OpenBLAS]

OpenBLAS actually wins.

I should double check if Accelerate can benefit from multiple cores on the M1; maybe I just didn't set it. While it has only a single matrix multiplier, there's probably a lot of other things that can be done on the cores.

This was with 4 cores on a mac mini. I'm guessing OpenBLAS would hang itself again on an M1-max/ultra, given 8 threads.

RF wins below 100x100, at least.
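The tuning this issue asks for would ultimately feed a size-based default. A toy sketch of that dispatch, with a hypothetical crossover at 100 (the real threshold is exactly what the benchmarks here would determine); the backends are placeholders, not the actual LinearSolve.jl algorithm types:

```julia
using LinearAlgebra

small_lu(A) = lu(A)   # stand-in for RFLUFactorization (recursive, pure-Julia)
large_lu(A) = lu(A)   # stand-in for a BLAS/MKL-backed LUFactorization

# Dispatch on problem size; 100 is a hypothetical crossover, not a measured one.
choose_lu(A) = size(A, 1) < 100 ? small_lu(A) : large_lu(A)

A = rand(50, 50) + 50I
b = rand(50)
x = choose_lu(A) \ b
```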

@YingboMa and I should look into an algorithm better suited to threading.

chriselrod avatar Jul 19 '22 04:07 chriselrod

How do you make use of Accelerate?

ChrisRackauckas avatar Jul 19 '22 04:07 ChrisRackauckas

https://github.com/chriselrod/AppleAccelerateLinAlgWrapper.jl I gave it a deliberately bad name to avoid stealing a good one, hoping someone would create a nicer package that can be used in some way other than ccall. But we probably want to just ccall anyway without otherwise changing BLAS's behavior here.
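The wrapper essentially points plain LAPACK calls at Accelerate's library instead of the loaded BLAS. The shape of such a call, sketched here against whatever LAPACK Julia already has loaded (on macOS the wrapper targets Accelerate's dylib instead):

```julia
using LinearAlgebra
using LinearAlgebra: LAPACK

# LU via a direct LAPACK getrf! call, bypassing the lu() front end.
A = rand(4, 4) + 4I
Afact, ipiv, info = LAPACK.getrf!(copy(A))
@assert info == 0          # zero means the factorization succeeded

# Solve A x = b from the raw factors via getrs!.
b = rand(4)
x = LAPACK.getrs!('N', Afact, ipiv, copy(b))
@assert A * x ≈ b
```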

I also didn't bother wrapping anything other than what I wanted to test (matmul, lu, ldiv, and rdiv).

chriselrod avatar Jul 19 '22 04:07 chriselrod

We might as well add it to LinearSolve.jl if we can do it in a way that doesn't hit LibBLASTrampoline.

ChrisRackauckas avatar Jul 19 '22 04:07 ChrisRackauckas

We might as well add it to LinearSolve.jl if we can do it in a way that doesn't hit LibBLASTrampoline.

Currently, a problem with it is that despite hitting LibBLASTrampoline, it doesn't actually replace any existing methods, so we still need to manually use ccall to test anything.

But, yeah, we should probably just ccall Accelerate directly to reduce the risk of something going wrong. The trampoline may also introduce a small amount of overhead, i.e. an extra call.

chriselrod avatar Jul 19 '22 04:07 chriselrod

Why does OpenBLAS perform so much worse on Windows than on Linux?

simonp0420 avatar Jul 19 '22 13:07 simonp0420