LinearSolve.jl
"Fast" solvers are slow for dense, complex matrix
According to the documentation, `RFLUFactorization` is supposed to be faster than backslash for matrices on the order of a few hundred. I did some quick benchmarking and I'm not finding this to be the case. Am I doing something wrong?
```julia
using LinearSolve, BenchmarkTools

n = 200
A = rand(ComplexF64, n, n)
b = rand(ComplexF64, n)

@btime $A \ $b;  # 474 μs

prob = LinearProblem(copy(A), copy(b))
@btime solve($prob, LUFactorization());  # 477 μs. As expected, similar to `\`

prob = LinearProblem(copy(A), copy(b))
@btime solve($prob);  # 20.4 ms. Not sure what method is used here.

prob = LinearProblem(copy(A), copy(b))
@btime solve($prob, RFLUFactorization());  # 21.8 ms. Forty-some times slower than BLAS LU

prob = LinearProblem(copy(A), copy(b))
method = LinearSolve.FastLUFactorization()
@btime solve($prob, $method);  # 14.9 ms. Better but still way slower than BLAS

using LinearAlgebra
BLAS.get_config()
#=
LinearAlgebra.BLAS.LBTConfig
Libraries:
└ [ILP64] libopenblas64_.so
=#
```
I'm running Julia v1.7.3 and LinearSolve v1.22.1. Also, I found that `FastLUFactorization` is not exported, so I had to invoke it as `LinearSolve.FastLUFactorization()`.
How many Julia threads are open on your system? Show `versioninfo()`.
@chriselrod
LUFact and FastLUFact should have approximately the same time for the initial solve, I think? They're both `getrf`, right?
Those are complex matrices. You could make a PR to add actual complex support to RecursiveFactorization.
Or wait for the LoopVectorization rewrite to be finished. Development is active there, but it will still be a while.
Oh RecursiveFactorization.jl doesn't have a specialization for complex matrices? I didn't know that. We should specialize the default algorithm on that fact and document it then. What exactly is the issue, that it's only fast for Float32 and Float64?
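As a sketch of what that specialization could look like (the helper name `pick_default` is illustrative, not the package's actual internal API), the default could branch on the element type:

```julia
# Illustrative only: route hardware real floats to RecursiveFactorization
# and everything else (e.g. ComplexF64) to plain BLAS LU.
pick_default(A::AbstractMatrix{T}) where {T} =
    T <: Union{Float32, Float64} ? :RFLUFactorization : :LUFactorization

pick_default(rand(4, 4))              # :RFLUFactorization
pick_default(rand(ComplexF64, 4, 4))  # :LUFactorization
```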
8 threads on my system:

```julia
julia> versioninfo()
Julia Version 1.7.3
Commit 742b9abb4d (2022-05-06 12:58 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-9700 CPU @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, skylake)
Environment:
  JULIA_NUM_THREADS = 8
```
> LUFact and FastLUFact should have approximately the same time for the initial solve I think? They're both getrf right?
Yes, they should. I mean, FastLUFact is just an alternative interface to BLAS that skips rebuilding the workspace each time, so it's not merely that it should be as fast as BLAS: it *is* a BLAS call 😅. So that's a pretty confusing result here that needs to get explored a bit more.
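For reference, the workspace reuse in question shows up through the caching interface; here is a minimal sketch using the standard `init`/`solve!` API (exact cache behavior may vary by LinearSolve.jl version):

```julia
using LinearSolve

A = rand(200, 200)
b = rand(200)
prob = LinearProblem(A, b)

# `init` builds the solver cache (factorization + workspace) once
cache = init(prob, LUFactorization())

# The first solve! pays the factorization cost...
sol = solve!(cache)

# ...while later solves on the same cache reuse the factorization,
# which is where an interface that skips rebuilding the workspace can win.
sol = solve!(cache)
```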
I opened a PR that tweaks the default algorithms a bit more: https://github.com/SciML/LinearSolve.jl/pull/160 . Though I think we should figure out what's going on with FastLUFactorization.
> Oh RecursiveFactorization.jl doesn't have a specialization for complex matrices? I didn't know that. We should specialize the default algorithm on that fact and document it then. What exactly is the issue, that it's only fast for Float32 and Float64?
LoopVectorization.jl currently only supports `Union{Bool,Base.HWReal}`.
If we have a special need for it somewhere, I could take the time to hack on support to LoopVectorization.jl, otherwise it's better to wait for the rewrite, where it will come for free.
I'm happy to wait for the rewrite. However, I'm also not seeing the advertised speedup on Float64:

| Algorithm | n = 200 | n = 500 | n = 2000 |
|---|---|---|---|
| Backslash | 0.2 | 1.3 | 29 |
| (Default) | 0.2 | 1.2 | 29 |
| LUFactorization | 0.2 | 1.3 | 29 |
| RFLUFactorization | 0.2 | 1.2 | 48 |
| FastLUFactorization | 0.3 | 2.1 | 53.7 |

`n` is the order of the matrix, and the table entries are times in ms reported by `@btime`.
`length(b) <= 100 || (isopenblas() && length(b) <= 500)`

https://github.com/SciML/LinearSolve.jl/blob/main/src/default.jl#L17
So the default algorithm was selected to be `RFLUFactorization` for the n = 200 and n = 500 cases, and it shouldn't be expected to be competitive for the n = 2000 case? Sounds good. But why isn't it any faster than backslash in the two smaller cases?
I need to investigate what's happening in FastLUFactorization. This should be taking exactly as long as Backslash, until you do a re-solve where it should be faster. Should be easy to find. I wonder if this is also true for Float64.
> But why isn't it any faster than backslash in the two smaller cases?
Can you run the benchmarks? https://github.com/JuliaLinearAlgebra/RecursiveFactorization.jl/blob/master/perf/lu.jl
And share your CPU? It at least was about 2x faster for things we tested https://github.com/JuliaLinearAlgebra/RecursiveFactorization.jl/pull/28
```julia
julia> versioninfo(verbose=true)
Julia Version 1.7.3
Commit 742b9abb4d (2022-05-06 12:58 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
      "Manjaro Linux"
  uname: Linux 5.10.126-1-MANJARO #1 SMP PREEMPT Mon Jun 27 10:02:42 UTC 2022 x86_64 unknown
  CPU: Intel(R) Core(TM) i7-9700 CPU @ 3.00GHz:
           speed       user       nice        sys       idle        irq
    #1  4507 MHz   318248 s      206 s   296408 s  9844998 s     8758 s
    #2  4513 MHz   320355 s      109 s   297091 s  9845778 s     9752 s
    #3  4497 MHz   323075 s      112 s   298174 s  9845949 s     5462 s
    #4  4592 MHz   318678 s      205 s   301067 s  9844768 s     5241 s
    #5  4555 MHz   319502 s      390 s   298411 s  9850884 s     5040 s
    #6  4574 MHz   315547 s      477 s   299586 s  9854946 s     5172 s
    #7  4499 MHz   321453 s       81 s   295622 s  9854642 s     4808 s
    #8  4571 MHz   322751 s      583 s   295157 s  9853703 s     4787 s
  Memory: 62.5269775390625 GB (46829.6015625 MB free)
  Uptime: 1.04880138e6 sec
  Load Avg: 1.15 1.33 0.77
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, skylake)
Environment:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 8
  DRAWHOME = /usr/share/opencascade/resources/DrawResources
  WINDOWPATH = 2
  HOME = /home/simonp
  PATH = /home/simonp/.local/bin:/usr/local/bin:/usr/bin:/var/lib/snapd/snap/bin:/usr/local/sbin:/usr/lib/jvm/default/bin:/usr/bin/site_perl:/usr/bin/vendor_perl:/usr/bin/core_perl
  TERM = xterm-256color
```
```julia
julia> include("perf/lu.jl")
[ Info: 4 × 4
...
[ Info: 500 × 500
```

The created file is `lu_float64_1.7.3_skylake_8cores_OpenBLAS.png`:
Looks like the big advantage of fully recursive factorization only appears for order n <= 180 or so. Still, it's really impressive that a pure Julia factorization is competitive with, or better than, BLAS throughout the full range of the tests.
We should make a version of that which just does all of the LinearSolve.jl options.
I updated a few of the defaults in https://github.com/SciML/LinearSolve.jl/pull/160. I'll keep the cutoff at 500 for now since RF is never "bad", though I did change it to move away from RF if the user doesn't have Julia threads enabled. OpenBLAS performs poorly on some CPUs, so that is a bit safer until sizes get large. But FastLUFactorization will stay out of the defaults until we figure out what's going on there (it's just calling BLAS, so how is it slower 😅).
Thanks for your fast response on this issue. Looking forward to using this for complex matrices (ubiquitous in my work) in the future.
How many threads did you use in your benchmark?
Julia was started with 8 threads. BLAS threading was set by the script:

```julia
nc = min(Int(VectorizationBase.num_cores()), Threads.nthreads())
BLAS.set_num_threads(nc)
```

which looks like it would be 8 as well.
For completeness, here is the result of benchmarking after `using MKL`:
Yup, so that fits the heuristic of the default algorithm:

```julia
length(b) <= 100 || (isopenblas() && length(b) <= 500)
```

If n <= 100, or if using OpenBLAS and n <= 500, use RecursiveFactorization; otherwise use the (provided) BLAS. So I think that's pretty close to what you measure.
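Spelled out as a standalone predicate (the helper name `use_rf` is illustrative, not the package's code):

```julia
# The default-algorithm size heuristic: use RecursiveFactorization when
# the problem is small, with a higher cutoff if the BLAS is OpenBLAS.
use_rf(n::Integer; openblas::Bool) = n <= 100 || (openblas && n <= 500)

use_rf(200; openblas = true)    # true  → RFLUFactorization (matches the n = 200 run)
use_rf(200; openblas = false)   # false → vendor BLAS, e.g. MKL
use_rf(2000; openblas = true)   # false → BLAS either way
```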
I wonder if we could provide MKLLUFactorization which doesn't overload LBT?
Maybe we could even raise the 100; MKL only beat it at n = 160. MKL wins because it is much better at multithreading. I had a plot somewhere where my computer (Cascade Lake) matched MKL up to n = 500.
I'd also be interested in seeing single-threaded plots of both, which would be relevant for any code trying to do solves in parallel, and thus single-threading the `lu`.
> I wonder if we could provide MKLLUFactorization which doesn't overload LBT?
We could, by depending on `MKL_jll` and `ccall`ing directly.
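A hypothetical sketch of what that direct `ccall` could look like. This assumes `libmkl_rt` with its default LP64 (32-bit integer) interface; the symbol name and integer width are assumptions and have not been tested against MKL_jll:

```julia
using MKL_jll  # provides libmkl_rt without going through libblastrampoline

# In-place LU factorization via MKL's LAPACK dgetrf (LP64 interface assumed)
function mkl_getrf!(A::Matrix{Float64})
    m, n = size(A)
    ipiv = Vector{Cint}(undef, min(m, n))
    info = Ref{Cint}(0)
    ccall((:dgetrf_, MKL_jll.libmkl_rt), Cvoid,
          (Ref{Cint}, Ref{Cint}, Ptr{Float64}, Ref{Cint}, Ptr{Cint}, Ref{Cint}),
          m, n, A, max(1, m), ipiv, info)
    info[] == 0 || error("dgetrf returned info = $(info[])")
    return A, ipiv  # A now holds the L and U factors, ipiv the pivots
end
```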
> I'd also be interested in seeing single threaded plots of both

Here is the plot for single-threaded vs OpenBLAS:
and here is the corresponding plot for MKL:
Let's take benchmarking discussions to https://github.com/SciML/LinearSolve.jl/issues/166 and MKL to https://github.com/SciML/LinearSolve.jl/issues/164, and keep this issue for FastLUFactorization being slow.
I'm almost done with some similar runs on Windows. Want me to put the plots in a comment in #166?
Yes please!