
"Fast" solvers are slow for dense, complex matrix

Open · simonp0420 opened this issue on Jul 17, 2022 · 25 comments

According to the documentation, RFLUFactorization is supposed to be faster than backslash for matrices on the order of a few hundred. I did some quick benchmarking and I'm not finding this to be the case. Am I doing something wrong?

using LinearSolve, BenchmarkTools

n = 200
A = rand(ComplexF64, n, n)
b = rand(ComplexF64, n)

@btime $A\$b ; # 474 μs

prob = LinearProblem(copy(A), copy(b))
@btime solve($prob, LUFactorization()); # 477 μs. As expected, similar to `\`

prob = LinearProblem(copy(A), copy(b))
@btime solve($prob); # 20.4 ms.  Not sure what method is used here.

prob = LinearProblem(copy(A), copy(b))
@btime solve($prob, RFLUFactorization()); # 21.8 ms.  Forty-some times slower than BLAS LU

prob = LinearProblem(copy(A), copy(b))
method = LinearSolve.FastLUFactorization()
@btime solve($prob, $method); # 14.9 ms .  Better but still way slower than BLAS

using LinearAlgebra
BLAS.get_config()
#=
LinearAlgebra.BLAS.LBTConfig
Libraries: 
└ [ILP64] libopenblas64_.so
=#

I'm running Julia v1.7.3 and LinearSolve v1.22.1.

Also, I found that FastLUFactorization is not exported, so I had to invoke it as LinearSolve.FastLUFactorization().

simonp0420 avatar Jul 17 '22 02:07 simonp0420
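
(Side note on the "Not sure what method is used here" comment above: a quick way to check is to inspect the returned solution, assuming the returned LinearSolution object exposes the chosen algorithm as sol.alg, which is true for recent SciMLBase versions but worth verifying on the version used here.)

using LinearSolve

A = rand(ComplexF64, 200, 200); b = rand(ComplexF64, 200)
prob = LinearProblem(A, b)
sol = solve(prob)   # default algorithm selection
sol.alg             # which factorization was actually chosen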

How many Julia threads are you running on your system? Please show versioninfo().

ChrisRackauckas avatar Jul 17 '22 08:07 ChrisRackauckas
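
(For reference, the relevant thread counts can be reported with standard calls; BLAS.get_num_threads is available since Julia 1.6.)

using LinearAlgebra

versioninfo()
Threads.nthreads()        # number of Julia threads
BLAS.get_num_threads()    # number of BLAS threads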

@chriselrod

ChrisRackauckas avatar Jul 17 '22 08:07 ChrisRackauckas

LUFact and FastLUFact should have approximately the same time for the initial solve, I think? They're both getrf, right?

rayegun avatar Jul 17 '22 13:07 rayegun

Those are complex matrices. You could make a PR to add actual complex support to RecursiveFactorization.

Or wait for the LoopVectorization rewrite to be finished. Development is active there, but it will still be a while.

chriselrod avatar Jul 17 '22 14:07 chriselrod

Oh RecursiveFactorization.jl doesn't have a specialization for complex matrices? I didn't know that. We should specialize the default algorithm on that fact and document it then. What exactly is the issue, that it's only fast for Float32 and Float64?

ChrisRackauckas avatar Jul 17 '22 14:07 ChrisRackauckas
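
(A minimal sketch of what such an eltype-based specialization could look like on the user side, using a hypothetical helper name pick_lu that is not part of LinearSolve's API.)

using LinearSolve

# Hypothetical helper: use RecursiveFactorization only for the real element
# types LoopVectorization currently handles; fall back to BLAS LU otherwise.
pick_lu(A) = eltype(A) <: Union{Float32, Float64} ? RFLUFactorization() : LUFactorization()

A = rand(ComplexF64, 200, 200); b = rand(ComplexF64, 200)
solve(LinearProblem(A, b), pick_lu(A))   # dispatches to LUFactorization here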

8 threads on my system:

julia> versioninfo()
Julia Version 1.7.3
Commit 742b9abb4d (2022-05-06 12:58 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-9700 CPU @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, skylake)
Environment:
  JULIA_NUM_THREADS = 8

simonp0420 avatar Jul 17 '22 14:07 simonp0420

LUFact and FastLUFact should have approximately the same time for the initial solve, I think? They're both getrf, right?

Yes, they should. FastLUFact is just an alternative interface to BLAS that skips rebuilding the workspace each time, so it's not just that it should be as fast as BLAS, it literally is a BLAS call 😅. So that's a pretty confusing result that needs to be explored a bit more.

I opened a PR that tweaks the default algorithms a bit more: https://github.com/SciML/LinearSolve.jl/pull/160 . Though I think we should figure out what's going on with FastLUFactorization.

ChrisRackauckas avatar Jul 17 '22 14:07 ChrisRackauckas

Oh RecursiveFactorization.jl doesn't have a specialization for complex matrices? I didn't know that. We should specialize the default algorithm on that fact and document it then. What exactly is the issue, that it's only fast for Float32 and Float64?

LoopVectorization.jl currently only supports Union{Bool,Base.HWReal}.

If we have a special need for it somewhere, I could take the time to hack support into LoopVectorization.jl; otherwise it's better to wait for the rewrite, where it will come for free.

chriselrod avatar Jul 18 '22 02:07 chriselrod
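
(For reference, the restriction above can be checked directly against the Union quoted there; a quick illustration, not a LoopVectorization API.)

# Base.HWReal is the Union of the hardware integer and floating-point types.
supported_eltype(T) = T <: Union{Bool, Base.HWReal}

supported_eltype(Float64)     # true
supported_eltype(ComplexF64)  # false, hence the slow path for complex matrices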

I'm happy to wait for the rewrite. However, I'm also not seeing the advertised speedup on Float64:

Algorithm             n = 200   n = 500   n = 2000
Backslash               0.2       1.3       29
(Default)               0.2       1.2       29
LUFactorization         0.2       1.3       29
RFLUFactorization       0.2       1.2       48
FastLUFactorization     0.3       2.1       53.7

n is the order of the matrix and the table entries are times in msec reported by @btime.

simonp0420 avatar Jul 18 '22 03:07 simonp0420
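
(A sketch of the kind of loop that produces such a table; timings are machine-dependent, and the same pattern covers the unexported LinearSolve.FastLUFactorization() as well.)

using LinearSolve, BenchmarkTools, LinearAlgebra

for n in (200, 500, 2000)
    A = rand(n, n); b = rand(n)
    t_bs = @belapsed $A \ $b
    println("n = $n  backslash: $(round(1e3 * t_bs, digits = 1)) ms")
    for alg in (LUFactorization(), RFLUFactorization())
        prob = LinearProblem(copy(A), copy(b))
        t = @belapsed solve($prob, $alg)
        println("n = $n  $(nameof(typeof(alg))): $(round(1e3 * t, digits = 1)) ms")
    end
end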

&& length(b) <= 500

https://github.com/SciML/LinearSolve.jl/blob/main/src/default.jl#L17

ChrisRackauckas avatar Jul 18 '22 03:07 ChrisRackauckas

So the default algorithm was selected to be RFLUFactorization for the n = 200 and n=500 cases and it shouldn't be expected to be competitive for the n = 2000 case? Sounds good. But why isn't it any faster than backslash in the two smaller cases?

simonp0420 avatar Jul 18 '22 03:07 simonp0420

I need to investigate what's happening in FastLUFactorization. This should take exactly as long as backslash until you do a re-solve, where it should be faster. Should be easy to find. I wonder if this is also true for Float64.

rayegun avatar Jul 18 '22 03:07 rayegun
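
(For context, the re-solve case is what the caching interface targets: factor once, then reuse the factorization for new right-hand sides. A minimal sketch using the caching API as currently documented by LinearSolve, with init / solve! and direct assignment of b; the exact calls may differ in the 1.x version discussed here.)

using LinearSolve

A = rand(200, 200); b1 = rand(200); b2 = rand(200)
cache = init(LinearProblem(A, b1), LUFactorization())
sol1 = solve!(cache)   # pays for the factorization
cache.b = b2           # swap in a new right-hand side
sol2 = solve!(cache)   # reuses the factorization, so much cheaper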

But why isn't it any faster than backslash in the two smaller cases?

Can you run the benchmarks? https://github.com/JuliaLinearAlgebra/RecursiveFactorization.jl/blob/master/perf/lu.jl

And share your CPU? It at least was about 2x faster for things we tested https://github.com/JuliaLinearAlgebra/RecursiveFactorization.jl/pull/28

ChrisRackauckas avatar Jul 18 '22 11:07 ChrisRackauckas

julia> versioninfo(verbose=true)
Julia Version 1.7.3
Commit 742b9abb4d (2022-05-06 12:58 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
      "Manjaro Linux"
  uname: Linux 5.10.126-1-MANJARO #1 SMP PREEMPT Mon Jun 27 10:02:42 UTC 2022 x86_64 unknown
  CPU: Intel(R) Core(TM) i7-9700 CPU @ 3.00GHz: 
              speed         user         nice          sys         idle          irq
       #1  4507 MHz     318248 s        206 s     296408 s    9844998 s       8758 s
       #2  4513 MHz     320355 s        109 s     297091 s    9845778 s       9752 s
       #3  4497 MHz     323075 s        112 s     298174 s    9845949 s       5462 s
       #4  4592 MHz     318678 s        205 s     301067 s    9844768 s       5241 s
       #5  4555 MHz     319502 s        390 s     298411 s    9850884 s       5040 s
       #6  4574 MHz     315547 s        477 s     299586 s    9854946 s       5172 s
       #7  4499 MHz     321453 s         81 s     295622 s    9854642 s       4808 s
       #8  4571 MHz     322751 s        583 s     295157 s    9853703 s       4787 s
       
  Memory: 62.5269775390625 GB (46829.6015625 MB free)
  Uptime: 1.04880138e6 sec
  Load Avg:  1.15  1.33  0.77
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, skylake)
Environment:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 8
  DRAWHOME = /usr/share/opencascade/resources/DrawResources
  WINDOWPATH = 2
  HOME = /home/simonp
  PATH = /home/simonp/.local/bin:/usr/local/bin:/usr/bin:/var/lib/snapd/snap/bin:/usr/local/sbin:/usr/lib/jvm/default/bin:/usr/bin/site_perl:/usr/bin/vendor_perl:/usr/bin/core_perl
  TERM = xterm-256color

julia> include("perf/lu.jl")
[ Info: 4 × 4
...
[ Info: 500 × 500

The created file is "lu_float64_1.7.3_skylake_8cores_OpenBLAS.png" (plot attached).

Looks like the big advantage of fully recursive factorization is only for order n <= 180 or so. Still, it's really impressive that a pure Julia factorization is competitive with or better than BLAS throughout the full range of the tests.

simonp0420 avatar Jul 18 '22 16:07 simonp0420

We should make a version of that benchmark that just does all of the LinearSolve.jl options.

I updated a few of the defaults in https://github.com/SciML/LinearSolve.jl/pull/160. I'll keep the cutoff at 500 for now since RF is never "bad", though I did change it to move away from RF if the user doesn't have Julia threads enabled. OpenBLAS performs badly on some CPUs, so that choice is a bit safer until the sizes get large. But FastLUFactorization will stay out of the defaults until we figure out what's going on there (it's just calling BLAS, so how is it slower 😅).

ChrisRackauckas avatar Jul 18 '22 16:07 ChrisRackauckas
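
(An illustrative sketch of the extra guard described above, not the actual default.jl code: only prefer the pure-Julia RF path when Julia threads are actually available.)

using LinearSolve

prefer_rf(b) = Threads.nthreads() > 1 && length(b) <= 500
alg = prefer_rf(rand(200)) ? RFLUFactorization() : LUFactorization()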

Thanks for your fast response on this issue. Looking forward to using this for complex matrices (ubiquitous in my work) in the future.

simonp0420 avatar Jul 18 '22 16:07 simonp0420

How many threads did you use in your benchmark?

chriselrod avatar Jul 18 '22 17:07 chriselrod

Julia was started with 8 threads. BLAS threading was set by the script:

using VectorizationBase, LinearAlgebra

nc = min(Int(VectorizationBase.num_cores()), Threads.nthreads())
BLAS.set_num_threads(nc)

which looks like it would be 8 as well.

simonp0420 avatar Jul 18 '22 19:07 simonp0420

For completeness, here is the result of benchmarking after switching to MKL (plot attached: lu_float64_1.7.3_skylake_8cores_MKL.png).

simonp0420 avatar Jul 18 '22 19:07 simonp0420

Yup so that fits the heuristic of the default algorithm:

(length(b) <= 100 || (isopenblas() && length(b) <= 500))

If n <= 100, or if the BLAS is OpenBLAS and n <= 500, use RecursiveFactorization; otherwise use the (provided) BLAS. So I think that's pretty close to what you measured.

I wonder if we could provide MKLLUFactorization which doesn't overload LBT?

ChrisRackauckas avatar Jul 18 '22 19:07 ChrisRackauckas
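
(Spelled out as code, the quoted heuristic amounts to something like the following; this is a paraphrase of default.jl at the time, and isopenblas is an internal, unexported LinearSolve helper.)

using LinearSolve

# Pure-Julia recursive LU for small systems, or medium systems when the
# BLAS is OpenBLAS; BLAS/LAPACK getrf otherwise.
function default_lu_choice(b)
    if length(b) <= 100 || (LinearSolve.isopenblas() && length(b) <= 500)
        return RFLUFactorization()
    else
        return LUFactorization()
    end
end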

Maybe we could even raise the 100; MKL only beat it at 160. MKL wins because it is much better at multithreading. I had a plot somewhere where my computer (Cascade Lake) matched MKL up to 500.

I'd also be interested in seeing single-threaded plots of both, which would be relevant for any code trying to do solves in parallel and thus single-threading the LU.

I wonder if we could provide MKLLUFactorization which doesn't overload LBT?

We could, by depending on MKL_jll and ccalling directly.

chriselrod avatar Jul 18 '22 22:07 chriselrod
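
(A rough sketch of what "depending on MKL_jll and ccalling directly" could look like for the LU factorization itself. It assumes libmkl_rt's default LP64 (32-bit integer) interface and the standard LAPACK dgetrf signature, and is not the wrapper LinearSolve ships.)

using MKL_jll

# Factor A in place with MKL's dgetrf, bypassing libblastrampoline.
# An ILP64 build of MKL would need Int64 integer arguments instead.
function mkl_getrf!(A::Matrix{Float64})
    m, n = size(A)
    ipiv = Vector{Int32}(undef, min(m, n))
    info = Ref{Int32}(0)
    ccall((:dgetrf_, MKL_jll.libmkl_rt), Cvoid,
          (Ref{Int32}, Ref{Int32}, Ptr{Float64}, Ref{Int32}, Ptr{Int32}, Ptr{Int32}),
          m, n, A, max(1, stride(A, 2)), ipiv, info)
    return A, ipiv, info[]
end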

I'd also be interested in seeing single-threaded plots of both

Here is the plot for single-threaded vs OpenBLAS (plot: lu_float64_1.7.3_skylake_1cores_OpenBLAS.png), and here is the corresponding plot for MKL (plot: lu_float64_1.7.3_skylake_1cores_MKL.png).

simonp0420 avatar Jul 19 '22 00:07 simonp0420

Let's take benchmarking discussions to https://github.com/SciML/LinearSolve.jl/issues/166, MKL to https://github.com/SciML/LinearSolve.jl/issues/164, and keep this issue for FastLUFactorization being slow.

ChrisRackauckas avatar Jul 19 '22 00:07 ChrisRackauckas

I'm almost done with some similar runs on Windows. Want me to put the plots in a comment in #166?

simonp0420 avatar Jul 19 '22 01:07 simonp0420

Yes please!

ChrisRackauckas avatar Jul 19 '22 01:07 ChrisRackauckas