BandedMatrices.jl

Performance relative to sparse

Open jlperla opened this issue 5 years ago • 7 comments

On Julia 1.1, I did a quick performance test and found that banded matrices work well for the linear solve but are much slower than plain sparse matrices for other operations. Does this make sense?

Try

using BandedMatrices, SparseArrays, BenchmarkTools
N = 10000
A = brand(N,N,4,3)
As = sparse(A)
b = randn(N)
println("As sparse")
@btime $As*$As
@btime $As*$b
@btime $As\$b
println("As Banded")
@btime $A*$A
@btime $A*$b
@btime $A\$b

which gave

As sparse
  2.608 ms (8 allocations: 10.79 MiB)
  78.500 μs (2 allocations: 78.20 KiB)
  17.916 ms (73 allocations: 15.55 MiB)
As Banded
  365.126 ms (8 allocations: 1.14 MiB)
  183.199 μs (3 allocations: 78.25 KiB)
  1.857 ms (10 allocations: 1.07 MiB)

Now the solve is very impressive! But the others are significantly slower. Did I do something wrong in the test?

This can also be compared against Tridiagonal

using BandedMatrices, SparseArrays, BenchmarkTools, LinearAlgebra
N = 10000
A = brand(N,N,1,1)
As = sparse(A)
Atd = Tridiagonal(A)
b = randn(N)
println("As sparse")
@btime $As*$As
@btime $As*$b
@btime $As\$b
println("As Tridiagonal")
@btime $Atd*$Atd
@btime $Atd*$b
@btime $Atd\$b
println("As Banded")
@btime $A*$A
@btime $A*$b
@btime $A\$b

which gives

As sparse
  453.001 μs (8 allocations: 1.60 MiB)
  39.299 μs (2 allocations: 78.20 KiB)
  11.544 ms (73 allocations: 10.45 MiB)
As Tridiagonal
  11.969 s (6 allocations: 2.24 GiB)
  13.999 μs (2 allocations: 78.20 KiB)
  303.699 μs (15 allocations: 469.34 KiB)
As Banded
  1.258 ms (8 allocations: 391.05 KiB)
  212.701 μs (3 allocations: 78.25 KiB)
  1.241 ms (10 allocations: 469.11 KiB)

jlperla commented May 16 '19 23:05

I don't see the same timings as you:

As sparse
  1.443 ms (8 allocations: 10.79 MiB)
  89.451 μs (2 allocations: 78.20 KiB)
  16.090 ms (73 allocations: 15.55 MiB)
As Tridiagonal
  15.367 s (6 allocations: 2.24 GiB)
  11.517 μs (2 allocations: 78.20 KiB)
  243.495 μs (15 allocations: 469.34 KiB)
As Banded
  16.961 ms (8 allocations: 1.14 MiB)
  50.714 μs (3 allocations: 78.25 KiB)
  1.436 ms (10 allocations: 1.07 MiB)

But I believe this is partly because OpenBLAS is slow for banded matrices. Let me try with MKL when I get to the office.

Note that for small bandwidths it's more about the memory usage than speed. But for large bandwidths, banded * banded multiplication is far from optimal.
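
For reference, the product of two banded matrices is itself banded with the bandwidths adding, which is why the memory footprint stays small even when the flop count grows. A quick check along the lines of the first script:

using BandedMatrices
A = brand(10_000, 10_000, 4, 3)
bandwidths(A * A)  # (8, 6): lower and upper bandwidths add under multiplication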

dlfivefifty commented May 17 '19 06:05

That is good to see. I don't think banded * banded is the crucial operation in any algorithm, but the matrix-vector product is. I am glad you asked about MKL...

For MKL, I have an RA trying to patch together a feasible solution for replacing OpenBLAS with MKL for mortals. It involves:

  • Double-checking that MKL.jl can be built into the system image. Not sure if there is any missing functionality?
  • MKLSparse.jl, where we would make sure that standard sparse operations such as A \ b work for general matrices, using the MKL operations instead of SuiteSparse
  • Arpack.jl support for MKL builds, which basically involves a separate https://github.com/arnavs/ArpackMKLBuilder for the binaries, where we would patch Arpack.jl to use it when MKL.jl is installed.

Now... if we do all that and want to use BandedMatrices.jl and BlockBandedMatrices.jl, is the lack of OpenBLAS going to break anything? Do we need to add MKL bindings to either of those libraries, or will MKL.jl already handle that?
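
As a quick sanity check of which BLAS a given session is actually linked against (this assumes the BLAS.vendor() API available on Julia 1.x of this era):

using LinearAlgebra
LinearAlgebra.BLAS.vendor()  # :openblas by default; :mkl when Julia is built against MKL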

jlperla commented May 17 '19 17:05

Also: is a banded matrix with one upper and one lower band at all binary-compatible with a tridiagonal matrix, in a way that would allow reinterpretation? If so, could a specialization be added in those cases to use the tridiagonal code?

jlperla commented May 17 '19 17:05

is the lack of OpenBLAS going to break anything?

BandedMatrices.jl calls BLAS.gbmv! and friends, so it should be fine. Building Julia from source with MKL works great, though unfortunately I couldn't get it to build; I recall it was 4x faster than OpenBLAS, bringing it in line with sparse. This was mentioned in my blog post.
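
Roughly, that lowering looks like the sketch below; bandeddata is assumed here to expose the LAPACK band-storage buffer that BandedMatrices.jl keeps internally, and the exact accessor and call path may differ:

using BandedMatrices, LinearAlgebra
A = brand(10_000, 10_000, 4, 3)
b = randn(10_000)
l, u = bandwidths(A)
y = zeros(size(A, 1))
# gbmv! takes the matrix in band storage, with row u + 1 holding the main diagonal
BLAS.gbmv!('N', size(A, 1), l, u, 1.0, BandedMatrices.bandeddata(A), b, 0.0, y)
y ≈ A * b  # should hold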

Also: is a banded matrix with one upper and one lower band at all binary-compatible with a tridiagonal matrix, in a way that would allow reinterpretation?

No. But I think the speedup is due to native Julia code being faster than BLAS for low bandwidths. I believe a well-written BandedMatrix which knows the bandwidths at compile time should actually be faster as there is more memory locality, but that's outside the scope.
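
Even without reinterpretation, an O(N) copy of the three bands into a Tridiagonal recovers the fast solve whenever the bandwidths happen to be (1,1). A minimal sketch, where maybe_tridiag_solve is just an illustrative name rather than anything in the package:

using BandedMatrices, LinearAlgebra

function maybe_tridiag_solve(A, b)
    if bandwidths(A) == (1, 1)
        # copy the sub-, main and super-diagonals and use the specialized tridiagonal solver
        return Tridiagonal(A[band(-1)], A[band(0)], A[band(1)]) \ b
    else
        return A \ b
    end
end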

Short story: don't expect magic for small bandwidths.

dlfivefifty commented May 17 '19 19:05

Gotcha. @arnavs Make sure to add BandedMatrices.jl and https://github.com/JuliaMatrices/BlockBandedMatrices.jl to the list of stuff to test after you get MKL and MKLSparse up and running.

jlperla commented May 17 '19 19:05

For what it's worth, I'm working on some code that involves a lot of multiplications between matrices with single bands. I had naively assumed this would be screaming fast compared to sparse and didn't even bother benchmarking that part until recently. It looks like I might have to whip up my own custom types to support fast multiplication.

MasonProtter commented Apr 03 '20 16:04

This shouldn't require a special type: a special case for bandwidth(A,1) == -bandwidth(A,2) in matrix multiplication would suffice. Lowering to broadcasting would probably be the most efficient/general approach here.
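
A minimal sketch of that broadcasting idea, for a square matrix whose only stored band sits at offset k; singleband_mul and its details are illustrative rather than the package's actual dispatch:

using BandedMatrices

function singleband_mul(A, b)
    l, u = bandwidths(A)
    l == -u || throw(ArgumentError("A must have a single nonzero band"))
    k = u                         # offset of the only stored diagonal
    d = A[band(k)]                # its entries, length N - abs(k)
    y = zeros(promote_type(eltype(A), eltype(b)), size(A, 1))
    if k >= 0
        @views y[1:end-k] .= d .* b[1+k:end]  # y[i] = d[i] * b[i+k]
    else
        @views y[1-k:end] .= d .* b[1:end+k]  # y[j-k] = d[j] * b[j]
    end
    return y
end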

dlfivefifty commented Apr 04 '20 19:04