BandedMatrices.jl
Performance relative to sparse
On Julia 1.1, I did a quick performance test and am finding that banded matrices work well for the linear solve, but are much slower than plain sparse matrices for other operations. Does this make sense?
Try
using BandedMatrices, SparseArrays, BenchmarkTools
N = 10000
A = brand(N,N,4,3)
As = sparse(A)
b = randn(N)
println("As sparse")
@btime $As*$As
@btime $As*$b
@btime $As\$b
println("As Banded")
@btime $A*$A
@btime $A*$b
@btime $A\$b
which gave
As sparse
2.608 ms (8 allocations: 10.79 MiB)
78.500 μs (2 allocations: 78.20 KiB)
17.916 ms (73 allocations: 15.55 MiB)
As Banded
365.126 ms (8 allocations: 1.14 MiB)
183.199 μs (3 allocations: 78.25 KiB)
1.857 ms (10 allocations: 1.07 MiB)
Now the solve is very impressive! But the others are significantly slower. Did I do something wrong in the test?
This can also be compared against Tridiagonal
using BandedMatrices, SparseArrays, BenchmarkTools, LinearAlgebra
N = 10000
A = brand(N,N,1,1)
As = sparse(A)
Atd = Tridiagonal(A)
b = randn(N)
println("As sparse")
@btime $As*$As
@btime $As*$b
@btime $As\$b
println("As Tridiagonal")
@btime $Atd*$Atd
@btime $Atd*$b
@btime $Atd\$b
println("As Banded")
@btime $A*$A
@btime $A*$b
@btime $A\$b
which gives
As sparse
453.001 μs (8 allocations: 1.60 MiB)
39.299 μs (2 allocations: 78.20 KiB)
11.544 ms (73 allocations: 10.45 MiB)
As Tridiagonal
11.969 s (6 allocations: 2.24 GiB)
13.999 μs (2 allocations: 78.20 KiB)
303.699 μs (15 allocations: 469.34 KiB)
As Banded
1.258 ms (8 allocations: 391.05 KiB)
212.701 μs (3 allocations: 78.25 KiB)
1.241 ms (10 allocations: 469.11 KiB)
I don't see the same timings as you:
As sparse
1.443 ms (8 allocations: 10.79 MiB)
89.451 μs (2 allocations: 78.20 KiB)
16.090 ms (73 allocations: 15.55 MiB)
As Tridiagonal
15.367 s (6 allocations: 2.24 GiB)
11.517 μs (2 allocations: 78.20 KiB)
243.495 μs (15 allocations: 469.34 KiB)
As Banded
16.961 ms (8 allocations: 1.14 MiB)
50.714 μs (3 allocations: 78.25 KiB)
1.436 ms (10 allocations: 1.07 MiB)
But I believe this is partly because OpenBLAS is slow for banded matrices. Let me try with MKL when I get to the office.
Note that for small bandwidths it's more about memory usage than speed, but for large bandwidths banded * banded is far from optimal.
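To make the memory point concrete, here is a rough comparison (my own sketch; Base.summarysize is only a convenient proxy for storage cost):
using BandedMatrices, SparseArrays
N = 10_000
A  = brand(N, N, 4, 3)    # 4 + 3 + 1 = 8 stored bands, roughly 8N values
As = sparse(A)            # stored values plus row indices and column pointers
Ad = Matrix(A)            # dense reference: N^2 values, ~800 MB of Float64
Base.summarysize(A), Base.summarysize(As), Base.summarysize(Ad)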
That is good to see. I don't think that banded * banded is the crucial operation in any algorithm, but matrix-vector multiplication is. I am glad you asked about MKL...
For MKL, I have an RA trying to patch together a feasible solution for replacing OpenBLAS with MKL for mortals. It involves:
- Double-checking that MKL.jl can be built into the system image. Not sure if there is any missing functionality? (A quick check of which BLAS a build actually linked is sketched after this list.)
- MKLSparse.jl, where we would make sure that standard sparse operations such as A \ b work for general matrices, using the MKL operations instead of SuiteSparse.
- Arpack.jl support for MKL builds, which basically involves a separate https://github.com/arnavs/ArpackMKLBuilder for the binaries, where we would patch Arpack.jl to use it when MKL.jl is installed.
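On the first point, a quick sanity check of which BLAS a given Julia build actually linked (assuming BLAS.vendor() is a sufficient check for this) is:
using LinearAlgebra
LinearAlgebra.BLAS.vendor()   # returns :openblas or :mkl depending on how Julia was built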
Now... if we do all that and want to use BandedMatrices.jl and BlockBandedMatrices.jl, is the lack of OpenBLAS going to break anything? Do we need to add MKL bindings to either of those libraries, or will MKL.jl already do that?
Also: Is a banded matrix with 1 upper and 1 lower band binary-compatible with a tridiagonal matrix, which would allow reinterpretation? If so, could a specialization be done in those cases to use the tridiagonal code?
is the lack of OpenBLAS going to break anything?
BandedMatrices.jl calls BLAS.gbmv! and friends, so it should be fine. Building Julia from source with MKL works great. Unfortunately I couldn't get it to build, but I recall that it was 4x faster than OpenBLAS, bringing it in line with sparse. This was mentioned in my blog post.
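As a concrete illustration (my own sketch, not from the package docs), the in-place banded matrix-vector product is the call that ends up in BLAS.gbmv!, so whichever BLAS Julia is linked against is what mul! will hit:
using BandedMatrices, LinearAlgebra
N = 10_000
A = brand(N, N, 4, 3)
x = randn(N)
y = similar(x)
mul!(y, A, x)   # in-place banded mat-vec; forwarded to BLAS.gbmv! under the hood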
Also: Is a banded matrix with 1 upper and 1 lower at all binary compatible with a tridiagonal matrix which would allow reinterpretation?
No. But I think the speedup is due to native Julia code being faster than BLAS for low bandwidths. I believe a well-written BandedMatrix which knows the bandwidths at compile time should actually be faster, as there is more memory locality, but that's outside the scope.
Short story: don't expect magic for small bandwidths.
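To make the compile-time-bandwidth idea concrete, here is a hypothetical sketch (the type, its layout, and bandedmul! are my own illustration, not an existing part of BandedMatrices.jl) where the bandwidths are type parameters, so the band window is a constant the compiler can unroll:
# Hypothetical: L sub-diagonals and U super-diagonals carried in the type.
struct StaticBanded{L,U,T} <: AbstractMatrix{T}
    data::Matrix{T}    # (L + U + 1) × n band storage; A[i,j] lives at data[U + 1 + i - j, j]
end
Base.size(A::StaticBanded) = (size(A.data, 2), size(A.data, 2))
Base.getindex(A::StaticBanded{L,U,T}, i::Int, j::Int) where {L,U,T} =
    -U <= i - j <= L ? A.data[U + 1 + i - j, j] : zero(T)

# Mat-vec where the band window -U:L is known at compile time.
function bandedmul!(y::AbstractVector, A::StaticBanded{L,U}, x::AbstractVector) where {L,U}
    n = length(x)
    fill!(y, zero(eltype(y)))
    @inbounds for j in 1:n, k in -U:L
        i = j + k
        1 <= i <= n || continue
        y[i] += A.data[U + 1 + k, j] * x[j]
    end
    return y
end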
Gotcha. @arnavs Make sure to add BandedMatrices.jl and https://github.com/JuliaMatrices/BlockBandedMatrices.jl to the list of stuff to test after you get MKL and MKLSparse up and running.
For what it's worth, I'm working on some code that involves a lot of multiplications between matrices with single bands and I had naively assumed this would just be screaming fast compared to sparse and didn't even bother benchmarking that part until recently. It looks like I might have to whip up my own custom types to support fast multiplication.
This shouldn't require a special type: an if bandwidth(A,1) == -bandwidth(A,2) special case in matrix multiplication would suffice. Lowering to broadcasting would probably be the most efficient/general approach here.
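For illustration only (my own sketch; the constructor and band conventions used here are assumptions, not the package's multiplication code), a single-band times single-band product reduces to an elementwise product of the two stored bands:
using BandedMatrices, FillArrays

# Assumes both inputs store exactly one super-diagonal, at offsets a >= 0 and b >= 0.
function singleband_mul(A::BandedMatrix, B::BandedMatrix)
    a, b = bandwidth(A, 2), bandwidth(B, 2)
    @assert bandwidth(A, 1) == -a && bandwidth(B, 1) == -b "expected single-band inputs"
    n = size(A, 1)
    C = BandedMatrix(Zeros(n, n), (0, a + b))   # result lives on the single band at offset a + b
    for i in 1:(n - a - b)
        # (A*B)[i, i+a+b] = A[i, i+a] * B[i+a, i+a+b]
        C[i, i + a + b] = A[i, i + a] * B[i + a, i + a + b]
    end
    return C
end
A single-band result could also be stored with bandwidths (-(a + b), a + b); (0, a + b) is used here only to keep the sketch conservative.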