julia icon indicating copy to clipboard operation
julia copied to clipboard

LinearAlgebra: use band index in structured broadcast

Open jishnub opened this issue 1 year ago • 7 comments

This adds a new BandIndex type internal to LinearAlgebra that parallels a CartesianIndex, and stores a band index and a linear index of an element along that band. If the index of the band is a compile-time constant (which is often the case with Diagonal/Bidiagonal/Tridiagonal matrices), constant-propagation may eliminate branches in indexing into the matrix, and directly forward the indexing to the corresponding diagonal. This is particularly important in broadcasting for these matrices, which acts band-wise.

An example of an improvement in performance with this PR:

julia> using LinearAlgebra

julia> T = Tridiagonal(rand(893999), rand(894000), rand(893999)); # a large matrix

julia> @btime $T .+ $T;
  5.387 ms (10 allocations: 20.46 MiB) # v"1.12.0-DEV.337"
  2.872 ms (10 allocations: 20.46 MiB) # This PR

julia> @btime $T + $T; # reference
  2.885 ms (10 allocations: 20.46 MiB)

This makes the broadcast operation as fast as the sum, where the latter adds the diagonals directly.

I'm not 100% certain why this works as well as it does, as the constant band index may get lost in the newindex computation. I suspect branch prediction somehow works around this and preserves the constant.

jishnub avatar Apr 14 '24 06:04 jishnub

Gentle bump

jishnub avatar May 07 '24 18:05 jishnub

Gentle bump

jishnub avatar May 19 '24 10:05 jishnub

If you make it to JuliaCon, it would IMO be worth raising in the survey or elsewhere that so many PRs like this one stall for no apparent reason.

jonas-schulze avatar May 19 '24 12:05 jonas-schulze

This looks good to me. I'm curious about a couple of things?

  1. What's the runtime impact on, say, addition of ::Diagonal + ::Tridiagonal?
  2. Does this have any latency/TTFX effects?
  3. Do we have specialized addition/subtraction/(other?) methods for structured matrices, that we can remove, because ::AbstractMatrix + ::AbstractMatrix has a fallback via broadcasting?

I think this is worth a NEWS item, so maybe package maintainers can pick this up and do something nice with it.

dkarrasch avatar May 20 '24 09:05 dkarrasch

Runtimes improve in general:

julia> using LinearAlgebra

julia> D = Diagonal(rand(1000)); T = Tridiagonal(rand(999), rand(1000), rand(999)); B = Bidiagonal(rand(1000), rand(999), :U);

julia> @btime $D .+ $B;
  8.201 μs (6 allocations: 15.75 KiB) # nightly v"1.12.0-DEV.558"
  2.775 μs (6 allocations: 15.75 KiB) # this PR

julia> @btime $D .+ $T;
  4.904 μs (10 allocations: 23.67 KiB) # nightly
  2.718 μs (10 allocations: 23.67 KiB) # This PR

julia> @btime $B .+ $T;
  10.841 μs (10 allocations: 23.67 KiB) # nigthly
  4.134 μs (10 allocations: 23.67 KiB) # This PR

Latency does increase in certain cases compared to v"1.12.0-DEV.560", although it improves in others:

julia> @time D .+ D; # no impact, as there is only one band
  0.110434 seconds (159.40 k allocations: 8.525 MiB, 99.92% compilation time) # nightly
  0.112720 seconds (173.04 k allocations: 9.321 MiB, 99.92% compilation time) # this PR

julia> @time B .+ B;
  0.567609 seconds (578.01 k allocations: 30.555 MiB, 29.94% gc time, 99.98% compilation time) # nightly
  0.457304 seconds (678.50 k allocations: 36.486 MiB, 99.97% compilation time) # this PR

julia> @time T .+ T;
  0.320546 seconds (192.58 k allocations: 10.372 MiB, 53.73% gc time, 99.97% compilation time) # nightly
  0.210033 seconds (271.85 k allocations: 15.096 MiB, 99.95% compilation time) # this PR

julia> @time D .+ B;
  0.290067 seconds (573.75 k allocations: 30.220 MiB, 99.96% compilation time) # nightly
  0.371920 seconds (699.10 k allocations: 37.395 MiB, 99.96% compilation time) # This PR

julia> @time D .+ T;
  0.324574 seconds (212.92 k allocations: 11.460 MiB, 52.94% gc time, 99.97% compilation time) # nightly
  0.202575 seconds (297.82 k allocations: 16.446 MiB, 99.94% compilation time) # This PR

julia> @time B .+ T;
  0.348824 seconds (228.11 k allocations: 12.224 MiB, 49.64% gc time, 99.96% compilation time) # nightly
  0.203640 seconds (316.88 k allocations: 17.549 MiB, 99.93% compilation time) # This PR

Often, there appears to be a significant GC time on nightly that is absent in this PR. I'm not quite certain why this is the case, and why there isn't this GC overhead in some cases. Taking the GC time out, it would appear that this PR increases latency somewhat in many of the broadcasting operations (although the GC time on nightly appears to persist across runs, and is fairly consistent).

Regarding removing methods, this is a bit tricky, as there may be array types that specialize addition/subtraction but not broadcasting, and they may not be compatible with the fallback that uses scalar indexing. Perhaps this is an over-cautious approach, but let's leave these in for now.

Re. the NEWS entry, are you thinking of making the BandIndex type public? I got the idea from BandedMatrices which implements a somewhat similar interface, and could potentially collaborate with this. E.g. indexing into a Band may produce a BandIndex, so that one may use indexing operations like A[Band(1)[3]] to access the third element along the band. There is an outstanding issue about this (https://github.com/JuliaLinearAlgebra/BandedMatrices.jl/issues/223), although that is lager is scope.

jishnub avatar May 20 '24 13:05 jishnub

Regarding removing methods: I checked and found, e.g., this method:

@commutative function (+)(A::Bidiagonal, B::Diagonal)

What would happen, if we removed this and the similar below this? This is completely restricted to LinearAlgebra types. If broadcasting knew how to determine the target type, and otherwise used broadcasting and copyto! via indexing into bands, this would be safe to remove, wouldn't it?

dkarrasch avatar May 20 '24 14:05 dkarrasch

Currently, InfiniteArrays defines BroadcastStyle(::(Type{<:Diagonal{<:Any,<:AbstractInfUnitRange}}) and similar methods, so this would return a LazyArrays.BroadcastMatrix instead of a Bidiagonal if we remove this method. While the values are identical, I suspect this would not recognize the structure anymore. Perhaps InfiniteArrays shouldn't be defining this, but it'll require some concerted effort to work around these cases.

jishnub avatar May 20 '24 15:05 jishnub