Avoid materializing arrays in bidiag matmul
Currently, small Bidiagonal/Tridiagonal matrices are materialized in matrix multiplications, but this is wasteful and unnecessary. This PR changes this to use a naive matrix multiplication for small matrices, and fall back to the banded multiplication for larger ones. Multiplication by a Bidiagonal falls back to a banded matrix multiplication for all sizes in the current implementation, and iterates in a cache-friendly manner for the non-Bidiagonal matrix.
In certain cases, the matrices were being materialized if the non-structured matrix was small, even if the structured matrix was large. This is changed as well in this PR.
Some improvements in performance:
julia> B = Bidiagonal(rand(3), rand(2), :U); A = rand(size(B)...); C = similar(A);
julia> @btime mul!($C, $A, $B);
193.152 ns (6 allocations: 352 bytes) # nightly v"1.12.0-DEV.1034"
18.826 ns (0 allocations: 0 bytes) # This PR
julia> T = Tridiagonal(rand(99), rand(100), rand(99)); A = rand(2, size(T,2)); C = similar(A);
julia> @btime mul!($C, $A, $T);
9.398 μs (8 allocations: 79.94 KiB) # nightly
416.407 ns (0 allocations: 0 bytes) # This PR
julia> B = Bidiagonal(rand(300), rand(299), :U); A = rand(20000, size(B,2)); C = similar(A);
julia> @btime mul!($C, $A, $B);
33.395 ms (0 allocations: 0 bytes) # nightly
6.695 ms (0 allocations: 0 bytes) # This PR (cache-friendly)
Closes https://github.com/JuliaLang/julia/pull/55414
After 2d15c70, the performance of a Bidiagonal * Bidiagonal improves as the destination is accessed in a column-major order.
julia> BL = Bidiagonal([1:400;], [1:399;], :L);
julia> C = similar(BL, size(BL));
julia> @btime mul!($C, $BL, $BL);
28.031 μs (6 allocations: 6.38 KiB) # nightly
24.398 μs (0 allocations: 0 bytes) # This PR
Uff, this was supposed to supersede https://github.com/JuliaLang/julia/pull/55414, I guess?
Yes, I should probably have closed that one. I'll rebase this to resolve the conflicts.
I'll rebase this to resolve the conflicts.
I have done that already. In the resolution, I kept the zero-size early drop-out.
Thanks, that was quick!
Any idea why the apple-darwin addmul tests are failing?
test failures seem to be timeouts now
Gentle bump