Banded matrix multiplication does not LoopVectorize
I have the slightly unusual use-case dense matrix = dense matrix * banded matrix (actually more complicated than this, but this shows the issue). With the help of @mcabbott I managed to implement the following:
using LinearAlgebra
using BandedMatrices
using Tullio
using LoopVectorization
function clear_off_bands!(A::BandedMatrix{T}) where T
m,n = size(A.data)
l,u = bandwidths(A)
for j = 1:u
A.data[1:u-j+1,j] .= zero(T)
end
mpn = m+n
for j = n-l+1:n
A.data[mpn-l-j+1:end,j] .= zero(T)
end
A
end
function my_mul!(w, v, A::BandedMatrix)
# This implementation assumes that the off-band values, i.e. the
# upper-left and lower-right corners, are zero, cf. clear_off_bands!
l,u = bandwidths(A)
m,n = size(A)
@tullio verbose=2 w[i,j] = begin
k = min(max(h + j - u - 1, 1), m)
v[i,k] * A.data[h,j]
end
w
end
n = 4000
v = Matrix((1.0+0im)*I, n, n) # Complex identity matrix
w = similar(v)
w2 = similar(v)
w3 = similar(v)
A = clear_off_bands!(brand(ComplexF64, n, n, 5, 5))
At = BandedMatrix(transpose(A))
@time mul!(w, v, A) # This is exceedingly slow, since this uses generic dense multiplication
@time mul!(transpose(w2), At, transpose(v)) # This will BLAS
@time my_mul!(w3, v, A)
@show norm(w-w2), norm(w-w3), norm(A-w3)
which gives a warning
┌ Warning: LoopVectorization failed
│ err =
│ LoadError: ArgumentError: Collection has multiple elements, must contain exactly 1 element
│ in expression starting at /Users/jagot/.julia/packages/Tullio/bgqFi/src/macro.jl:1092
└ @ Tullio ~/.julia/packages/Tullio/bgqFi/src/macro.jl:1118
but other than that seems to work well. On my laptop I get the following timings:
- Naïve dense: 76.651751 s
- BLAS with transposes: 1.152771 seconds
- Tullio w/o AVX: 0.529101 seconds 🎉
cc @chriselrod
I forgot that LoopVectorization does not work with complex numbers out-of-the box, but the same warning appears when changing to Ints in the example above.
Good point. But the error here is generated when expanding the macro. The expression it tries to expand is this:
┌ Info: =====LV===== LoopVectorization actor
│ verbosetidy(lex) =
│ quote
│ local @inline(function 𝒜𝒸𝓉!(::Type{<:Array{<:Union{Base.HWReal, Bool}}}, ℛ::AbstractArray{𝒯}, v, A, 𝒶𝓍i, 𝒶𝓍j, 𝒶𝓍h, ♻️ = nothing, 💀 = true) where 𝒯
│ @info "running LoopVectorization actor " maxlog = 3 _id = 0x972e86e5f73d041f
│ LoopVectorization.check_args(v, A) || @error("rejected by LoopVectorization's check_args! ", maxlog = 3, _id = 0x972e86e5f73d041f)
│ LoopVectorization.@avx unroll = 0 for j = 𝒶𝓍j
│ for i = 𝒶𝓍i
│ 𝒜𝒸𝒸 = if ♻️ === nothing
│ zero(𝒯)
│ else
│ ℛ[i, j]
│ end
│ for h = 𝒶𝓍h
│ 𝒜𝒸𝒸 = 𝒜𝒸𝒸 + begin
│ k = min(max(((h + j) - u) - 1, 1), m)
│ v[i, k] * A.data[h, j]
│ end
│ end
│ ℛ[i, j] = 𝒜𝒸𝒸
│ end
│ end
│ end)
└ end
┌ Warning: LoopVectorization failed
│ err =
│ LoadError: ArgumentError: Collection has multiple elements, must contain exactly 1 element
If you re-organise that as follows, then it seems that @avx is happy, although I haven't tested this:
@macroexpand @avx for j = 𝒶𝓍j
for i = 𝒶𝓍i
𝒜𝒸𝒸 = if ♻️ === nothing
zero(𝒯)
else
ℛ[i, j]
end
for h = 𝒶𝓍h
k = min(max(((h + j) - u) - 1, 1), m)
𝒜𝒸𝒸 = 𝒜𝒸𝒸 + v[i, k] * A.data[h, j]
end
ℛ[i, j] = 𝒜𝒸𝒸
end
end
It might not be hard to always re-organise multi-line expressions like this.
However, in this case I think the expression should be equivalent to this:
@tullio w[i,j] = v[i,clamp(h + j - $u - 1)] * A.data[h,j]
clamp and mod (with one argument) are specially recognised by the macro, and it will not try to infer ranges from the argument.
It might not be hard to always re-organise multi-line expressions like this.
Meaning the last line of the block is the summand, basically?
It might not be hard to always re-organise multi-line expressions like this.
I could fix that in LoopVectorization as well. This checks that there is only 1 element in the block. It should instead walk begin:end-1 and extract the last.