Tullio.jl icon indicating copy to clipboard operation
Tullio.jl copied to clipboard

Banded matrix multiplication does not LoopVectorize

Open jagot opened this issue 4 years ago • 5 comments

I have the slightly unusual use-case dense matrix = dense matrix * banded matrix (actually more complicated than this, but this shows the issue). With the help of @mcabbott I managed to implement the following:

using LinearAlgebra
using BandedMatrices
using Tullio
using LoopVectorization

function clear_off_bands!(A::BandedMatrix{T}) where T
    m,n = size(A.data)
    l,u = bandwidths(A)

    for j = 1:u
        A.data[1:u-j+1,j] .= zero(T)
    end
    mpn = m+n
    for j = n-l+1:n
        A.data[mpn-l-j+1:end,j] .= zero(T)
    end
    A
end

function my_mul!(w, v, A::BandedMatrix)
    # This implementation assumes that the off-band values, i.e. the
    # upper-left and lower-right corners, are zero, cf. clear_off_bands!
    l,u = bandwidths(A)
    m,n = size(A)
    @tullio verbose=2 w[i,j] = begin
        k = min(max(h + j - u - 1, 1), m)
        v[i,k] * A.data[h,j]
    end
    w
end

n = 4000
v = Matrix((1.0+0im)*I, n, n) # Complex identity matrix
w = similar(v)
w2 = similar(v)
w3 = similar(v)

A = clear_off_bands!(brand(ComplexF64, n, n, 5, 5))
At = BandedMatrix(transpose(A))

@time mul!(w, v, A) # This is exceedingly slow, since this uses generic dense multiplication
@time mul!(transpose(w2), At, transpose(v)) # This will BLAS
@time my_mul!(w3, v, A)

@show norm(w-w2), norm(w-w3), norm(A-w3)

which gives a warning

┌ Warning: LoopVectorization failed
│   err =
│    LoadError: ArgumentError: Collection has multiple elements, must contain exactly 1 element
│    in expression starting at /Users/jagot/.julia/packages/Tullio/bgqFi/src/macro.jl:1092
└ @ Tullio ~/.julia/packages/Tullio/bgqFi/src/macro.jl:1118

but other than that seems to work well. On my laptop I get the following timings:

  • Naïve dense: 76.651751 s
  • BLAS with transposes: 1.152771 seconds
  • Tullio w/o AVX: 0.529101 seconds 🎉

cc @chriselrod

jagot avatar Feb 22 '21 10:02 jagot

I forgot that LoopVectorization does not work with complex numbers out-of-the box, but the same warning appears when changing to Ints in the example above.

jagot avatar Feb 22 '21 13:02 jagot

Good point. But the error here is generated when expanding the macro. The expression it tries to expand is this:

┌ Info: =====LV===== LoopVectorization actor 
│   verbosetidy(lex) =
│    quote
│        local @inline(function 𝒜𝒸𝓉!(::Type{<:Array{<:Union{Base.HWReal, Bool}}}, ℛ::AbstractArray{𝒯}, v, A, 𝒶𝓍i, 𝒶𝓍j, 𝒶𝓍h, ♻️ = nothing, 💀 = true) where 𝒯
│                    @info "running LoopVectorization actor " maxlog = 3 _id = 0x972e86e5f73d041f
│                    LoopVectorization.check_args(v, A) || @error("rejected by LoopVectorization's check_args! ", maxlog = 3, _id = 0x972e86e5f73d041f)
│                    LoopVectorization.@avx unroll = 0 for j = 𝒶𝓍j
│                            for i = 𝒶𝓍i
│                                𝒜𝒸𝒸 = if ♻️ === nothing
│                                        zero(𝒯)
│                                    else
│                                        ℛ[i, j]
│                                    end
│                                for h = 𝒶𝓍h
│                                    𝒜𝒸𝒸 = 𝒜𝒸𝒸 + begin
│                                                k = min(max(((h + j) - u) - 1, 1), m)
│                                                v[i, k] * A.data[h, j]
│                                            end
│                                end
│                                ℛ[i, j] = 𝒜𝒸𝒸
│                            end
│                        end
│                end)
└    end
┌ Warning: LoopVectorization failed 
│   err =
│    LoadError: ArgumentError: Collection has multiple elements, must contain exactly 1 element

If you re-organise that as follows, then it seems that @avx is happy, although I haven't tested this:

@macroexpand @avx for j = 𝒶𝓍j
    for i = 𝒶𝓍i
        𝒜𝒸𝒸 = if ♻️ === nothing
                zero(𝒯)
            else
                ℛ[i, j]
            end
        for h = 𝒶𝓍h
            k = min(max(((h + j) - u) - 1, 1), m)
            𝒜𝒸𝒸 = 𝒜𝒸𝒸 + v[i, k] * A.data[h, j]
        end
        ℛ[i, j] = 𝒜𝒸𝒸
    end
end

It might not be hard to always re-organise multi-line expressions like this.

mcabbott avatar Feb 22 '21 13:02 mcabbott

However, in this case I think the expression should be equivalent to this:

@tullio w[i,j] = v[i,clamp(h + j - $u - 1)] * A.data[h,j]

clamp and mod (with one argument) are specially recognised by the macro, and it will not try to infer ranges from the argument.

mcabbott avatar Feb 22 '21 13:02 mcabbott

It might not be hard to always re-organise multi-line expressions like this.

Meaning the last line of the block is the summand, basically?

jagot avatar Feb 22 '21 13:02 jagot

It might not be hard to always re-organise multi-line expressions like this.

I could fix that in LoopVectorization as well. This checks that there is only 1 element in the block. It should instead walk begin:end-1 and extract the last.

chriselrod avatar Feb 22 '21 14:02 chriselrod