LoopVectorization.jl Fallback path: annotate `for` loops with `@simd`?

Here's the fallback path for the basic matmul kernel:

$(Expr(:inbounds, true))
local var"#1#val" = for n = 1:size(A, 1), m = 1:size(B, 2)
            #= REPL[2]:2 =#
            Cₙₘ = zero(eltype(C))
            #= REPL[2]:3 =#
            for k = 1:size(A, 2)
                #= REPL[2]:4 =#
                Cₙₘ = Base.FastMath.add_fast(Cₙₘ, Base.FastMath.mul_fast(A[n, k], B[k, m]))
            end
            #= REPL[2]:6 =#
            C[n, m] = Cₙₘ
        end
$(Expr(:inbounds, :pop))
var"#1#val"

In the fallback path, @inbounds and @fastmath have both been applied.

Would it also make sense to annotate the for loop in the fallback path with @simd?

Dec 28 '20 03:12 DilumAluthge

If the answer to the first question is yes, then the second question is: is it safe to annotate the for loop in the fallback path with @simd ivdep instead of @simd?

Dec 28 '20 03:12 DilumAluthge

The primary reason it doesn't add @simd is because it must be added to the innermost loop. It's easy to just add @inbounds @fastmath, but @simd [ivdep] would require parsing the loop-nest to find the inner loops.

As for ivdep, it can cause wrong answers in some cases where LoopVectorization wont:

using LoopVectorization, Random, BenchmarkTools
function simd_test!(m, x, y)
    @inbounds @simd for i ∈ eachindex(m,x,y)
        m[i] = x[i] ⊻ y[i]
    end
end
function ivdep_test!(m, x, y)
    @inbounds @simd ivdep for i ∈ eachindex(m,x,y)
        m[i] = x[i] ⊻ y[i]
    end
end
function avx_test!(m, x, y)
    @avx for i ∈ eachindex(m,x,y)
        m[i] = x[i] ⊻ y[i]
    end
end
x = bitrand(1000); y = bitrand(1000);
m0 = similar(x); m1 = similar(x); m2 = similar(x);
simd_test!(m0, x, y)
ivdep_test!(m1, x, y)
avx_test!(m2, x, y)
m0 == m1 # false
m0 == m2 # true

This yields:

julia> x = bitrand(1000); y = bitrand(1000);

julia> m0 = similar(x); m1 = similar(x); m2 = similar(x);

julia> simd_test!(m0, x, y)

julia> ivdep_test!(m1, x, y)

julia> avx_test!(m2, x, y)

julia> m0 == m1 # false
false

julia> m0 == m2 # true
true

julia> @benchmark simd_test!($m0, $x, $y)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.524 μs (0.00% GC)
  median time:      1.538 μs (0.00% GC)
  mean time:        1.613 μs (0.00% GC)
  maximum time:     4.553 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> @benchmark ivdep_test!($m0, $x, $y)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     603.955 ns (0.00% GC)
  median time:      605.525 ns (0.00% GC)
  mean time:        632.046 ns (0.00% GC)
  maximum time:     932.350 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     177

julia> @benchmark avx_test!($m0, $x, $y)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     16.217 ns (0.00% GC)
  median time:      16.761 ns (0.00% GC)
  mean time:        17.452 ns (0.00% GC)
  maximum time:     40.086 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     998

@simd ivdep is about 3x faster than @simd, but gets the wrong answer. Meanwhile, @avx is 90x faster and gets the correct answer.

Dec 28 '20 05:12 chriselrod

ivdep's non-aliasing guarantees are unfortunately extremely strong. The reason we get the wrong answer above is because BitArray elements alias themselves: the individual bits of the BitArray are part of the same UInt64. It'd be better if we had an equivalent of restrict, but that's hard to define generically. The closest thing is probably Experimental.Const with Experimental.@aliasscopes, but that only applies to Arrays, so no BitArray support and certainly not the generic support the fallback loop is meant to provide.

So while I would like @simd ivdep, I think we may need to play it safer and use only @simd. But it's worth checking if, for example, Octavian.PtrArrays will SIMD without it.

Dec 28 '20 05:12 chriselrod

But a motivation for @ivdep is that it really does help a lot of the time, e.g. my Octavian PR defines a PointerMatrix type that needs @simd ivdep to SIMD:

julia> using Octavian, VectorizationBase, LoopVectorization, BenchmarkTools

julia> X = rand(72, 240);

julia> Y = similar(X);

julia> pY = Octavian.PointerMatrix(stridedpointer(Y), size(Y));

julia> function unsafe_copyto!(B, A)
           @inbounds for i ∈ eachindex(B, A)
               B[i] = A[i]
           end
       end
unsafe_copyto! (generic function with 1 method)

julia> function unsafe_copyto_ivdep!(B, A)
           @inbounds @simd ivdep for i ∈ eachindex(B, A)
               B[i] = A[i]
           end
       end
unsafe_copyto_ivdep! (generic function with 1 method)

julia> function unsafe_copyto_simd!(B, A)
           @inbounds @simd for i ∈ eachindex(B, A)
               B[i] = A[i]
           end
       end
unsafe_copyto_simd! (generic function with 1 method)

julia> function unsafe_copyto_avx!(B, A)
           @avx for i ∈ eachindex(B, A)
               B[i] = A[i]
           end
       end
unsafe_copyto_avx! (generic function with 1 method)

julia> @benchmark unsafe_copyto!($pY, $X)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     11.538 μs (0.00% GC)
  median time:      11.807 μs (0.00% GC)
  mean time:        12.155 μs (0.00% GC)
  maximum time:     31.371 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark unsafe_copyto_ivdep!($pY, $X)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.152 μs (0.00% GC)
  median time:      2.172 μs (0.00% GC)
  mean time:        2.219 μs (0.00% GC)
  maximum time:     4.235 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     9

julia> @benchmark unsafe_copyto_simd!($pY, $X)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     5.232 μs (0.00% GC)
  median time:      5.248 μs (0.00% GC)
  mean time:        5.337 μs (0.00% GC)
  maximum time:     10.017 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     6

julia> @benchmark unsafe_copyto_avx!($pY, $X)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.070 μs (0.00% GC)
  median time:      2.081 μs (0.00% GC)
  mean time:        2.115 μs (0.00% GC)
  maximum time:     4.121 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     9

This is unfortunate. One of the advantages of packing during matmul (as done by my Octavion PR) is that it would copy data from arbitrary array types -- regardless of whether or not they're strided -- into strided arrays. These non-strided arrays aren't supported by LoopVectorization, but we still want out copy function to be fast.

Dec 28 '20 07:12 chriselrod

LoopVectorization.jl LoopVectorization.jl copied to clipboard

Fallback path: annotate `for` loops with `@simd`?

LoopVectorization.jl
LoopVectorization.jl copied to clipboard