LoopVectorization.jl
Fallback path: annotate `for` loops with `@simd`?
Here's the fallback path for the basic matmul kernel:
```julia
$(Expr(:inbounds, true))
local var"#1#val" = for n = 1:size(A, 1), m = 1:size(B, 2)
        #= REPL[2]:2 =#
        Cₙₘ = zero(eltype(C))
        #= REPL[2]:3 =#
        for k = 1:size(A, 2)
            #= REPL[2]:4 =#
            Cₙₘ = Base.FastMath.add_fast(Cₙₘ, Base.FastMath.mul_fast(A[n, k], B[k, m]))
        end
        #= REPL[2]:6 =#
        C[n, m] = Cₙₘ
    end
$(Expr(:inbounds, :pop))
var"#1#val"
```
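For reference, that lowering is consistent with a plain-Julia kernel like the one below. This is a hedged reconstruction, not code from the package; the function name `matmul_fallback!` is mine.

```julia
# Hedged reconstruction of the fallback matmul kernel: @inbounds and
# @fastmath applied to the whole loop nest, with no @simd annotation.
function matmul_fallback!(C, A, B)
    @inbounds @fastmath for n = 1:size(A, 1), m = 1:size(B, 2)
        Cₙₘ = zero(eltype(C))
        for k = 1:size(A, 2)
            Cₙₘ += A[n, k] * B[k, m]   # lowers to add_fast / mul_fast
        end
        C[n, m] = Cₙₘ
    end
    return C
end
```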
In the fallback path, `@inbounds` and `@fastmath` have both been applied. Would it also make sense to annotate the `for` loop in the fallback path with `@simd`? If the answer to the first question is yes, then the second question is: is it safe to annotate the `for` loop in the fallback path with `@simd ivdep` instead of `@simd`?
The primary reason the fallback doesn't add `@simd` is that it must be added to the innermost loop. It's easy to just prepend `@inbounds @fastmath` to the whole block, but `@simd [ivdep]` would require parsing the loop nest to find the innermost loops.
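To make the "parsing the loop nest" point concrete, here is a rough sketch of the kind of `Expr` walk that would be needed. `hasloop` and `isinnermost` are hypothetical helpers for illustration, not LoopVectorization's actual code.

```julia
# Hypothetical sketch: decide whether a `for` loop is innermost, i.e. its
# body contains no further `for` loop. Only such loops could receive @simd.
hasloop(::Any) = false
hasloop(ex::Expr) = ex.head === :for || any(hasloop, ex.args)

# For a :for Expr, args[1] is the iteration spec and args[2] the body block.
isinnermost(ex::Expr) = ex.head === :for && !hasloop(ex.args[2])
```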
As for `ivdep`, it can cause wrong answers in some cases where LoopVectorization won't:
```julia
using LoopVectorization, Random, BenchmarkTools

function simd_test!(m, x, y)
    @inbounds @simd for i ∈ eachindex(m, x, y)
        m[i] = x[i] ⊻ y[i]
    end
end

function ivdep_test!(m, x, y)
    @inbounds @simd ivdep for i ∈ eachindex(m, x, y)
        m[i] = x[i] ⊻ y[i]
    end
end

function avx_test!(m, x, y)
    @avx for i ∈ eachindex(m, x, y)
        m[i] = x[i] ⊻ y[i]
    end
end

x = bitrand(1000); y = bitrand(1000);
m0 = similar(x); m1 = similar(x); m2 = similar(x);
simd_test!(m0, x, y)
ivdep_test!(m1, x, y)
avx_test!(m2, x, y)
m0 == m1 # false
m0 == m2 # true
```
This yields:
```julia-repl
julia> x = bitrand(1000); y = bitrand(1000);

julia> m0 = similar(x); m1 = similar(x); m2 = similar(x);

julia> simd_test!(m0, x, y)

julia> ivdep_test!(m1, x, y)

julia> avx_test!(m2, x, y)

julia> m0 == m1 # false
false

julia> m0 == m2 # true
true

julia> @benchmark simd_test!($m0, $x, $y)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.524 μs (0.00% GC)
  median time:      1.538 μs (0.00% GC)
  mean time:        1.613 μs (0.00% GC)
  maximum time:     4.553 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

julia> @benchmark ivdep_test!($m0, $x, $y)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     603.955 ns (0.00% GC)
  median time:      605.525 ns (0.00% GC)
  mean time:        632.046 ns (0.00% GC)
  maximum time:     932.350 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     177

julia> @benchmark avx_test!($m0, $x, $y)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     16.217 ns (0.00% GC)
  median time:      16.761 ns (0.00% GC)
  mean time:        17.452 ns (0.00% GC)
  maximum time:     40.086 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     998
```
`@simd ivdep` is about 2.5x faster than `@simd`, but gets the wrong answer. Meanwhile, `@avx` is roughly 90x faster and gets the correct answer.
`ivdep`'s non-aliasing guarantees are unfortunately extremely strong. The reason we get the wrong answer above is that `BitArray` elements alias each other: the individual bits of a `BitArray` are packed into the same `UInt64`.
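That packing is easy to see directly. The snippet below pokes at `BitArray`'s `chunks` field, which is an internal implementation detail of Base, so this is illustration only:

```julia
# Each UInt64 chunk stores 64 bits, so neighboring BitArray elements share
# storage: writing b[1] rewrites the whole first chunk, not an isolated bit.
b = falses(128)
length(b.chunks)  # 2 chunks hold all 128 bits
b[1] = true
b.chunks[1]       # one bit set inside a word shared by 64 elements
```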
It'd be better if we had an equivalent of `restrict`, but that's hard to define generically. The closest thing is probably `Experimental.Const` with `Experimental.@aliasscope`, but that only applies to `Array`s, so there is no `BitArray` support, and certainly not the generic support the fallback loop is meant to provide.
So while I would like `@simd ivdep`, I think we may need to play it safer and use only `@simd`.
But it's worth checking whether, for example, `Octavian.PtrArray`s will SIMD without it.
But a motivation for `ivdep` is that it really does help a lot of the time; e.g., my Octavian PR defines a `PointerMatrix` type that needs `@simd ivdep` to SIMD:
```julia-repl
julia> using Octavian, VectorizationBase, LoopVectorization, BenchmarkTools

julia> X = rand(72, 240);

julia> Y = similar(X);

julia> pY = Octavian.PointerMatrix(stridedpointer(Y), size(Y));

julia> function unsafe_copyto!(B, A)
           @inbounds for i ∈ eachindex(B, A)
               B[i] = A[i]
           end
       end
unsafe_copyto! (generic function with 1 method)

julia> function unsafe_copyto_ivdep!(B, A)
           @inbounds @simd ivdep for i ∈ eachindex(B, A)
               B[i] = A[i]
           end
       end
unsafe_copyto_ivdep! (generic function with 1 method)

julia> function unsafe_copyto_simd!(B, A)
           @inbounds @simd for i ∈ eachindex(B, A)
               B[i] = A[i]
           end
       end
unsafe_copyto_simd! (generic function with 1 method)

julia> function unsafe_copyto_avx!(B, A)
           @avx for i ∈ eachindex(B, A)
               B[i] = A[i]
           end
       end
unsafe_copyto_avx! (generic function with 1 method)

julia> @benchmark unsafe_copyto!($pY, $X)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     11.538 μs (0.00% GC)
  median time:      11.807 μs (0.00% GC)
  mean time:        12.155 μs (0.00% GC)
  maximum time:     31.371 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark unsafe_copyto_ivdep!($pY, $X)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.152 μs (0.00% GC)
  median time:      2.172 μs (0.00% GC)
  mean time:        2.219 μs (0.00% GC)
  maximum time:     4.235 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     9

julia> @benchmark unsafe_copyto_simd!($pY, $X)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     5.232 μs (0.00% GC)
  median time:      5.248 μs (0.00% GC)
  mean time:        5.337 μs (0.00% GC)
  maximum time:     10.017 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     6

julia> @benchmark unsafe_copyto_avx!($pY, $X)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.070 μs (0.00% GC)
  median time:      2.081 μs (0.00% GC)
  mean time:        2.115 μs (0.00% GC)
  maximum time:     4.121 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     9
```
This is unfortunate. One of the advantages of packing during matmul (as done by my Octavian PR) is that it can copy data from arbitrary array types -- regardless of whether or not they're strided -- into strided arrays. These non-strided arrays aren't supported by LoopVectorization, but we still want our copy function to be fast.
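A minimal version of such a copy could simply keep the plain `@simd` (not `ivdep`) fallback for arbitrary `AbstractArray` sources, trading some speed for safety; the name `pack_copyto!` is hypothetical, not from the Octavian PR.

```julia
# Hedged sketch: copy from an arbitrary AbstractArray (e.g. a reshaped
# range, which is not strided in the pointer sense) into a plain buffer,
# using @simd rather than @simd ivdep so aliasing remains safe.
function pack_copyto!(dest::AbstractArray, src::AbstractArray)
    @inbounds @simd for i in eachindex(dest, src)
        dest[i] = src[i]
    end
    return dest
end
```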