LoopVectorization.jl
AssertionError: !(iszero(length(loopstartᵢ)))
Not sure if I'm doing something illegal here, but I get an assertion error from the following set of loops:
using LoopVectorization

function avxbug(U = randn(2,2), E1 = randn(2))
    t = 1
    a = 1.0
    _s = 0.0
    k, n = size(U)
    @avx for j = 1:k
        for i = 1:n
            u = tanh(a * U[i,j])
            v = a * (1 - t * t)
            U[i,j] = u
            _s += v
        end
        E1[j] = _s / n
    end
end

avxbug()
julia> avxbug()
ERROR: AssertionError: !(iszero(length(loopstartᵢ)))
Stacktrace:
[1] add_loop_start_stop_manager!(::LoopVectorization.LoopSet) at /home/fredrikb/.julia/packages/LoopVectorization/ii0y6/src/loopstartstopmanager.jl:163
[2] lower_unrollspec(::LoopVectorization.LoopSet) at /home/fredrikb/.julia/packages/LoopVectorization/ii0y6/src/lowering.jl:662
[3] lower(::LoopVectorization.LoopSet, ::Array{Symbol,1}, ::Symbol, ::Symbol, ::Symbol, ::Int64, ::Int64, ::Bool) at /home/fredrikb/.julia/packages/LoopVectorization/ii0y6/src/lowering.jl:673
[4] lower_and_split_loops(::LoopVectorization.LoopSet, ::Int64) at /home/fredrikb/.julia/packages/LoopVectorization/ii0y6/src/split_loops.jl:121
[5] lower_and_split_loops(::LoopVectorization.LoopSet, ::Int64) at /home/fredrikb/.julia/packages/LoopVectorization/ii0y6/src/split_loops.jl:106
[6] avx_body(::LoopVectorization.LoopSet, ::Tuple{Int8,Int8,Int8,Int64}) at /home/fredrikb/.julia/packages/LoopVectorization/ii0y6/src/reconstruct_loopset.jl:455
[7] #s134#86 at /home/fredrikb/.julia/packages/LoopVectorization/ii0y6/src/reconstruct_loopset.jl:499 [inlined]
[8] #s134#86(::Any, ::Any, ::Any, ::Any, ::Any, ::Any, ::Any, ::Type{T} where T, ::Type{T} where T, ::Type{T} where T, ::Type{T} where T, ::Any, ::Any, ::Any) at ./none:0
[9] (::Core.GeneratedFunctionStub)(::Any, ::Vararg{Any,N} where N) at ./boot.jl:527
[10] avxbug(::Array{Float64,2}, ::Array{Float64,1}) at /home/fredrikb/arl/bird_clustering/ica.jl:41
[11] avxbug() at /home/fredrikb/arl/bird_clustering/ica.jl:37
[12] top-level scope at none:1
If I replace _s += v with _s += 1.0 I instead get
julia> avxbug()
ERROR: InexactError: trunc(Int64, Inf)
Stacktrace:
[1] trunc at ./float.jl:703 [inlined]
[2] round at ./float.jl:367 [inlined]
[3] determine_unroll_factor(::LoopVectorization.LoopSet, ::Array{Symbol,1}, ::Symbol) at /home/fredrikb/.julia/packages/LoopVectorization/ii0y6/src/determinestrategy.jl:315
[4] choose_order_cost at /home/fredrikb/.julia/packages/LoopVectorization/ii0y6/src/determinestrategy.jl:1047 [inlined]
[5] lower_and_split_loops(::LoopVectorization.LoopSet, ::Int64) at /home/fredrikb/.julia/packages/LoopVectorization/ii0y6/src/split_loops.jl:100
[6] lower_and_split_loops(::LoopVectorization.LoopSet, ::Int64) at /home/fredrikb/.julia/packages/LoopVectorization/ii0y6/src/split_loops.jl:106
[7] avx_body(::LoopVectorization.LoopSet, ::Tuple{Int8,Int8,Int8,Int64}) at /home/fredrikb/.julia/packages/LoopVectorization/ii0y6/src/reconstruct_loopset.jl:455
[8] #s134#86 at /home/fredrikb/.julia/packages/LoopVectorization/ii0y6/src/reconstruct_loopset.jl:499 [inlined]
[9] #s134#86(::Any, ::Any, ::Any, ::Any, ::Any, ::Any, ::Any, ::Type{T} where T, ::Type{T} where T, ::Type{T} where T, ::Type{T} where T, ::Any, ::Any, ::Any) at ./none:0
[10] (::Core.GeneratedFunctionStub)(::Any, ::Vararg{Any,N} where N) at ./boot.jl:527
[11] avxbug(::Array{Float64,2}, ::Array{Float64,1}) at /home/fredrikb/arl/bird_clustering/ica.jl:40
[12] avxbug() at /home/fredrikb/arl/bird_clustering/ica.jl:36
[13] top-level scope at none:1
It ran into problems because of the expression v = a * (1 - t * t). Neither a nor t depends on any of the loops. The following variants already work on that commit:
function avxnobug(U = randn(2,2), E1 = randn(2))
    t = 1
    a = 1.0
    _s = 0.0
    k, n = size(U)
    @avx for j = 1:k
        for i = 1:n
            u = tanh(a * U[i,j])
            U[i,j] = u
        end
        _s = a * (1 - t * t) * n
        E1[j] = _s / n
    end
    U, E1
end

function nobug(U = randn(2,2), E1 = randn(2))
    t = 1
    a = 1.0
    _s = 0.0
    k, n = size(U)
    for j = 1:k
        for i = 1:n
            u = tanh(a * U[i,j])
            v = a * (1 - t * t)
            U[i,j] = u
            _s += v
        end
        E1[j] = _s / n
    end
    U, E1
end
U = randn(2,2); E1 = randn(2);
avxnobug(copy(U), copy(E1))
nobug(copy(U), copy(E1))
yielding
julia> U = randn(2,2); E1 = randn(2);
julia> avxnobug(copy(U), copy(E1))
([0.24030644201158394 0.8361737654653512; 0.6322427919588169 0.9160705807364357], [0.0, 0.0])
julia> nobug(copy(U), copy(E1))
([0.2403064420115839 0.8361737654653512; 0.632242791958817 0.9160705807364357], [0.0, 0.0])
Maybe I should eventually handle this by having it recognize that summing a constant n times can be replaced with multiplying the constant by n.
But that's no reason for it not to work now, so I fixed it and added it to the tests.
The specific error you saw occurred because it was evaluating a loop nest where one of the loops was not actually used for anything (i.e., for i in ... with i not appearing anywhere in the body). That came about as follows: U[i,j] is indexed most efficiently along i and E1[j] along j, so it decided to split these up into separate loops. The loop loading from and filling U[i,j] of course used both i and j, but the loop filling E1 did not use i. Normally, this would have meant discarding the i loop altogether; but because of the accumulation of _s in that loop, it couldn't discard the loop despite i being unused. Hence the reported error.
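Concretely, the split it aims for would look something like this (my sketch of the strategy described above, not the actual generated code):

# One nest that touches U along both i and j:
for j = 1:k, i = 1:n
    U[i,j] = tanh(a * U[i,j])
end
# ...and a second nest filling E1, where the i loop survives only
# because of the _s accumulation, even though i itself is unused:
for j = 1:k
    for i = 1:n
        _s += a * (1 - t * t)
    end
    E1[j] = _s / n
end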
I don't know if this was the product of code-gen or a minimal broken example:
for i = 1:n
    v = a * (1 - t * t)
    _s += v
end
If the former, maybe you can get it to hoist the repeated sum out of the loop (for the performance improvement)? If not, feel free to file another issue, and I can eventually work on that optimization. If the latter, hopefully this fixes the more complicated issue as well. I suspect this fix isn't too robust -- I just changed a couple of places to make it return a 0 instead of throwing errors when there were no arrays (and confirmed that it started producing the correct answer for this problem). It's likely to do things like placing code at the wrong level in a loop nest.
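If it is a minimal example, the hand-hoisted form of that loop is just (a sketch; summing a loop-invariant value n times is the same as one multiply):

v = a * (1 - t * t)  # loop-invariant summand
_s += n * v          # same result as adding v on each of the n iterations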
Thanks for the fix :)
I checked out LV#master to try it, but now I get another error (maybe the master branch is not expected to work without the master branch of one of the other related packages?):
function update_UE!(U::AbstractMatrix{T} = randn(10,10), E1::AbstractVector{T} = randn(10)) where T
    n, k = size(U)
    _s = zero(T)
    a = 1.0
    @avx for j = 1:k
        for i = 1:n
            t = tanh(a * U[i,j])
            U[i,j] = t
            _s += a * (1 - t^2)
        end
        E1[j] = _s / n
    end
    U, E1
end
update_UE!()
signal (11): Segmentation fault
in expression starting at none:0
jl_gc_pool_alloc at /home/fredrikb/julia/src/gc.c:1148
jl_gc_alloc_ at /home/fredrikb/julia/src/julia_internal.h:277 [inlined]
jl_gc_alloc at /home/fredrikb/julia/src/gc.c:3149
_new_array_ at /home/fredrikb/julia/src/array.c:94 [inlined]
_new_array at /home/fredrikb/julia/src/array.c:162 [inlined]
jl_alloc_array_1d at /home/fredrikb/julia/src/array.c:433
jl_uncompress_ir at /home/fredrikb/julia/src/dump.c:2689
retrieve_code_info at ./compiler/utilities.jl:110 [inlined]
InferenceState at ./compiler/inferencestate.jl:118
typeinf_ext at ./compiler/typeinfer.jl:568
typeinf_ext at ./compiler/typeinfer.jl:601
jfptr_typeinf_ext_21235 at /home/fredrikb/sys (unknown line)
jl_apply at /home/fredrikb/julia/src/julia.h:1690 [inlined]
jl_type_infer at /home/fredrikb/julia/src/gf.c:296
jl_generate_fptr at /home/fredrikb/julia/src/jitlayers.cpp:290
jl_compile_method_internal at /home/fredrikb/julia/src/gf.c:1964
jl_compile_method_internal at /home/fredrikb/julia/src/gf.c:1931 [inlined]
_jl_invoke at /home/fredrikb/julia/src/gf.c:2224 [inlined]
jl_apply_generic at /home/fredrikb/julia/src/gf.c:2398
#208 at /home/fredrikb/.julia/packages/Atom/8MnXm/src/eval.jl:136
unknown function (ip: 0x7feb92be62fc)
jl_apply at /home/fredrikb/julia/src/julia.h:1690 [inlined]
do_apply at /home/fredrikb/julia/src/builtins.c:655
jl_f__apply_latest at /home/fredrikb/julia/src/builtins.c:705
#invokelatest#1 at ./essentials.jl:710 [inlined]
invokelatest at ./essentials.jl:709 [inlined]
macro expansion at /home/fredrikb/.julia/packages/Media/ItEPc/src/dynamic.jl:24 [inlined]
eval at /home/fredrikb/.julia/packages/Atom/8MnXm/src/eval.jl:113
unknown function (ip: 0x7feb92b4bf4a)
jl_apply at /home/fredrikb/julia/src/julia.h:1690 [inlined]
do_apply at /home/fredrikb/julia/src/builtins.c:655
jl_f__apply_latest at /home/fredrikb/julia/src/builtins.c:705
#invokelatest#1 at ./essentials.jl:710
jl_apply at /home/fredrikb/julia/src/julia.h:1690 [inlined]
do_apply at /home/fredrikb/julia/src/builtins.c:655
invokelatest at ./essentials.jl:709
jl_apply at /home/fredrikb/julia/src/julia.h:1690 [inlined]
do_apply at /home/fredrikb/julia/src/builtins.c:655
macro expansion at /home/fredrikb/.julia/packages/Atom/8MnXm/src/eval.jl:41 [inlined]
#188 at ./task.jl:356
unknown function (ip: 0x7feb92b0360c)
jl_apply at /home/fredrikb/julia/src/julia.h:1690 [inlined]
start_task at /home/fredrikb/julia/src/task.c:707
unknown function (ip: (nil))
Allocations: 34334643 (Pool: 34322673; Big: 11970); GC: 40
If I instead put the @avx on the inner loop, it works, but becomes much slower:
function update_UE!(U::AbstractMatrix{T} = randn(10,10), E1::AbstractVector{T} = randn(10)) where T
    n, k = size(U)
    _s = zero(T)
    a = 1.0
    for j = 1:k
        @avx for i = 1:n
            t = tanh(a * U[i,j])
            U[i,j] = t
            _s += a * (1 - t^2)
        end
        E1[j] = _s / n
    end
    U, E1
end
using BenchmarkTools

U = randn(100,100)
E1 = randn(100)
@btime update_UE!($U, $E1)

10.912 ms (0 allocations: 0 bytes)  # with @avx
238.923 μs (0 allocations: 0 bytes) # without @avx
I can reproduce the crash.
But for the inner loop version, @avx is >15 times faster for me than the non-avx version.
Is it perhaps emitting an instruction that is fast on your CPU but very slow on my old one? I've heard that this can happen for fma, but maybe it happens for other instructions as well? I'm using an old i5-2500K.
I found out what is causing the crash. I'll try and fix that soon.
I've heard that this can happen for fma

That was exactly the problem. This should be fixed on SLEEFPirates master; a new version will merge into the registry soon. Please let me know if that helps.
I also added a SLEEFPirates.tanh_fast function which is much faster on my computer.
Sweet :) I can confirm that @avx makes it much faster on a modern CPU (Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz).
- I've tried SLEEFPirates#master but it does not appear to solve the problem on my slow CPU.
- Using SLEEFPirates.tanh_fast on the fast CPU makes the function about twice as fast.

Annotating the call to tanh with @fastmath t = tanh(a * U[i,j]) produces an error:
julia> @btime update_UE!($U,$E1);
ERROR: UndefVarError: mul_fast not defined
Stacktrace:
[1] update_UE!(::Array{Float64,2}, ::Array{Float64,1}) at ./REPL[13]:6
[2] ##core#455(::Array{Float64,2}, ::Array{Float64,1}) at /home/fredrikb/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:371
[3] ##sample#456(::BenchmarkTools.Parameters) at /home/fredrikb/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:377
[4] _run(::BenchmarkTools.Benchmark{Symbol("##benchmark#454")}, ::BenchmarkTools.Parameters; verbose::Bool, pad::String, kwargs::Base.Iterators.Pairs{Symbol,Integer,NTuple{4,Symbol},NamedTuple{(:samples, :evals, :gctrial, :gcsample),Tuple{Int64,Int64,Bool,Bool}}}) at /home/fredrikb/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:405
[5] (::Base.var"#inner#2"{Base.Iterators.Pairs{Symbol,Integer,NTuple{5,Symbol},NamedTuple{(:verbose, :samples, :evals, :gctrial, :gcsample),Tuple{Bool,Int64,Int64,Bool,Bool}}},typeof(BenchmarkTools._run),Tuple{BenchmarkTools.Benchmark{Symbol("##benchmark#454")},BenchmarkTools.Parameters}})() at ./essentials.jl:715
[6] #invokelatest#1 at ./essentials.jl:716 [inlined]
[7] #run_result#37 at /home/fredrikb/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:32 [inlined]
[8] run(::BenchmarkTools.Benchmark{Symbol("##benchmark#454")}, ::BenchmarkTools.Parameters; progressid::Nothing, nleaves::Float64, ndone::Float64, kwargs::Base.Iterators.Pairs{Symbol,Integer,NTuple{5,Symbol},NamedTuple{(:verbose, :samples, :evals, :gctrial, :gcsample),Tuple{Bool,Int64,Int64,Bool,Bool}}}) at /home/fredrikb/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:94
[9] #warmup#45 at /home/fredrikb/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:141 [inlined]
[10] warmup(::BenchmarkTools.Benchmark{Symbol("##benchmark#454")}) at /home/fredrikb/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:141
[11] top-level scope at /home/fredrikb/.julia/packages/BenchmarkTools/eCEpo/src/execution.jl:481
BTW, numbers like these are awesome :smiley:
# On the fast CPU
130.192 μs (1 allocation: 32 bytes) # Baseline
56.411 μs (1 allocation: 32 bytes) # @avx
52.914 μs (1 allocation: 32 bytes) # SLEEFPirates master
21.719 μs (1 allocation: 32 bytes) # SLEEFPirates.tanh_fast
does not appear to solve the problem on my slow CPU.

Okay, checking:
julia> sx = LoopVectorization.SVec(ntuple(Val(4)) do i Core.VecElement(rand()) end)
SVec{4,Float64}<0.31614194386076955, 0.5547664156127883, 0.9335191574787935, 0.8287090853244645>
julia> @code_llvm debuginfo=:none tanh(sx)
I suspect the problem is:
%res.i118 = ashr <4 x i64> %res.i188, <i64 1, i64 1, i64 1, i64 1>
%res.i112 = shl <4 x i64> %res.i118, <i64 52, i64 52, i64 52, i64 52>
%res.i108 = add <4 x i64> %res.i112, <i64 4607182418800017408, i64 4607182418800017408, i64 4607182418800017408, i64 4607182418800017408>
%res.i102 = bitcast <4 x i64> %res.i108 to <4 x double>
%res.i101 = fmul <4 x double> %res.i125, %res.i102
%res.i100 = sub nsw <4 x i64> %res.i188, %res.i118
%res.i99 = shl <4 x i64> %res.i100, <i64 52, i64 52, i64 52, i64 52>
%res.i95 = add <4 x i64> %res.i99, <i64 4607182418800017408, i64 4607182418800017408, i64 4607182418800017408, i64 4607182418800017408>
You need AVX2 for SIMD integers.
Mind trying Julia nightly? With Base.libllvm_version ≥ v"9", it'll use a different version that probably won't really be faster on AVX2 CPUs, but it doesn't have the integer arithmetic. On master (but not on the latest release) I also replaced the @llvm.fma with @llvm.fmuladd, so it might give better performance on your old CPU.
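For reference, the version check mentioned above can be run directly in the REPL:

Base.libllvm_version ≥ v"9"  # nightly builds should satisfy this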
Annotating the call to tanh with @fastmath t = tanh(a * U[i,j]) produces an error
@fastmath isn't really supported at the moment (and SLEEFPirates.tanh_fast !== Base.FastMath.tanh_fast, although adding that overload may be a good idea), but it shouldn't take too much effort to have it speed up a few functions (like tanh) and not error on others (like multiplication).
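For what it's worth, the overload mentioned above would presumably be a one-liner along these lines (hypothetical; it is not currently defined):

using LoopVectorization, SLEEFPirates
# Hypothetical method, so that @fastmath's rewrite of tanh to
# Base.FastMath.tanh_fast reaches the fast SIMD version:
Base.FastMath.tanh_fast(x::LoopVectorization.SVec) = SLEEFPirates.tanh_fast(x)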
BTW, numbers like these are awesome :smiley:
Thanks, SIMD special functions are such an easy and big win, it's a shame they're not more widely available/used.
Great stuff! On Julia nightly and SLEEFPirates/LV#master, I get the following for the slow CPU:
# Float64
237.160 μs (0 allocations: 0 bytes) # baseline
131.666 μs (0 allocations: 0 bytes) # avx on inner loop
50.324 μs (0 allocations: 0 bytes) # avx + tanh_fast
# Float32
166.900 μs (0 allocations: 0 bytes) # baseline
50.082 μs (0 allocations: 0 bytes) # avx on inner loop
20.320 μs (0 allocations: 0 bytes) # avx + tanh_fast
Huge improvement :) And thanks for the clarification on @fastmath
Great!
tanh_fast should be the same across Julia versions on the same computer. The definition is simply:
@inline function tanh_fast(x)
    exp2x = exp(x + x)
    (exp2x - one(x)) / (exp2x + one(x))
end
Even though exp is accurate, tanh_fast will suffer from rounding error relative to tanh. But exp is very fast, especially on Linux (where it will try to use a definition from GLIBC's libmvec.so), and should be the same across Julia versions.
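This is just the identity tanh(x) = (exp(2x) - 1)/(exp(2x) + 1). A quick way to gauge the rounding error on a scalar (my sketch; the exact discrepancy depends on x and the exp implementation):

using SLEEFPirates
x = 0.5
abs(SLEEFPirates.tanh_fast(x) - tanh(x))  # expect a small nonzero difference
eps(tanh(x))                              # one ulp at this value, for scale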
Crash fixed and tested.
The splitintonoloop test doesn't work:
function splitintonoloop(U = randn(2,2), E1 = randn(2))
    t = 1
    a = 1.0
    _s = 0.0
    n, k = size(U)
    @avx for j = 1:k
        for i = 1:n
            u = tanh(a * U[i,j])
            v = a * (1 - t * t)
            U[i,j] = u
            _s += v
        end
        E1[j] = _s / n
    end
    U, E1
end
With t = 1, 1 - t^2 == 0, so the test didn't catch that it was giving bogus answers.
The problem is that the value of _s on each iteration of the j loop depends on the previous iteration, so the loops cannot be reordered.
Simpler example:
function splitintonoloop(E1, n)
    t = 0.5
    a = 1.0
    _s = 0.0
    k = length(E1)
    @avx for j = 1:k
        for i = 1:n
            v = a * (1 - t * t)
            _s += v
        end
        E1[j] = _s / n
    end
    E1
end
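For reference, the scalar semantics of this example reduce to a closed form (my sketch): v = a * (1 - t * t) is constant, and the inner loop adds it n times per outer iteration, so after the j-th iteration _s == j*n*v and the correct answer is E1[j] == j*v:

# Scalar reference implementation for the example above:
function splitintonoloop_ref(E1, n)
    t = 0.5
    a = 1.0
    v = a * (1 - t * t)   # == 0.75, loop-invariant
    for j = 1:length(E1)
        E1[j] = j * v     # == (j * n * v) / n
    end
    E1
end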
I fixed this on the vecunroll branch.
The diff there is currently at around +1100/-700, and it'll probably be a few more days before other tests pass and it's ready to merge.