LoopVectorization.jl
handle loop-generating expressions
Generator expressions could be transformed into traditional loops to make the following work:
acc = 0
@avx for i in 1:10
    acc += sum(i+j for j in 1:5)
end
or for ntuple():
acc = 0
@avx for i in 1:10
    acc += sum(ntuple(j->i+j, Val(5)))
end
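For the sum version, the transformation I have in mind is just plain nested loops (a hand-written sketch of the desired lowering; the ntuple version would lower to the same structure, since for a fixed Val(5) the tuple is built eagerly):
acc = 0
for i in 1:10
    for j in 1:5
        acc += i + j
    end
end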
Edit: I guess this is related to handling anonymous functions, but at least the generator expression is directly available in the AST.
It'd be easy if I could special case sum by knowing it is adding, but what about the more general foo?
acc = 0
@avx for i in 1:10
    acc += foo(i+j for j in 1:5)
end
How can I convert it into two nested loops?
Basically, while I can see the generator expression, a current limitation (until it's able to integrate more with Julia's compiler) is that it cannot introspect into functions. So it can't know anything about what foo is doing to this generator. It's a black box.
Perhaps it would be better not to turn it into two nested loops, but instead to keep a single loop and extract the generator, turning it into an anonymous function whose arguments are any variables encountered so far:
julia> f = i -> sum(i+j for j in 1:5)
#12 (generic function with 1 method)
julia> f(1)
20
julia> f(_MM{8}(1)) # LoopVectorization represents vectorized loop induction variables with `_MM`
SVec{8,Int64}<20, 25, 30, 35, 40, 45, 50, 55>
Similar could go for ntuple. So as long as foo is generic enough to handle this, it should still be able to get some benefit.
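To make that concrete, here is a rough sketch (not actual macro output) of what extracting the generator while keeping a single @avx loop could look like; f is just the illustrative closure from the REPL example above:
acc = 0
f = i -> sum(i + j for j in 1:5)  # extracted generator
@avx for i in 1:10
    acc += f(i)
end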
I agree that foo(i+j for j in 1:5) cannot be turned into a nested loop without introspection into foo, but there are a bunch of foldl-type reduction functions, so any of the following could be detected:
sum(genexp)
prod(genexp)
minimum(genexp)
maximum(genexp)
count(genexp)
reduce(op, genexp) # and foldl / foldr
mapreduce(f, op, genexp) # and mapfoldl / mapfoldr
where the last two include special variants like mapfoldl/mapfoldr.
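As a hand-written sketch (not something LoopVectorization emits today) of what such a detected rewrite could look like, take minimum(genexp); the op determines the neutral element the inner accumulator starts from, which is why only known reductions can be handled this way:
acc = typemax(Int)
for i in 1:10
    inner = typemax(Int)  # neutral element for min; zero for sum, one for prod, ...
    for j in 1:5
        inner = min(inner, i + j)
    end
    acc = min(acc, inner)
end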
Your second approach is basically about supporting user-defined functions, as long as they are generic enough to accept vectorized arguments, right? For small generators that is probably better; for longer ones I'm not sure. I thought that LoopVectorization benefits from more explicit loop information.
> Your second approach is basically about supporting user-defined functions, as long as they are generic enough to accept vectorized arguments, right?
Those should work right now, but they're treated as opaque, and assumed to be slow enough that any fancy transformations won't be profitable.
Functions like log aren't opaque, but are generally expensive enough for that to be the case anyway. When the functions are expensive, SIMD can help a lot, even without fancy transformations:
julia> using LinearAlgebra, BenchmarkTools, LoopVectorization
julia> U = cholesky(Symmetric(rand(200,300) |> x -> x * x')).U;
julia> function trianglelogdet(A)
           s = zero(eltype(A))
           @avx for n ∈ axes(A,1)
               s += log(A[n,n])
           end
           s
       end
trianglelogdet (generic function with 1 method)
julia> @btime trianglelogdet($U)
243.315 ns (0 allocations: 0 bytes)
279.85258637906907
julia> @btime logdet($U)
1.178 μs (0 allocations: 0 bytes)
279.85258637906907
julia> @btime sum(log, diag($U))
1.257 μs (1 allocation: 1.77 KiB)
279.85258637906907
julia> s = 0.0; LoopVectorization.@avx_debug for n ∈ axes(U,1)
           s += log(U[n,n])
       end
OPS = Tuple{:LoopVectorization,:LOOPCONSTANTINSTRUCTION,LoopVectorization.OperationStruct(0x0000000000000000, 0x0000000000000000, 0x0000000000000001, 0x0000000000000000, LoopVectorization.constant, 0x00, 0x01),:LoopVectorization,:getindex,LoopVectorization.OperationStruct(0x0000000000000011, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x01, 0x02),:LoopVectorization,:log,LoopVectorization.OperationStruct(0x0000000000000001, 0x0000000000000000, 0x0000000000000000, 0x0000000000000002, LoopVectorization.compute, 0x00, 0x03),:LoopVectorization,:vadd,LoopVectorization.OperationStruct(0x0000000000000001, 0x0000000000000001, 0x0000000000000000, 0x0000000000000103, LoopVectorization.compute, 0x00, 0x01)}
ARF = Tuple{LoopVectorization.ArrayRefStruct{:U,Symbol("##vptr##_U")}(0x0000000000000101, 0x0000000000000101, 0x0000000000000000)}
AM = Tuple{0,Tuple{4},Tuple{1},Tuple{},Tuple{},Tuple{},Tuple{}}
LPSYM = Tuple{:n}
LB = Tuple{VectorizationBase.StaticLowerUnitRange{1}}
vargs = (VectorizationBase.PackedStridedPointer{Float64,1}(Ptr{Float64} @0x000055adfdbc5940, (200,)), 0.0)
begin
    $(Expr(:meta, :inline))
    begin
        var"##n_loopstop#329" = (#= /home/chriselrod/.julia/dev/LoopVectorization/src/reconstruct_loopset.jl:33 =# @inbounds(lb[1])).U
        var"##vptr##_U" = #= /home/chriselrod/.julia/dev/LoopVectorization/src/reconstruct_loopset.jl:109 =# @inbounds(vargs[1])
        var"##symlicm#334" = #= /home/chriselrod/.julia/dev/LoopVectorization/src/reconstruct_loopset.jl:181 =# @inbounds(vargs[2])
        var"##Tloopeltype##" = eltype(var"##vptr##_U")
        var"##Wvecwidth##" = LoopVectorization.pick_vector_width_val(var"##Tloopeltype##")
        var"####op#330_" = var"##symlicm#334"
        var"##mask##" = LoopVectorization.mask(var"##Wvecwidth##", var"##n_loopstop#329")
        var"####op#330_0" = LoopVectorization.vzero(var"##Wvecwidth##", typeof(var"####op#330_"))
    end
    begin
        n = LoopVectorization._MM(var"##Wvecwidth##", 1)
        while LoopVectorization.scalar_less(n, var"##n_loopstop#329" - LoopVectorization.valsub(var"##Wvecwidth##", 2))
            var"####op#331_0" = LoopVectorization.vload(var"##vptr##_U", (n, n))
            var"####op#332_0" = LoopVectorization.log(var"####op#331_0")
            var"####op#330_0" = LoopVectorization.vadd(var"####op#330_0", var"####op#332_0")
            n = LoopVectorization.valadd(var"##Wvecwidth##", n)
        end
        if LoopVectorization.scalar_less(n, var"##n_loopstop#329" + 1)
            var"####op#331_0" = LoopVectorization.vload(var"##vptr##_U", (n, n), var"##mask##")
            var"####op#332_0" = LoopVectorization.log(var"####op#331_0")
            var"####op#330_0" = LoopVectorization.vifelse(var"##mask##", LoopVectorization.vadd(var"####op#330_0", var"####op#332_0"), var"####op#330_0")
            n = LoopVectorization.valadd(var"##Wvecwidth##", n)
        end
    end
end
julia> mylog(x) = log(x)
mylog (generic function with 1 method)
julia> function trianglemylogdet(A)
           s = zero(eltype(A))
           @avx for n ∈ 1:size(A,1)
               s += mylog(A[n,n])
           end
           s
       end
trianglemylogdet (generic function with 1 method)
julia> @btime trianglemylogdet($U)
237.071 ns (0 allocations: 0 bytes)
279.85258637906907
But yeah, this, like all your examples, will be a lot slower than if I could translate them into loops, unless it's a mapreduce where the map op is something expensive like log.
Reductions without expensive map operations need to be unrolled for performance, because the reduction introduces a dependency chain in the accumulator. Unrolling breaks up that dependency chain, because it's then split up among multiple accumulators.
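As a hand-written illustration (not what @avx emits), a 4-way unrolled sum keeps four independent chains of additions in flight instead of one serial chain:
# Manual 4-way unroll of sum(x); assumes length(x) is a multiple of 4,
# so the remainder handling is omitted for clarity.
function sum4(x)
    s1 = s2 = s3 = s4 = zero(eltype(x))
    @inbounds for i in 1:4:length(x)
        s1 += x[i]
        s2 += x[i+1]
        s3 += x[i+2]
        s4 += x[i+3]
    end
    (s1 + s2) + (s3 + s4)
end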
Adding support for mapreduce in the loops (and treating sum and friends like special cases) would be a good idea.
Yes, for mapreduce it would be a good fit; mapfoldl/mapfoldr, on the other hand, make promises about reduction order, so there one should probably be careful.
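A quick way to see why that ordering promise matters: Float64 addition isn't associative, so a strictly left-to-right foldl and a reassociated (unrolled/SIMD) reduction generally agree only up to rounding:
x = rand(10_000)
foldl(+, x) ≈ sum(x)   # true: same value up to rounding
foldl(+, x) == sum(x)  # usually false: sum is free to reassociate (pairwise/SIMD)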