
handle loop-generating expressions

Open stev47 opened this issue 4 years ago • 4 comments

Generator expressions could be transformed into traditional loops to make the following work:

acc = 0
@avx for i in 1:10
    acc += sum(i+j for j in 1:5)
end

or for ntuple():

acc = 0
@avx for i in 1:10
    acc += sum(ntuple(j->i+j, Val(5)))
end

Edit: I guess this is related to handling anonymous functions, but at least the generator expression is directly available in the AST.

stev47 avatar Apr 27 '20 20:04 stev47

It'd be easy if I could special-case sum by knowing that it is adding, but what about a more general foo?

acc = 0
@avx for i in 1:10
    acc += foo(i+j for j in 1:5)
end

How can I convert that into two nested loops?

Basically, while I can see the generator expression, a current limitation (until it's able to integrate more with Julia's compiler) is that it cannot introspect into functions. So it can't know anything about what foo is doing to this generator. It's a black box.

Perhaps it would be better not to turn it into two nested loops, but just a single loop and extract the generator, turning it into an anonymous function whose arguments are any variables encountered so far:

julia> f = i -> sum(i+j for j in 1:5)
#12 (generic function with 1 method)

julia> f(1)
20

julia> f(_MM{8}(1)) # LoopVectorization represents vectorized loop induction variables with `_MM`
SVec{8,Int64}<20, 25, 30, 35, 40, 45, 50, 55>

Similar could go for ntuple.

So as long as foo is generic enough to handle this, it should still be able to get some benefit.
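As an illustration of what "generic enough" means here (this foo is hypothetical, not from the thread): as long as the function only uses operations that are defined for both scalars and SIMD vector types, passing a vectorized induction variable through the extracted closure could still work.

```julia
# Hypothetical user-defined `foo`: it only uses `sum` and `+`, which are
# defined for scalars and (in principle) for SIMD vector types alike.
foo(itr) = sum(itr) + 1

# The generator extracted into a closure over the loop variable `i`:
g = i -> foo(i + j for j in 1:5)

g(1)  # 21 for a scalar argument; a vectorized `_MM` argument would
      # produce a vector of per-lane results, as in the `f` example above
```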

chriselrod avatar Apr 29 '20 05:04 chriselrod

I agree that foo(i+j for j in 1:5) cannot be turned into a nested loop without introspection into foo, but there are a bunch of foldl-type reduction functions, so any of the following could be detected:

sum(genexp)
prod(genexp)
minimum(genexp)
maximum(genexp)
count(genexp)
reduce(op, genexp)       # and the ordered variants foldl / foldr
mapreduce(f, op, genexp) # and mapfoldl / mapfoldr
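For a detected reduction like sum, the transformation sketched in the original post would amount to emitting an explicit nested loop with an inner accumulator (a sketch of the idea, not actual macro output):

```julia
# What `acc += sum(i+j for j in 1:5)` inside `for i in 1:10` could
# lower to once the `sum` reduction is recognized:
acc = 0
for i in 1:10
    t = 0              # inner accumulator for the detected `sum`
    for j in 1:5
        t += i + j
    end
    acc += t
end
acc  # same result as the generator version
```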

Your second approach is basically about supporting user-defined functions, as long as they are generic enough to accept vectorized arguments, right? For small generators that is probably better; for longer ones I'm not sure, since I thought LoopVectorization benefits from more explicit loop information.

stev47 avatar Apr 29 '20 07:04 stev47

Your second approach is basically about supporting user-defined functions, as long as they are generic enough to accept vectorized arguments, right?

Those should work right now, but they're treated as opaque and assumed to be slow enough that any fancy transformations won't be profitable. Functions like log aren't opaque, but are generally expensive enough for that to be the case anyway. When the functions are expensive, SIMD can help a lot even without fancy transformations:

julia> using LinearAlgebra, BenchmarkTools, LoopVectorization

julia> U = cholesky(Symmetric(rand(200,300) |> x -> x * x')).U;

julia> function trianglelogdet(A)
           s = zero(eltype(A))
           @avx for n ∈ axes(A,1)
               s += log(A[n,n])
           end
           s
       end
trianglelogdet (generic function with 1 method)

julia> @btime trianglelogdet($U)
  243.315 ns (0 allocations: 0 bytes)
279.85258637906907

julia> @btime logdet($U)
  1.178 μs (0 allocations: 0 bytes)
279.85258637906907

julia> @btime sum(log, diag($U))
  1.257 μs (1 allocation: 1.77 KiB)
279.85258637906907

julia> s = 0.0; LoopVectorization.@avx_debug for n ∈ axes(U,1)
               s += log(U[n,n])
           end
OPS = Tuple{:LoopVectorization,:LOOPCONSTANTINSTRUCTION,LoopVectorization.OperationStruct(0x0000000000000000, 0x0000000000000000, 0x0000000000000001, 0x0000000000000000, LoopVectorization.constant, 0x00, 0x01),:LoopVectorization,:getindex,LoopVectorization.OperationStruct(0x0000000000000011, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x01, 0x02),:LoopVectorization,:log,LoopVectorization.OperationStruct(0x0000000000000001, 0x0000000000000000, 0x0000000000000000, 0x0000000000000002, LoopVectorization.compute, 0x00, 0x03),:LoopVectorization,:vadd,LoopVectorization.OperationStruct(0x0000000000000001, 0x0000000000000001, 0x0000000000000000, 0x0000000000000103, LoopVectorization.compute, 0x00, 0x01)}
ARF = Tuple{LoopVectorization.ArrayRefStruct{:U,Symbol("##vptr##_U")}(0x0000000000000101, 0x0000000000000101, 0x0000000000000000)}
AM = Tuple{0,Tuple{4},Tuple{1},Tuple{},Tuple{},Tuple{},Tuple{}}
LPSYM = Tuple{:n}
LB = Tuple{VectorizationBase.StaticLowerUnitRange{1}}
vargs = (VectorizationBase.PackedStridedPointer{Float64,1}(Ptr{Float64} @0x000055adfdbc5940, (200,)), 0.0)
begin
    $(Expr(:meta, :inline))
    begin
        var"##n_loopstop#329" = (#= /home/chriselrod/.julia/dev/LoopVectorization/src/reconstruct_loopset.jl:33 =# @inbounds(lb[1])).U
        var"##vptr##_U" = #= /home/chriselrod/.julia/dev/LoopVectorization/src/reconstruct_loopset.jl:109 =# @inbounds(vargs[1])
        var"##symlicm#334" = #= /home/chriselrod/.julia/dev/LoopVectorization/src/reconstruct_loopset.jl:181 =# @inbounds(vargs[2])
        var"##Tloopeltype##" = eltype(var"##vptr##_U")
        var"##Wvecwidth##" = LoopVectorization.pick_vector_width_val(var"##Tloopeltype##")
        var"####op#330_" = var"##symlicm#334"
        var"##mask##" = LoopVectorization.mask(var"##Wvecwidth##", var"##n_loopstop#329")
        var"####op#330_0" = LoopVectorization.vzero(var"##Wvecwidth##", typeof(var"####op#330_"))
    end
    begin
        n = LoopVectorization._MM(var"##Wvecwidth##", 1)
        while LoopVectorization.scalar_less(n, var"##n_loopstop#329" - LoopVectorization.valsub(var"##Wvecwidth##", 2))
            var"####op#331_0" = LoopVectorization.vload(var"##vptr##_U", (n, n))
            var"####op#332_0" = LoopVectorization.log(var"####op#331_0")
            var"####op#330_0" = LoopVectorization.vadd(var"####op#330_0", var"####op#332_0")
            n = LoopVectorization.valadd(var"##Wvecwidth##", n)
        end
        if LoopVectorization.scalar_less(n, var"##n_loopstop#329" + 1)
            var"####op#331_0" = LoopVectorization.vload(var"##vptr##_U", (n, n), var"##mask##")
            var"####op#332_0" = LoopVectorization.log(var"####op#331_0")
            var"####op#330_0" = LoopVectorization.vifelse(var"##mask##", LoopVectorization.vadd(var"####op#330_0", var"####op#332_0"), var"####op#330_0")
            n = LoopVectorization.valadd(var"##Wvecwidth##", n)
        end
    end
end

julia> mylog(x) = log(x)
mylog (generic function with 1 method)

julia> function trianglemylogdet(A)
           s = zero(eltype(A))
           @avx for n ∈ 1:size(A,1)
               s += mylog(A[n,n])
           end
           s
       end
trianglemylogdet (generic function with 1 method)

julia> @btime trianglemylogdet($U)
  237.071 ns (0 allocations: 0 bytes)
279.85258637906907

But yeah, this approach with all your examples will be a lot slower than if I could translate them into loops, unless it's a mapreduce where the map op is something expensive like log. Reductions without expensive map operations need to be unrolled for performance, because the reduction introduces a dependency chain through the accumulator. Unrolling breaks up that dependency chain by splitting it among multiple accumulators.
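To illustrate the point about dependency chains (a hand-written sketch; LoopVectorization performs this unrolling automatically), splitting a sum across several independent accumulators lets the CPU overlap the additions instead of waiting on each previous result:

```julia
# Sum with 4 independent accumulators: each `+=` depends only on its own
# accumulator, breaking the single loop-carried dependency chain of a
# naive `s += x[i]` loop.
function sum4(x)
    s1 = s2 = s3 = s4 = zero(eltype(x))
    n = length(x) - rem(length(x), 4)
    for i in 1:4:n
        s1 += x[i]; s2 += x[i+1]; s3 += x[i+2]; s4 += x[i+3]
    end
    s = s1 + s2 + s3 + s4
    for i in n+1:length(x)   # scalar remainder loop
        s += x[i]
    end
    s
end
```

Note that for floating-point inputs this reassociates the additions, so results can differ from a strictly sequential sum in the last bits.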

Adding support for mapreduce in the loops (and treating sum and friends like special cases) would be a good idea.

chriselrod avatar May 02 '20 17:05 chriselrod

Yes, mapreduce would be a good fit; mapfoldl/mapfoldr, on the other hand, make promises about reduction order, so one should probably be careful there.
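The ordering promise is easy to see with a non-associative operator: foldl and foldr give different answers, so reassociating either of them for SIMD would change the result, whereas reduce explicitly leaves the association unspecified.

```julia
# Subtraction is not associative, so the guaranteed evaluation order
# of foldl/foldr is observable:
foldl(-, 1:4)  # ((1 - 2) - 3) - 4 == -8
foldr(-, 1:4)  # 1 - (2 - (3 - 4)) == -2
```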

stev47 avatar May 02 '20 19:05 stev47