LoopVectorization.jl
Performance of convolutions over 3-tuples
Ref: https://github.com/JuliaSIMD/LoopVectorization.jl/pull/341
@timholy (Feel free to also share your benchmark code)
I can run benchmarks on a few different computers: besides the machine below, I also own Skylake-X/Cascadelake (Skylake-era AVX512 with 2x FMA), Haswell (AVX2), and Apple M1 systems.
For now, I am running on:
julia> versioninfo()
Julia Version 1.8.0-DEV.593
Commit 53aa65ad3c* (2021-09-24 19:45 UTC)
Platform Info:
OS: Linux (x86_64-redhat-linux)
CPU: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
This CPU has AVX512 with 1x FMA on port 0; in addition, it has vector shuffles on port 5.
Contrast this with my Skylake-X/Cascadelake CPUs, which have vector shuffles on port 5, and FMAs on both ports 0 and 5. What this means is that on Skylake-X/Cascadelake, performing vector shuffles is using resources that could otherwise be spent on FMA. But on Tiger Lake, they're independent / fully parallel.
So on Tiger Lake we would ideally see comparable flops between greyscale (which requires vector moves and FMA) and array-of-structs [colors] (which requires vector moves, shuffles, and FMA).
Functions:
using LoopVectorization, OffsetArrays, BenchmarkTools
function smoothdim_kernel_tile!(out, z, src::AbstractArray, kernel::AbstractVector, Rpre::CartesianIndices, axout_tile, Rpost::CartesianIndices, axkernel = axes(kernel, 1))
    # z is an "overflow-safe zero". This is needed in cases where both `src` and `kernel` have eltypes vulnerable to arithmetic overflow.
    @fastmath @inbounds for Ipost in Rpost
        for i in axout_tile
            for Ipre in Rpre
                tmp = convert(eltype(out), z)
                for j in axkernel
                    tmp += oftype(z, src[Tuple(Ipre)...,i+j,Tuple(Ipost)...])*kernel[j]
                end
                out[Tuple(Ipre)...,i,Tuple(Ipost)...] = tmp
            end
        end
    end
    out
end
function smoothdim_kernel_tile_sa!(out, z, src::AbstractArray, kernel::AbstractVector, Rpre::CartesianIndices, axout_tile, Rpost::CartesianIndices, axkernel = LoopVectorization.ArrayInterface.static(-6):LoopVectorization.ArrayInterface.static(6))
    # z is an "overflow-safe zero". This is needed in cases where both `src` and `kernel` have eltypes vulnerable to arithmetic overflow.
    @fastmath @inbounds for Ipost in Rpost
        for i in axout_tile
            for Ipre in Rpre
                tmp = convert(eltype(out), z)
                @simd ivdep for j in axkernel
                    tmp += oftype(z, src[Tuple(Ipre)...,i+j,Tuple(Ipost)...])*kernel[j]
                end
                out[Tuple(Ipre)...,i,Tuple(Ipost)...] = tmp
            end
        end
    end
    out
end
function smoothdim_kernel_tile_avx_inner!(out, z, src::AbstractArray, kernel::AbstractVector, Rpre::CartesianIndices, axout_tile, Rpost::CartesianIndices)
    axkernel = axes(kernel, 1)
    zout = convert(eltype(out), z)
    for Ipost in Rpost
        for i in axout_tile
            @turbo for Ipre in Rpre
                tmp = zout
                # tmp = convert(eltype(out), z) # failing to hoist this leads to an "UndefVarError: tmp not defined"
                for j in axkernel
                    tmp += oftype(z, src[Ipre,i+j,Ipost])*kernel[j]
                end
                out[Ipre,i,Ipost] = tmp
            end
        end
    end
    out
end
function smoothdim_kernel_tile_avx_middle!(out, z, src::AbstractArray, kernel::AbstractVector, Rpre::CartesianIndices, axout_tile, Rpost::CartesianIndices)
    axkernel = axes(kernel, 1)
    zout = convert(eltype(out), z)
    for Ipost in Rpost
        @turbo for i in axout_tile
            for Ipre in Rpre
                tmp = zout
                # tmp = convert(eltype(out), z) # failing to hoist this leads to an "UndefVarError: tmp not defined"
                for j in axkernel
                    tmp += oftype(z, src[Ipre,i+j,Ipost])*kernel[j]
                end
                out[Ipre,i,Ipost] = tmp
            end
        end
    end
    out
end
function smoothdim_kernel_tile_avx_outer!(out, z, src::AbstractArray, kernel::AbstractVector, Rpre::CartesianIndices, axout_tile, Rpost::CartesianIndices)
    axkernel = axes(kernel, 1)
    zout = convert(eltype(out), z)
    @turbo for Ipost in Rpost
        for i in axout_tile
            for Ipre in Rpre
                tmp = zout
                # tmp = convert(eltype(out), z) # failing to hoist this leads to an "UndefVarError: tmp not defined"
                for j in axkernel
                    tmp += oftype(z, src[Ipre,i+j,Ipost])*kernel[j]
                end
                out[Ipre,i,Ipost] = tmp
            end
        end
    end
    out
end
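As an aside, here is a quick illustration of what the overflow-safe zero protects against. This example assumes FixedPointNumbers' N0f8 (the 8-bit element type commonly used for images); it is separate from the benchmarks below:
julia> using FixedPointNumbers

julia> x = 0.8N0f8;

julia> x + x    # 8-bit fixed-point arithmetic wraps on overflow
0.596N0f8

julia> oftype(0.0f0, x) + oftype(0.0f0, x)    # accumulating via z = 0.0f0 promotes to Float32
1.6f0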
The reason for the Tuple(Ipost)...-style indexing is that it was needed for type stability with 5-d arrays.
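For reference, the splatted form is equivalent to indexing with the CartesianIndexes directly; a quick REPL check:
julia> A = rand(3, 4, 5);

julia> Ipre, Ipost = CartesianIndex(2), CartesianIndex(4);

julia> A[Tuple(Ipre)..., 3, Tuple(Ipost)...] == A[Ipre, 3, Ipost] == A[2, 3, 4]
true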
Note that the middle and inner variants of the @turbo functions need some work for type stability, too, but this should be fixable within LoopVectorization.
This is probably a new regression in Julia 1.8.
Benchmark code for d = 1, first for Float32 and then for NTuple{3,Float32}:
T = Float32; M = 11; σ = 3;
ax = OffsetArray(-6:6, -6:6);
kernel = float(T)[exp(-i^2/(2*σ^2)) for i in ax]; # 1-d Gaussian kernel
kernel ./= sum(kernel);
srcax = 1+firstindex(ax):M+lastindex(ax);
Mpad = length(srcax);
src = OffsetArray(rand(T,Mpad,Mpad,Mpad),srcax,srcax,srcax);
dest1 = Array{float(T),3}(undef,M,M,M);
dest2 = similar(dest1);
d=1;
Rpre = CartesianIndices(axes(dest1)[1:d-1]);
Rpost = CartesianIndices(axes(dest1)[d+1:end]);
smoothdim_kernel_tile!( dest1, float(zero(T)), src, kernel, Rpre, axes(dest1, d), Rpost);
smoothdim_kernel_tile_avx_inner!(dest2, float(zero(T)), src, kernel, Rpre, axes(dest2, d), Rpost);
dest1 ≈ dest2
@benchmark smoothdim_kernel_tile!( $dest1, $(float(zero(T))), $src, $kernel, $Rpre, axes($dest1, $d), $Rpost)
@benchmark smoothdim_kernel_tile_sa!( $dest1, $(float(zero(T))), $src, $kernel, $Rpre, axes($dest1, $d), $Rpost)
@benchmark smoothdim_kernel_tile_avx_inner!($dest2, $(float(zero(T))), $src, $kernel, $Rpre, axes($dest2, $d), $Rpost)
@benchmark smoothdim_kernel_tile_avx_middle!($dest2, $(float(zero(T))), $src, $kernel, $Rpre, axes($dest2, $d), $Rpost)
@benchmark smoothdim_kernel_tile_avx_outer!($dest2, $(float(zero(T))), $src, $kernel, $Rpre, axes($dest2, $d), $Rpost)
src_t = OffsetArray([(rand(T),rand(T),rand(T)) for i in srcax, j in srcax, k in srcax], srcax, srcax, srcax);
srcr = reinterpret(reshape, T, src_t);
fT = float(T);
dest1_t = Array{Tuple{fT,fT,fT}}(undef, M, M, M);
dest2_t = similar(dest1_t);
dest1r = reinterpret(reshape, fT, dest1_t);
dest2r = reinterpret(reshape, fT, dest2_t);
d = 1;
Rprer = CartesianIndices((LoopVectorization.Static(1):LoopVectorization.Static(3), axes(dest1_t)[1:d-1]...));
Rpostr = CartesianIndices(axes(dest1_t)[d+1:end]);
smoothdim_kernel_tile!( dest1r, float(zero(T)), srcr, kernel, Rprer, axes(dest1_t, d), Rpostr);
smoothdim_kernel_tile_avx_inner!(dest2r, float(zero(T)), srcr, kernel, Rprer, axes(dest1_t, d), Rpostr);
@benchmark smoothdim_kernel_tile!( $dest1r, float(zero($T)), $srcr, $kernel, $Rprer, axes($dest1_t, $d), $Rpostr)
@benchmark smoothdim_kernel_tile_sa!( $dest1r, float(zero($T)), $srcr, $kernel, $Rprer, axes($dest1_t, $d), $Rpostr)
@benchmark smoothdim_kernel_tile_avx_inner!($dest2r, float(zero($T)), $srcr, $kernel, $Rprer, axes($dest2_t, $d), $Rpostr)
@benchmark smoothdim_kernel_tile_avx_middle!($dest2r, float(zero($T)), $srcr, $kernel, $Rprer, axes($dest2_t, $d), $Rpostr)
@benchmark smoothdim_kernel_tile_avx_outer!($dest2r, float(zero($T)), $srcr, $kernel, $Rprer, axes($dest2_t, $d), $Rpostr)
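For readers unfamiliar with reinterpret(reshape, T, A): it exposes an array of homogeneous tuples as a plain array with one extra leading axis, which is why srcr and dest1r above have the tuple index as their first dimension. A quick REPL check:
julia> t = [(1.0f0, 2.0f0, 3.0f0) for _ in 1:2, _ in 1:2];

julia> r = reinterpret(reshape, Float32, t);

julia> size(r)    # the 3-tuples become a new leading axis of length 3
(3, 2, 2)

julia> r[2, 1, 1]
2.0f0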
The sa variant uses a static axis and @simd ivdep to encourage LLVM to SIMD the loop.
Among the @turbo variants, the inner one is like those from the tests.
The middle and outer variants move @turbo to outer loops, giving LV more options for optimization.
This is particularly relevant for d = 1, where most of the loops are outer loops, meaning LV doesn't have many options when @turbo is applied toward the inside.
In fact, when d=1, Rpre is 0-dimensional, meaning the only loop LV can optimize is the inner-most loop.
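Concretely, with d = 1 in the setup above, Rpre is built from zero axes and iterates exactly once with an empty index:
julia> Rpre = CartesianIndices(axes(dest1)[1:0])    # d = 1, so no pre-axes
CartesianIndices(())

julia> length(Rpre)
1

julia> only(Rpre)
CartesianIndex()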
Benchmark results for Float32:
julia> @benchmark smoothdim_kernel_tile!( $dest1, $(float(zero(T))), $src, $kernel, $Rpre, axes($dest1, $d), $Rpost)
BenchmarkTools.Trial: 10000 samples with 6 evaluations.
Range (min … max): 5.661 μs … 10.469 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 5.753 μs ┊ GC (median): 0.00%
Time (mean ± σ): 5.888 μs ± 284.296 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▆▆ ▃█▆ ▃▆▁ ▂▄▁ ▂▃ ▂▇▄ ▂
███▁███▅███▆▅███▇███▆▇▇▇▆▇▆▅▅▆▆▆▆▃▅▄▆▅▅▅▃▇▆▅▅███▇▆▆▇▆▇▄▅▅▇▇ █
5.66 μs Histogram: log(frequency) by time 6.56 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark smoothdim_kernel_tile_sa!( $dest1, $(float(zero(T))), $src, $kernel, $Rpre, axes($dest1, $d), $Rpost)
BenchmarkTools.Trial: 10000 samples with 9 evaluations.
Range (min … max): 2.445 μs … 4.076 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 2.562 μs ┊ GC (median): 0.00%
Time (mean ± σ): 2.545 μs ± 68.339 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▁ █▆
▃▅▅▃▂▁▁▁▁▁▁▂▂▄▅█▅▂▁▁▁▁▂▂▂▃▇███▃▂▁▂▂▂▂▂▂▂▂▁▂▁▁▁▁▂▂▂▂▂▂▂▂▂▂▂ ▃
2.44 μs Histogram: frequency by time 2.69 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark smoothdim_kernel_tile_avx_inner!($dest2, $(float(zero(T))), $src, $kernel, $Rpre, axes($dest2, $d), $Rpost)
BenchmarkTools.Trial: 10000 samples with 9 evaluations.
Range (min … max): 2.960 μs … 13.824 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 3.210 μs ┊ GC (median): 0.00%
Time (mean ± σ): 3.184 μs ± 137.116 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▃▄▂ ▅█▅
▂▂▃▄▄▃▂▂▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▅███▆▃▂▂▂▂▂▆████▄▃▂▂▂▂▂▂▂▂▂▃▂▂▂▂▂ ▃
2.96 μs Histogram: frequency by time 3.35 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark smoothdim_kernel_tile_avx_middle!($dest2, $(float(zero(T))), $src, $kernel, $Rpre, axes($dest2, $d), $Rpost)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
Range (min … max): 1.067 μs … 2.597 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 1.123 μs ┊ GC (median): 0.00%
Time (mean ± σ): 1.130 μs ± 64.745 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▅█ ▂▆
▂▆▆▂▁▁▁▂▃▅▃▁▁▁▁▂██▄▂▁▂▁▂██▄▂▂▂▁▁▂▂▁▂▂▂▁▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
1.07 μs Histogram: frequency by time 1.25 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark smoothdim_kernel_tile_avx_outer!($dest2, $(float(zero(T))), $src, $kernel, $Rpre, axes($dest2, $d), $Rpost)
BenchmarkTools.Trial: 10000 samples with 187 evaluations.
Range (min … max): 548.583 ns … 3.580 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 596.374 ns ┊ GC (median): 0.00%
Time (mean ± σ): 592.741 ns ± 32.537 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▂ ▁ ▂▁ ▄▇▇▁ ▃▄▂ ▁▁▁▂▅██▄▁▃▅▄▁ ▂
▆▁▁▁▁▁▁▇█▇▁▃▆▆▃▁▁▃▅▇█▆▃▅██▃▁▅▆▇████▇█████████████████▇▆▄▅▃▇█ █
549 ns Histogram: log(frequency) by time 612 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
As we might expect, SIMD + static size information helped LLVM, and static size info should help LV too. But even without it, LV won handily when it was able to optimize the middle and outer loops.
Also as expected, performance was best when @turbo was applied to the outer-most loop.
However, it seems to be mis-optimizing when using NTuple{3,Float32}:
julia> @benchmark smoothdim_kernel_tile!( $dest1r, float(zero($T)), $srcr, $kernel, $Rprer, axes($dest1_t, $d), $Rpostr)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 40.116 μs … 599.252 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 41.064 μs ┊ GC (median): 0.00%
Time (mean ± σ): 41.228 μs ± 5.685 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
█
▄▄▂▂▁▁▁▁▁▁▁▁▁▁▁▁▂▃█▃▂▁▂▂▂▂▁▂▂▂▂▂▂▂▂▂▄▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▂
40.1 μs Histogram: frequency by time 43.2 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark smoothdim_kernel_tile_sa!( $dest1r, float(zero($T)), $srcr, $kernel, $Rprer, axes($dest1_t, $d), $Rpostr)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 13.721 μs … 22.268 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 13.812 μs ┊ GC (median): 0.00%
Time (mean ± σ): 13.950 μs ± 537.405 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
█▇
▂▇██▅▂▂▂▂▁▂▂▃▆█▅▂▂▂▁▁▁▂▁▁▁▁▁▁▁▁▁▂▂▂▂▂▂▂▂▂▁▁▂▁▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂ ▃
13.7 μs Histogram: frequency by time 15.3 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark smoothdim_kernel_tile_avx_inner!($dest2r, float(zero($T)), $srcr, $kernel, $Rprer, axes($dest2_t, $d), $Rpostr)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 9.681 μs … 33.781 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 9.854 μs ┊ GC (median): 0.00%
Time (mean ± σ): 9.919 μs ± 612.790 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▇█ ▄▇▇▂ ▂▄ ▂▁ ▂
███▁████▁███▁▃▁▁▁▁▁▅▇▇▃▁▆▇▇▅▇██▁▁▇█▆▁▁▃▄▁▄▁▁▄▁▁▃▁▁▅▄▁▃▃▁▄▃▄ █
9.68 μs Histogram: log(frequency) by time 12.1 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark smoothdim_kernel_tile_avx_middle!($dest2r, float(zero($T)), $srcr, $kernel, $Rprer, axes($dest2_t, $d), $Rpostr)
BenchmarkTools.Trial: 10000 samples with 9 evaluations.
Range (min … max): 3.032 μs … 5.296 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 3.178 μs ┊ GC (median): 0.00%
Time (mean ± σ): 3.154 μs ± 99.313 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▂ ▆█▂
▂▃▄▄▃▂▂▂▁▁▁▁▁▁▂▃███▄▂▂▂▂▂▂▂▂▂▂▃▇███▅▃▂▂▂▂▂▂▂▂▂▁▁▁▂▂▂▂▂▂▂▂▂ ▃
3.03 μs Histogram: frequency by time 3.29 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark smoothdim_kernel_tile_avx_outer!($dest2r, float(zero($T)), $srcr, $kernel, $Rprer, axes($dest2_t, $d), $Rpostr)
BenchmarkTools.Trial: 10000 samples with 6 evaluations.
Range (min … max): 6.455 μs … 14.940 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 6.636 μs ┊ GC (median): 0.00%
Time (mean ± σ): 6.662 μs ± 330.102 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▁▂▃▄▅▆▇██▇▅▂ ▁▂▂▂ ▂
▃▇▆████████████▇▇██████▆▄▄▅▃▅▁▃▅▃▁▄▃▁▁▁▁▃▁▁▁▃▃▃▁▁▃▁▁▁▃▃▃▄▅▅ █
6.45 μs Histogram: log(frequency) by time 7.43 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
The middle variant was about 3x slower than it was for Float32, and about 5.5x slower than the Float32 outer-loop version.
The outer-loop version, however, regressed for NTuple{3,Float32}: it was over 2x slower than applying @turbo to the middle loop.
Investigating...
function smoothdim_kernel_tile_avx_middle_ls(out, z, src::AbstractArray, kernel::AbstractVector, Rpre::CartesianIndices, axout_tile, Rpost::CartesianIndices)
    axkernel = axes(kernel, 1)
    zout = convert(eltype(out), z)
    let Ipost = first(Rpost)
        return LoopVectorization.@turbo_debug for i in axout_tile
            for Ipre in Rpre
                tmp = zout
                # tmp = convert(eltype(out), z) # failing to hoist this leads to an "UndefVarError: tmp not defined"
                for j in axkernel
                    tmp += oftype(z, src[Ipre,i+j,Ipost])*kernel[j]
                end
                out[Ipre,i,Ipost] = tmp
            end
        end
    end
end
function smoothdim_kernel_tile_avx_outer_ls(out, z, src::AbstractArray, kernel::AbstractVector, Rpre::CartesianIndices, axout_tile, Rpost::CartesianIndices)
    axkernel = axes(kernel, 1)
    zout = convert(eltype(out), z)
    LoopVectorization.@turbo_debug for Ipost in Rpost
        for i in axout_tile
            for Ipre in Rpre
                tmp = zout
                # tmp = convert(eltype(out), z) # failing to hoist this leads to an "UndefVarError: tmp not defined"
                for j in axkernel
                    tmp += oftype(z, src[Ipre,i+j,Ipost])*kernel[j]
                end
                out[Ipre,i,Ipost] = tmp
            end
        end
    end
end
ls_middle = smoothdim_kernel_tile_avx_middle_ls(dest1, float(zero(T)), src, kernel, Rpre, axes(dest1, d), Rpost);
ls_middle_r = smoothdim_kernel_tile_avx_middle_ls(dest1r, float(zero(T)), srcr, kernel, Rprer, axes(dest1_t, d), Rpostr);
ls_outer = smoothdim_kernel_tile_avx_outer_ls(dest1, float(zero(T)), src, kernel, Rpre, axes(dest1, d), Rpost);
ls_outer_r = smoothdim_kernel_tile_avx_outer_ls(dest1r, float(zero(T)), srcr, kernel, Rprer, axes(dest1_t, d), Rpostr);
LoopVectorization.choose_order_cost(ls_middle)
LoopVectorization.choose_order_cost(ls_middle_r)
LoopVectorization.choose_order_cost(ls_outer)
LoopVectorization.choose_order_cost(ls_outer_r)
julia> LoopVectorization.choose_order_cost(ls_middle)
([:i, :j], :i, :j, :i, 2, 8, 1.0633290809765626e7, false)
julia> LoopVectorization.choose_order_cost(ls_middle_r)
([:i, Symbol("Ipre#1#"), :j], :i, :j, Symbol("Ipre#1#"), 6, 7, 1.463593029812362e6, false)
julia> LoopVectorization.choose_order_cost(ls_outer)
([Symbol("Ipost#2#"), Symbol("Ipost#1#"), :i, :j], :i, Symbol("Ipost#2#"), :i, 2, 8, 1.117807007143084e13, false)
julia> LoopVectorization.choose_order_cost(ls_outer_r)
([:i, Symbol("Ipost#2#"), Symbol("Ipost#1#"), Symbol("Ipre#1#"), :j], :i, :j, Symbol("Ipost#1#"), 6, 7, 7.524636336629878e11, false)
It seems to be unrolling a lot more aggressively with tuples.
Also, looking at the assembly, it is using gathers instead of shuffles because it's not vectorizing i.
I get basically the same results on Cascade Lake as on Tiger Lake.
For reference, middle is vectorizing Ipre#1#, i.e. it is vectorizing the tuple of length 3. This is pretty wasteful with AVX512, but seems to be the fastest.
I experimented with getting it to use shuffles instead, which requires a different unroll/vectorization combination.
This was slower, indicating it was in fact better to vectorize Ipost, letting the unrolling of i and j reuse the slow-to-gather vectors (as LV was choosing to do), than to vectorize i, which prevents that reuse but makes the loads themselves cheaper.
What LV got wrong, though, is that it is apparently even better still to load contiguous vectors of length 3 than discontiguous vectors of length 16 plus some trick to make them cheaper.
LV didn't have the option of vectorizing Ipost in the middle variant, so it ended up vectorizing Ipre there.
So when it comes to fiddling with costs, the answer might be to increase the cost of discontiguous loads.
However, this is also a function of loop lengths.
M = 11, so vectors of length 16, as on AVX512, are going to be at a disadvantage. This is sort of a corner case, and I'm not sure we should weigh it so heavily in changing the algorithm.
It might be worth adding a means of giving suggested loop lengths that its optimizer can use.
EDIT:
M=16 improves the ratio a bit, but vectorizing Rpre is still the winner:
julia> @benchmark smoothdim_kernel_tile_avx_middle!($dest2r, float(zero($T)), $srcr, $kernel, $Rprer, axes($dest2_t, $d), $Rpostr)
BenchmarkTools.Trial: 10000 samples with 3 evaluations.
Range (min … max): 9.019 μs … 41.290 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 9.065 μs ┊ GC (median): 0.00%
Time (mean ± σ): 9.121 μs ± 543.360 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▇█▃ ▂▁ ▂ ▁
▆███▇█▄▁▁▁▅██▄▇██▆▆▆▆▅▄▅▄▅▅▅▄▃▅▅▄▅▁▃▁▁▃▄▄▁▁▃▁▃▃▄▄▄▄▃▃▁▅█▄▃▅ █
9.02 μs Histogram: log(frequency) by time 10.2 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark smoothdim_kernel_tile_avx_outer!($dest2r, float(zero($T)), $srcr, $kernel, $Rprer, axes($dest2_t, $d), $Rpostr)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 13.459 μs … 36.150 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 13.596 μs ┊ GC (median): 0.00%
Time (mean ± σ): 13.655 μs ± 710.729 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▄█▆▂
▂▂▃▇████▅▃▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▁▁▁▁▂▂▂ ▃
13.5 μs Histogram: frequency by time 14.9 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
LinuxPerf comparison:
julia> using LinuxPerf
julia> foreachf(f::F, N::Int, arg1::A, args::Vararg{Any,K}) where {F,A,K} = foreach(_ -> f(arg1, args...), 1:N)
foreachf (generic function with 1 method)
julia> @pstats "(cpu-cycles,task-clock),(instructions,branch-instructions,branch-misses), (L1-dcache-load-misses, L1-dcache-loads, cache-misses, cache-references)" begin
foreachf(smoothdim_kernel_tile_avx_middle!, 100_000, dest2r, float(zero(T)), srcr, kernel, Rprer, axes(dest2_t, d), Rpostr)
end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌ cpu-cycles 4.15e+09 100.0% # 4.7 cycles per ns
└ task-clock 8.86e+08 100.0% # 885.6 ms
┌ instructions 1.54e+10 100.0% # 3.7 insns per cycle
│ branch-instructions 1.16e+09 100.0% # 7.5% of instructions
└ branch-misses 1.71e+06 100.0% # 0.1% of branch instructions
┌ L1-dcache-load-misses 2.19e+08 100.0% # 6.0% of dcache loads
│ L1-dcache-loads 3.65e+09 100.0%
│ cache-misses 2.33e+03 100.0% # 61.0% of cache references
└ cache-references 3.82e+03 100.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
julia> @pstats "(cpu-cycles,task-clock),(instructions,branch-instructions,branch-misses), (L1-dcache-load-misses, L1-dcache-loads, cache-misses, cache-references)" begin
foreachf(smoothdim_kernel_tile_avx_outer!, 100_000, dest2r, float(zero(T)), srcr, kernel, Rprer, axes(dest2_t, d), Rpostr)
end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌ cpu-cycles 7.19e+09 100.0% # 4.6 cycles per ns
└ task-clock 1.57e+09 100.0% # 1.6 s
┌ instructions 5.16e+09 100.0% # 0.7 insns per cycle
│ branch-instructions 1.75e+08 100.0% # 3.4% of instructions
└ branch-misses 4.91e+05 100.0% # 0.3% of branch instructions
┌ L1-dcache-load-misses 4.92e+08 100.0% # 24.9% of dcache loads
│ L1-dcache-loads 1.98e+09 100.0%
│ cache-misses 3.20e+03 100.0% # 74.1% of cache references
└ cache-references 4.31e+03 100.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Outer executes less than half the instructions (although we may have expected/hoped for a larger gap). However, it achieves only a small fraction of the instructions per cycle (0.7 vs 3.7) and suffers far more L1-dcache load misses (24.9% vs 6.0% of loads).
Just benchmarked on Haswell (AVX2, FMA, but no AVX512; relatively slow gather & no scatter), and I see the same basic pattern as on Tiger Lake.
@timholy (Feel free to also share your benchmark code)
Try the teh/lv2 branch of ImageFiltering, which likely requires https://github.com/JuliaArrays/ArrayInterface.jl/pull/210 and the master branch of TiledIteration (you need at least https://github.com/JuliaArrays/TiledIteration.jl/pull/32). Other than that all versions are released:
(ImageFiltering) pkg> st
Project ImageFiltering v0.7.0
Status `~/.julia/dev/ImageFiltering/Project.toml`
[4fba245c] ArrayInterface v3.1.33 `~/.julia/dev/ArrayInterface`
[aafaddc9] CatIndices v0.2.2
[ed09eef8] ComputationalResources v0.3.2
[864edb3b] DataStructures v0.18.10
[4f61f5a4] FFTViews v0.3.1
[7a1cc6ca] FFTW v1.4.5
[a09fc81d] ImageCore v0.9.3
[bdcacae8] LoopVectorization v0.12.77
[6fe1bfb0] OffsetArrays v1.10.7
[189a3867] Reexport v1.2.2
[90137ffa] StaticArrays v1.2.12
[06e1c1a7] TiledIteration v0.4.0 `~/.julia/dev/TiledIteration`
[37e2e46d] LinearAlgebra
[2f01184e] SparseArrays
[10745b16] Statistics
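For reference, the branch installs above can be done from the Pkg REPL along these lines (assuming the registered package names; the ArrayInterface PR branch still needs to be checked out in your dev directory separately):
pkg> add ImageFiltering#teh/lv2

pkg> add TiledIteration#master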
Then you can try this benchmark:
julia> using ImageFiltering, ImageCore
julia> run_blur(img) = imfilter(img, KernelFactors.gaussian(ntuple(_->5.0f0, ndims(img)), ntuple(_->21, ndims(img))))
run_blur (generic function with 1 method)
julia> img = rand(RGB{N0f8}, 2048, 2048);
julia> run_blur(img);
That run_blur definition creates a Gaussian filter with σ = 5 pixels along each axis.
A bit of background on the design: ImageFiltering uses a buffer-swap paradigm for separable convolution (well, correlation, actually): basically imfilter!(buf, A, kernel1) followed by imfilter!(out, buf, kernel2). It defines ReshapedOneD objects that act like a reshaped one-dimensional vector, so kernel1 has extent along the first filtering axis and kernel2 along the second. (They are not actually AbstractArrays because the package also supports IIR filtering, which can't really be thought of as an AbstractVector.) Apologies for the lack of a decent Base.show method for them; I noticed that omission recently and will fix it soon.
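A minimal sketch of the buffer-swap idea for a 2-d array and a plain vector kernel (illustration only; ImageFiltering's real imfilter! handles offset axes, boundary conditions, and ReshapedOneD kernels, none of which appear here):
# Sketch only: separable correlation via buffer swap, with simple
# replicate-edge (clamping) boundary handling.
function separable_filter(A::AbstractMatrix, k::AbstractVector)
    r = length(k) ÷ 2
    buf = similar(A, float(eltype(A)))   # intermediate buffer
    out = similar(buf)
    # Pass 1: correlate along dimension 1, A -> buf.
    for j in axes(A, 2), i in axes(A, 1)
        acc = zero(eltype(buf))
        for (m, w) in pairs(k)
            ii = clamp(i + m - r - 1, firstindex(A, 1), lastindex(A, 1))
            acc += w * A[ii, j]
        end
        buf[i, j] = acc
    end
    # Pass 2: correlate along dimension 2, buf -> out.
    for j in axes(A, 2), i in axes(A, 1)
        acc = zero(eltype(out))
        for (m, w) in pairs(k)
            jj = clamp(j + m - r - 1, firstindex(A, 2), lastindex(A, 2))
            acc += w * buf[i, jj]
        end
        out[i, j] = acc
    end
    return out
end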
What makes it more complicated is that TiledIteration tries to keep the intermediate buf in L1 cache, and so processes these in small chunks. It doesn't have access to hardware info, though; the tile size was just hardcoded years ago and hence is probably quite conservative. On the agenda is to look into HostCPUFeatures and see if I can get it from there. The tiling also involves some duplication of work, and for certain kernel sizes that's probably a bad thing, but for many practical cases it's a pretty good approach. There are some discussions of this (dated by modern CPU standards...) in https://people.csail.mit.edu/jrk/halide-pldi13.pdf. It would be good, though, for ImageFiltering to have a few different options implemented and branch among them depending on parameters.
On the agenda is to look into HostCPUFeatures and see if I can get it from there.
Names could be a bit more informative, but --
- HostCPUFeatures.jl is for the instruction set side of things, queried from LLVM.
- CPUSummary.jl is for information from Hwloc.jl, such as cache sizes or number of cores.
They both define functions when precompiling, and then in __init__ check to make sure the functions are correct. If not, they @eval to correct them, relying on invalidation to fix anything dependent on them.
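For example, on the instruction-set side (the exact feature symbols are platform-dependent; output shown for an AVX512 machine):
julia> using HostCPUFeatures

julia> HostCPUFeatures.has_feature(Val(:x86_64_avx512f))
static(true)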
julia> using CPUSummary
julia> CPUSummary.cache_size(Val(1))
static(32768)
julia> Sys.CPU_NAME
"cascadelake"
Most CPUs still have 32 KiB L1d caches, but Intel Ice Lake and Tiger Lake have moved to 48 KiB L1d.
julia> using CPUSummary
julia> CPUSummary.cache_size(Val(1))
static(49152)
julia> Sys.CPU_NAME
"tigerlake"
And the M1 has 128 KiB L1d on its performance cores (these values are hard coded, because Hwloc reports on the efficiency cores instead):
julia> using CPUSummary
julia> CPUSummary.cache_size(Val(1))
static(131072)
julia> versioninfo()
Julia Version 1.8.0-DEV.590
Commit 2c9e051c46* (2021-09-23 21:35 UTC)
Platform Info:
OS: macOS (arm64-apple-darwin20.6.0)
CPU: Apple M1
So there is some diversity in cache sizes.
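Given that diversity, a tile-size heuristic could key off cache_size rather than a hardcoded constant. A minimal sketch (tile_edge is a hypothetical helper, not TiledIteration's actual logic):
using CPUSummary

# Hypothetical helper: pick a square tile edge so that `nbuffers` tiles of
# eltype T fill about `fraction` of the L1d cache. TiledIteration's real
# heuristics are more involved; this only shows how the query could be used.
function tile_edge(::Type{T}; nbuffers = 2, fraction = 0.5) where {T}
    l1 = Int(CPUSummary.cache_size(Val(1)))    # L1d size in bytes
    floor(Int, sqrt(fraction * l1 / (nbuffers * sizeof(T))))
end
Under these assumptions, a 32 KiB L1d gives tile_edge(Float32) == 45, while the M1's 128 KiB gives 90.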