LoopVectorization.jl Performance for stride 2

Open olof3 opened this issue 1 year ago • 0 comments

Hello,

I tried to use @turbo to speed up a loop that reads its input with stride 2. This actually degraded performance compared to plain @inbounds, and @tturbo was even slower. I realize that non-contiguous access is much trickier to vectorize, but I wasn't expecting this much of a degradation. I couldn't find much information about this, so perhaps it is a rare use case. Is this to be expected, or should something be done differently?

I tried to boil it down to the following MWE.

Thanks!

using LoopVectorization
using BenchmarkTools # provides @benchmark, used below

function test_stride2_inbounds(out, x)
    @inbounds for k = 3:length(x)÷2 # @turbo does not work, unsure why
        acc = 0
        acc += x[2k-1]
        acc += x[2k-2]
        acc += x[2k-3]
        acc += x[2k-4]
        acc += x[2k-5]
        out[k] = acc
    end

    return out
end

function test_stride2_turbo(out, x)
    @turbo for k = 3:length(x)÷2 # same loop as above, with @turbo instead of @inbounds
        acc = 0
        acc += x[2k-1]
        acc += x[2k-2]
        acc += x[2k-3]
        acc += x[2k-4]
        acc += x[2k-5]
        out[k] = acc
    end

    return out
end

function test_stride2_tturbo(out, x)
    @tturbo for k = 3:length(x)÷2 # same loop again, with the threaded @tturbo
        acc = 0
        acc += x[2k-1]
        acc += x[2k-2]
        acc += x[2k-3]
        acc += x[2k-4]
        acc += x[2k-5]
        out[k] = acc
    end

    return out
end


function test_stride1_turbo(out, x)
    @turbo for k = 6:length(x)÷2 # start at 6 so that x[k-5] stays in bounds
        acc = 0
        acc += x[k-1]
        acc += x[k-2]
        acc += x[k-3]
        acc += x[k-4]
        acc += x[k-5]
        out[k] = acc
    end

    return out
end

function test_stride1_tturbo(out, x)
    @tturbo for k = 6:length(x)÷2 # start at 6 so that x[k-5] stays in bounds
        acc = 0
        acc += x[k-1]
        acc += x[k-2]
        acc += x[k-3]
        acc += x[k-4]
        acc += x[k-5]
        out[k] = acc
    end

    return out
end


x = rand(-1000:1000, 2^17)
out1, out2, out3, out4, out5 = (similar(x) for _=1:5)


println("# Threads = $(Threads.nthreads())")

# Interpolate the globals with $ so that access to untyped globals is not part of the measurement
println("\ntest_stride2_inbounds:")
display( @benchmark test_stride2_inbounds($out1, $x) )
println("\ntest_stride2_turbo:")
display( @benchmark test_stride2_turbo($out2, $x) )
println("\ntest_stride2_tturbo:")
display( @benchmark test_stride2_tturbo($out3, $x) )
println("\ntest_stride1_turbo:")
display( @benchmark test_stride1_turbo($out4, $x) )
println("\ntest_stride1_tturbo:")
display( @benchmark test_stride1_tturbo($out5, $x) )
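
Before comparing timings, it may be worth confirming that every stride-2 variant writes the same values. Below is a minimal sketch of such a check in plain Julia, with bounds checking left on. The names `stride2_reference` and `x_small` are hypothetical and not part of the MWE, and the five-term sum is written as a slice sum, which assumes the kernel is meant to add `x[2k-5]` through `x[2k-1]`.

```julia
# Hypothetical bounds-checked reference for the stride-2 kernel (no macros).
# out[k] is the sum of the five consecutive entries x[2k-5], ..., x[2k-1].
function stride2_reference(out, x)
    for k = 3:length(x)÷2
        out[k] = sum(@view x[2k-5:2k-1])
    end
    return out
end

# Small hand-checkable input: x_small[i] == i.
x_small = collect(1:20)
ref = stride2_reference(zeros(Int, length(x_small)), x_small)

# k = 3 sums x[1:5]; k = 10 sums x[15:19]. Entries 1 and 2 are never written.
println(ref[3], " ", ref[10])
```

With the functions from the MWE defined, the same comparison could be made on the real data, for example `r = 3:length(x)÷2; stride2_reference(similar(x), x)[r] == test_stride2_turbo(out2, x)[r]`. The comparison is restricted to `r` because `similar(x)` leaves the other entries uninitialized. The output of the benchmark script follows.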

# Threads = 10

test_stride2_inbounds:
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  53.300 μs … 867.500 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     56.800 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   59.665 μs ±  23.947 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▃   ▁   █▄▁  ▁▁   ▃▇    ▄▇▂   ▂▁      ▄▁    ▁               ▂
  ██▇▅███▇▅███▆▇███▇▆███▅▇▆█████▅██▇█▆▆▅▅██▇▆▅▁█▆▅▄▆▅▄▆▄▄▁▄▄▁▅ █
  53.3 μs       Histogram: log(frequency) by time      74.6 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

test_stride2_turbo:
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   80.900 μs …   3.360 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     110.400 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   134.626 μs ± 133.504 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

   █   
  ▄██▄▂▂▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▁▁▁▂▂▂▂▁▂▁▂▂▁▂▂▂▂▂▂▂▁▂▂▂▂▁▂ ▂
  80.9 μs          Histogram: frequency by time         1.01 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.

test_stride2_tturbo:
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  181.400 μs …  10.707 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     209.000 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   242.770 μs ± 203.921 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █▅█▅▃▃▃▂▁▁▁▁                                                  ▁
  ██████████████▇▇▇█▇▇▇▇▇▆▆▆▇▅▆▅▄▆▄▆▅▆▆▄▄▃▄▅▅▄▅▅▅▆▃▅▅▅▅▅▅▅▄▄▄▅▃ █
  181 μs        Histogram: log(frequency) by time        938 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

test_stride1_turbo:
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  24.900 μs … 956.400 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     31.100 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   32.754 μs ±  30.490 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▅▅▄▂▁     ▄█▆▇▇▆▃▂▁▆▄▂▄▃▁▁                                   ▂
  ██████▇▇▆▆████████████████▇█▇█▇▇▇▆▇▅▅▅▅▃▄▂▃▃▃▄▅▄▃▄▃▄▅▃▄▃▃▃▃▂ █
  24.9 μs       Histogram: log(frequency) by time        52 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

test_stride1_tturbo:
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  12.100 μs …  2.709 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     15.500 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   21.181 μs ± 51.142 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▆█▆▃▄▁▁▂                                                    ▂
  ███████████▇▇▆▇▇▆▆▆▅▅▅▅▅▅▅▅▆▅▅▃▄▅▄▄▅▃▅▅▅▄▄▃▁▄▅▅▅▅▅▅▅▅▅▁▃▄▄▅ █
  12.1 μs      Histogram: log(frequency) by time       132 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
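
To put the medians above side by side, the following sketch divides each one by the @inbounds baseline. The numbers are copied from the benchmark output in this post; the names `medians_us` and `slowdown` are hypothetical and only used for this illustration.

```julia
# Median times in microseconds, copied from the benchmark output above.
medians_us = (
    stride2_inbounds = 56.8,
    stride2_turbo    = 110.4,
    stride2_tturbo   = 209.0,
    stride1_turbo    = 31.1,
    stride1_tturbo   = 15.5,
)

# Ratio of each median to the stride-2 @inbounds baseline (> 1 means slower).
baseline = medians_us.stride2_inbounds
slowdown = map(t -> round(t / baseline; digits = 2), medians_us)

for (name, ratio) in pairs(slowdown)
    println(rpad(string(name), 18), ratio, "x")
end
```

On these numbers the stride-2 @turbo loop is roughly 1.9x slower than @inbounds, and @tturbo roughly 3.7x slower. The stride-1 kernels read only the first half of `x`, so their ratios are not a like-for-like comparison with the stride-2 ones.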

Dec 15 '23 08:12 olof3
