LoopVectorization.jl
Performance for stride 2
Hello,
I tried to leverage the speedups from @turbo for a case with stride-2 access of the input data. This actually degraded performance compared to @inbounds, and it got even worse with @tturbo: in the benchmarks below, the median time goes from 56.8 μs with @inbounds to 110.4 μs with @turbo and 209.0 μs with @tturbo. I realize that non-contiguous data is much trickier, but I wasn't expecting this much of a degradation. I didn't manage to find much information, so perhaps this is a rare use case. Is this to be expected, or should something be done differently?
I tried to boil it down to the following MWE.
Thanks!
using LoopVectorization
using BenchmarkTools  # provides @benchmark
function test_stride2_inbounds(out, x)
    @inbounds for k = 3:length(x)÷2 # @turbo does not work, unsure why
        acc = 0
        acc += x[2k-1]
        acc += x[2k-2]
        acc += x[2k-3]
        acc += x[2k-4]
        acc += x[2k-5]
        out[k] = acc
    end
    return out
end
function test_stride2_turbo(out, x)
    # Same stride-2 loop as above, with @turbo instead of @inbounds
    @turbo for k = 3:length(x)÷2
        acc = 0
        acc += x[2k-1]
        acc += x[2k-2]
        acc += x[2k-3]
        acc += x[2k-4]
        acc += x[2k-5]
        out[k] = acc
    end
    return out
end
function test_stride2_tturbo(out, x)
    # Same stride-2 loop, threaded with @tturbo
    @tturbo for k = 3:length(x)÷2
        acc = 0
        acc += x[2k-1]
        acc += x[2k-2]
        acc += x[2k-3]
        acc += x[2k-4]
        acc += x[2k-5]
        out[k] = acc
    end
    return out
end
function test_stride1_turbo(out, x)
    # Unit-stride comparison case.
    # NOTE: for k < 6 the index k-5 is below 1, so the first iterations read before the start of x.
    @turbo for k = 3:length(x)÷2
        acc = 0
        acc += x[k-1]
        acc += x[k-2]
        acc += x[k-3]
        acc += x[k-4]
        acc += x[k-5]
        out[k] = acc
    end
    return out
end
function test_stride1_tturbo(out, x)
    # Unit-stride comparison case, threaded.
    # NOTE: for k < 6 the index k-5 is below 1, so the first iterations read before the start of x.
    @tturbo for k = 3:length(x)÷2
        acc = 0
        acc += x[k-1]
        acc += x[k-2]
        acc += x[k-3]
        acc += x[k-4]
        acc += x[k-5]
        out[k] = acc
    end
    return out
end
x = rand(-1000:1000, 2^17)
out1, out2, out3, out4, out5 = (similar(x) for _=1:5)
println("# Threads = $(Threads.nthreads())")
println("\ntest_stride2_inbounds:")
display( @benchmark test_stride2_inbounds(out1, x) )
println("\ntest_stride2_turbo:")
display( @benchmark test_stride2_turbo(out2, x) )
println("\ntest_stride2_tturbo:")
display( @benchmark test_stride2_tturbo(out3, x) )
println("\ntest_stride1_turbo:")
display( @benchmark test_stride1_turbo(out4, x) )
println("\ntest_stride1_tturbo:")
display( @benchmark test_stride1_tturbo(out5, x) )
# Threads = 10
test_stride2_inbounds:
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 53.300 μs … 867.500 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 56.800 μs ┊ GC (median): 0.00%
Time (mean ± σ): 59.665 μs ± 23.947 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
█▃ ▁ █▄▁ ▁▁ ▃▇ ▄▇▂ ▂▁ ▄▁ ▁ ▂
██▇▅███▇▅███▆▇███▇▆███▅▇▆█████▅██▇█▆▆▅▅██▇▆▅▁█▆▅▄▆▅▄▆▄▄▁▄▄▁▅ █
53.3 μs Histogram: log(frequency) by time 74.6 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
test_stride2_turbo:
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 80.900 μs … 3.360 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 110.400 μs ┊ GC (median): 0.00%
Time (mean ± σ): 134.626 μs ± 133.504 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
█
▄██▄▂▂▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▂▂▂▂▂▁▁▁▂▂▂▂▁▂▁▂▂▁▂▂▂▂▂▂▂▁▂▂▂▂▁▂ ▂
80.9 μs Histogram: frequency by time 1.01 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
test_stride2_tturbo:
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 181.400 μs … 10.707 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 209.000 μs ┊ GC (median): 0.00%
Time (mean ± σ): 242.770 μs ± 203.921 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
█▅█▅▃▃▃▂▁▁▁▁ ▁
██████████████▇▇▇█▇▇▇▇▇▆▆▆▇▅▆▅▄▆▄▆▅▆▆▄▄▃▄▅▅▄▅▅▅▆▃▅▅▅▅▅▅▅▄▄▄▅▃ █
181 μs Histogram: log(frequency) by time 938 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
test_stride1_turbo:
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 24.900 μs … 956.400 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 31.100 μs ┊ GC (median): 0.00%
Time (mean ± σ): 32.754 μs ± 30.490 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▅▅▄▂▁ ▄█▆▇▇▆▃▂▁▆▄▂▄▃▁▁ ▂
██████▇▇▆▆████████████████▇█▇█▇▇▇▆▇▅▅▅▅▃▄▂▃▃▃▄▅▄▃▄▃▄▅▃▄▃▃▃▃▂ █
24.9 μs Histogram: log(frequency) by time 52 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
test_stride1_tturbo:
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 12.100 μs … 2.709 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 15.500 μs ┊ GC (median): 0.00%
Time (mean ± σ): 21.181 μs ± 51.142 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▆█▆▃▄▁▁▂ ▂
███████████▇▇▆▇▇▆▆▆▅▅▅▅▅▅▅▅▆▅▅▃▄▅▄▄▅▃▅▅▅▄▄▃▁▄▅▅▅▅▅▅▅▅▅▁▃▄▄▅ █
12.1 μs Histogram: log(frequency) by time 132 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
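One thing I could try, as an untested idea rather than a known fix: since the five taps 2k-5 ... 2k-1 alternate between odd and even positions of x, the input could be split once into its odd-indexed and even-indexed halves so that the inner loop only touches contiguous data. The sketch below uses plain Julia only; the names deinterleave, stride2_deinterleaved! and stride2_reference! are made up for illustration and are not part of LoopVectorization.jl. I have not measured whether the extra copy pays off, or whether @turbo on the inner loop would help.

```julia
# Hypothetical helper names for illustration; not part of LoopVectorization.jl.

# Split x into its odd-indexed and even-indexed elements (two contiguous copies).
function deinterleave(x::AbstractVector)
    n = length(x) ÷ 2
    odd  = x[1:2:2n]   # x[1], x[3], ..., x[2n-1]
    even = x[2:2:2n]   # x[2], x[4], ..., x[2n]
    return odd, even
end

# Same five-tap sum as the stride-2 loop, but every access is unit-stride:
#   x[2k-1] == odd[k],    x[2k-2] == even[k-1], x[2k-3] == odd[k-1],
#   x[2k-4] == even[k-2], x[2k-5] == odd[k-2]
function stride2_deinterleaved!(out, odd, even)
    @inbounds for k = 3:length(odd)
        out[k] = odd[k] + even[k-1] + odd[k-1] + even[k-2] + odd[k-2]
    end
    return out
end

# Reference: the original stride-2 indexing, with bounds checks left on.
function stride2_reference!(out, x)
    for k = 3:length(x) ÷ 2
        out[k] = x[2k-1] + x[2k-2] + x[2k-3] + x[2k-4] + x[2k-5]
    end
    return out
end

x = rand(-1000:1000, 2^10)
odd, even = deinterleave(x)
out_ref = zeros(Int, length(x))   # zero-filled so untouched entries compare equal
out_new = zeros(Int, length(x))
stride2_reference!(out_ref, x)
stride2_deinterleaved!(out_new, odd, even)
println(out_ref == out_new)
```

The split costs one extra pass over x plus two allocations, so it would only make sense if the same data is reused or if the unit-stride loop is much faster than the strided one.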