LoopVectorization.jl
Increased execution time bug (Ryzen CPU only?)
Hi,
I have found a slowdown when using the `@turbo` macro:
```julia
using LoopVectorization, BenchmarkTools

function testSum(indices, Vectors)
    A, B, C, D = Vectors
    sum = 0.0
    for ksum in eachindex(indices)
        k = indices[ksum]
        sum += A[k] * B[k] + C[k] * D[k]
    end
    return sum
end

function testSum_avx(indices, Vectors)
    A, B, C, D = Vectors
    sum = 0.0
    @turbo for ksum in eachindex(indices)
        k = indices[ksum]
        sum += A[k] * B[k] + C[k] * D[k]
    end
    return sum
end

N = 63
Vectors = Tuple(rand(N) for _ in 1:4);
indices = rand(1:N, 500);

@btime testSum($indices, $Vectors)
@btime testSum_avx($indices, $Vectors)
```
The output is:
```
510.417 ns (0 allocations: 0 bytes)
871.698 ns (0 allocations: 0 bytes)
```
I found this behaviour on two Ryzen CPUs (a Ryzen 2700 and a Ryzen 5 3500U). On an Intel CPU (a Xeon Gold 6130), the `@turbo` version does run faster.
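As an aside, a quick way to check which microarchitecture Julia is targeting on your machine (a small sketch, not part of the original report) is:

```julia
# Print the LLVM target CPU name for the host; for example, Zen/Zen+ parts
# typically report "znver1", Zen2 reports "znver2",
# and Skylake-X reports "skylake-avx512".
println(Sys.CPU_NAME)
```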
For reference, I posted a question about this behaviour at https://discourse.julialang.org/t/loopvectorization-almost-doubles-execution-time/64333.
To copy what I said on Discourse:
Regarding the performance of different architectures, here is a table giving the reciprocal throughput of 256-bit gathers that use 4 x Int64 indices to load 4 x Float64.
Lower is better; a reciprocal throughput of n means that, when many such instructions are in flight, one completes on average every n clock cycles.
Latency would give the time for a single instruction in isolation, but because many instructions execute in parallel, throughput is the more useful figure here.
| Arch | RThroughput (cycles) |
|---|---|
| Zen+ | 12 |
| Zen2 | 9 |
| Zen3 | 4 |
| Haswell | 8 |
| Skylake | 4 |
| Skylake-X | 4 |
| Tiger Lake | 3 |
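You can check whether the `@turbo` loop actually emits hardware gathers on your machine by inspecting the native code (a quick check, assuming the definitions above are already loaded); look for `vgatherqpd` (or `vgatherdpd`) instructions in the output:

```julia
using InteractiveUtils  # provides @code_native

# Dump the generated assembly for the @turbo version; on AVX2 targets the
# indexed loads show up as gather instructions such as vgatherqpd.
@code_native debuginfo=:none testSum_avx(indices, Vectors)
```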
Zen+, the architecture of both CPUs tested above (the Ryzen 2700 and the Ryzen 5 3500U are both Zen+), is very bad here and really ought to avoid gather instructions altogether. Haswell and Zen2 are bad, too. The Xeon Gold 6130 is Skylake-X, which is why the `@turbo` version runs faster there.
The fix in LoopVectorization will be to add an option to not vectorize a loop at all, and then have it skip vectorization on loops like this one, where we expect no speedup, or even a slowdown, due to the poor performance of the required instructions.
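Until that lands, one possible workaround (a sketch, not an official LoopVectorization feature, and `testSum_buffered` is a name I made up) is to do the irregular indexing with cheap scalar loads and let `@turbo` vectorize only the contiguous reduction; whether this pays for the extra buffers depends on the architecture and the problem size:

```julia
function testSum_buffered(indices, Vectors)
    A, B, C, D = Vectors
    n = length(indices)
    # Gather with scalar loads into contiguous scratch buffers
    # (these allocations could be hoisted out and reused by the caller).
    a = Vector{Float64}(undef, n)
    b = similar(a); c = similar(a); d = similar(a)
    @inbounds for j in 1:n
        k = indices[j]
        a[j] = A[k]; b[j] = B[k]; c[j] = C[k]; d[j] = D[k]
    end
    # The reduction over contiguous data vectorizes without gathers.
    s = 0.0
    @turbo for j in 1:n
        s += a[j] * b[j] + c[j] * d[j]
    end
    return s
end
```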
I'm reopening this. We can close it once LV gets good performance across the architectures.