
Increased execution time bug (Ryzen CPU only?)

Open SalmonLA opened this issue 4 years ago • 2 comments

Hi, I have found a slowdown when using the @turbo macro:

using LoopVectorization,BenchmarkTools

function testSum(indices, Vectors)
    A, B, C, D = Vectors
    sum = 0.
    for ksum in eachindex(indices)
        k = indices[ksum]
        sum += A[k] * B[k] + C[k] * D[k]
    end
    return sum
end

function testSum_avx(indices, Vectors)
    A, B, C, D = Vectors
    sum = 0.
    @turbo for ksum in eachindex(indices)
        k = indices[ksum]
        sum += A[k] * B[k] + C[k] * D[k]
    end
    return sum
end

N = 63
Vectors = Tuple(rand(N) for _ in 1:4);
indices = rand(1:N,500);

@btime testSum($indices,$Vectors)
@btime testSum_avx($indices,$Vectors)

The output is:

  510.417 ns (0 allocations: 0 bytes)
  871.698 ns (0 allocations: 0 bytes)

I found this behaviour on two Ryzen CPUs (Ryzen 2700 and Ryzen 5 3500U). On an Intel CPU (Xeon(R) Gold 6130), the avx version does run faster.

For reference, I have posted a question on this behaviour on https://discourse.julialang.org/t/loopvectorization-almost-doubles-execution-time/64333.

SalmonLA avatar Jul 09 '21 14:07 SalmonLA

To copy what I said on discourse: Regarding performance on different architectures, here is a table giving the reciprocal throughput of 256-bit gathers that load 4 x Float64 using 4 x Int64 indices. Lower is better; you can interpret it as the average number of clock cycles per completed gather when you're executing a lot of them. Latency gives the actual time for an individual instruction, but many instructions can execute in parallel, hence throughput is the more useful figure here.

| Arch | RThroughput |
| --- | --- |
| Zen+ | 12 |
| Zen2 | 9 |
| Zen3 | 4 |
| Haswell | 8 |
| Skylake | 4 |
| Skylake-X | 4 |
| Tiger Lake | 3 |

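As a rough back-of-envelope sketch (my own arithmetic, not from the thread, and assuming the loop is bottlenecked purely on gather throughput), the table's numbers translate into a large cycle-count gap for the 500-element loop above:

```julia
# Rough estimate: cycle cost of the gathers alone for one pass over the
# 500 indices, 4 lanes at a time, with 4 gathers per vector iteration
# (one each for A, B, C, and D).
n_iters  = cld(500, 4)        # 4-wide vector iterations
gathers  = 4 * n_iters        # gather instructions issued
zenplus  = gathers * 12       # RThroughput 12 on Zen+
skylakex = gathers * 4        # RThroughput 4 on Skylake-X
println("Zen+:      ~", zenplus,  " cycles of gather throughput")
println("Skylake-X: ~", skylakex, " cycles of gather throughput")
```

On this crude model Zen+ spends 3x as many cycles just issuing gathers, which is consistent with the vectorized version losing to the scalar loop there while winning on Skylake-X.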
Zen+ chips, like the two CPUs you tested on, are very bad here and really ought to avoid gather instructions entirely. Haswell and Zen2 are bad, too. The Xeon(R) Gold 6130 is Skylake-X, while the Ryzen 2700 and 3500U are both Zen+.

The fix in LoopVectorization will be to add an option to not vectorize code at all, and then have it skip vectorization on loops like this when the cost model predicts no speedup, or even a slowdown, from the instructions involved.
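Until that lands, a stand-in workaround on Zen+ (my own sketch, not the planned LoopVectorization API) is simply to keep the loop scalar and drop bounds checks with @inbounds, so no gather instructions are emitted at all:

```julia
# Workaround sketch for gather-hostile CPUs: a plain scalar loop with
# bounds checking removed. Indirect indexing stays scalar, so the CPU
# never executes a gather instruction.
function testSum_inbounds(indices, Vectors)
    A, B, C, D = Vectors
    s = 0.0
    @inbounds for ksum in eachindex(indices)
        k = indices[ksum]
        s += A[k] * B[k] + C[k] * D[k]
    end
    return s
end
```

This should match the original testSum's timing (or slightly beat it, since bounds checks are elided) on architectures where gathers are slow.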

chriselrod avatar Jul 09 '21 15:07 chriselrod

I'm reopening this. We can close it once LV gets good performance across the architectures.

chriselrod avatar Jul 16 '21 14:07 chriselrod