LoopVectorization.jl

Ryzen dot product performance

Open · dextorious opened this issue 5 years ago · 1 comment

First of all, brilliant work on this package. Extremely impressive.

Since you mention in the README that you were curious about 1-fma Ryzen chips, here's data from a Ryzen 2700X running Julia 1.4.0-rc2.0 on Windows 10 x64, with the latest release (v0.6.21) of the package:
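For reference, the kernels being benchmarked are essentially the dot-product examples from the README. A sketch of the definitions (the exact code is in the README; these are reproduced here for context and may differ in detail):

using LoopVectorization

# Plain Julia dot product: relies on @inbounds @simd for vectorization.
function mydot(a, b)
    s = zero(eltype(a))
    @inbounds @simd for i in eachindex(a, b)
        s += a[i] * b[i]
    end
    s
end

# Same loop, but letting LoopVectorization's @avx macro generate the kernel.
function mydotavx(a, b)
    s = zero(eltype(a))
    @avx for i in eachindex(a, b)
        s += a[i] * b[i]
    end
    s
end

# "Single load" variants: sum of squares of one vector, so only one stream of loads.
function myselfdot(a)
    s = zero(eltype(a))
    @inbounds @simd for i in eachindex(a)
        s += a[i] * a[i]
    end
    s
end

function myselfdotavx(a)
    s = zero(eltype(a))
    @avx for i in eachindex(a)
        s += a[i] * a[i]
    end
    s
end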

# make sure we get enough samples
julia> BenchmarkTools.DEFAULT_PARAMETERS.samples = 1e12

julia> a = rand(256); b = rand(256);

julia> @btime mydot($a,$b)
  33.098 ns (0 allocations: 0 bytes)
55.14639163783218

julia> @btime mydotavx($a,$b)
  33.802 ns (0 allocations: 0 bytes)
55.14639163783219

julia> @btime myselfdot($a)
  22.066 ns (0 allocations: 0 bytes)
79.1990129346761

julia> @btime myselfdotavx($a)
  22.868 ns (0 allocations: 0 bytes)
79.1990129346761

julia> a = rand(255); b = rand(255);

julia> @btime mydot($a,$b)
  43.749 ns (0 allocations: 0 bytes)
65.04807877890542

julia> @btime mydotavx($a,$b)
  39.274 ns (0 allocations: 0 bytes)
65.04807877890542

julia> @btime myselfdot($a)
  38.608 ns (0 allocations: 0 bytes)
80.52310652021393

julia> @btime myselfdotavx($a)
  32.863 ns (0 allocations: 0 bytes)
80.52310652021396

So the single-load versions are always significantly faster, although the total throughput is always quite low compared to the numbers in your README, which I assume came from a quad-channel Skylake-X system or a comparable Xeon.

Let me know if you'd like any further info, whether the LLVM IR / native assembly, details of the hardware involved or for me to run any further benchmarks!

dextorious · Mar 19 '20 00:03

Thanks, I'll have to update the README. cscherrer posted similar results using his 2950X on Discourse. On the master branch of LoopVectorization (a few things have improved, but dot product performance should be identical*):

julia> BenchmarkTools.DEFAULT_PARAMETERS.samples = 1e12
1.0e12

julia> a = rand(256); b = rand(256);

julia> @btime mydot($a,$b)
  12.158 ns (0 allocations: 0 bytes)
61.406705504849626

julia> @btime mydotavx($a,$b)
  13.501 ns (0 allocations: 0 bytes)
61.406705504849626

julia> @btime myselfdot($a)
  8.976 ns (0 allocations: 0 bytes)
85.74999141741742

julia> @btime myselfdotavx($a)
  9.435 ns (0 allocations: 0 bytes)
85.74999141741742

julia> a = rand(255); b = rand(255);

julia> @btime mydot($a,$b)
  36.302 ns (0 allocations: 0 bytes)
61.46306840897695

julia> @btime mydotavx($a,$b)
  14.292 ns (0 allocations: 0 bytes)
61.46306840897695

julia> @btime myselfdot($a)
  28.851 ns (0 allocations: 0 bytes)
78.1853694948555

julia> @btime myselfdotavx($a)
  9.828 ns (0 allocations: 0 bytes)
78.18536949485554

julia> versioninfo()
Julia Version 1.5.0-DEV.463
Commit 94b29d5e98* (2020-03-15 19:17 UTC)
Platform Info:
  OS: Linux (x86_64-generic-linux)
  CPU: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, skylake)

(the chip is quad-channel, but all the benchmarks I've included in the docs should fit in cache)
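To put a number on "fits in cache" (my arithmetic; the 32 KiB L1d figure is a typical size, not something stated in the thread):

# Each 256-element Float64 vector is 2 KiB, so even both vectors together
# sit far below a typical 32 KiB L1 data cache.
sizeof(Float64) * 256      # 2048 bytes = 2 KiB per vector
2 * sizeof(Float64) * 256  # 4096 bytes = 4 KiB for a and b combined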

When the length is 256, we see essentially the same pattern: the single-load version is about 50% faster. From Agner Fog's instruction tables (page 100), while Ryzen's ymm vfmadd has half the throughput of xmm, its latency is unchanged. Additionally, it looks like ymm memory move instructions also have half the throughput (a reciprocal throughput of 1 instead of 0.5). So in terms of the relative throughput of memory and fma instructions, these Ryzen chips look the same as Haswell or newer Intel chips, while they do better (have a lower latency * throughput product) on fma instructions, suggesting they don't need as much unrolling.
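A sketch of the unrolling arithmetic implied above (the latency/throughput numbers are my reading of Agner Fog's tables, so treat them as assumptions):

# To keep the fma pipeline full, you need roughly
# latency / reciprocal_throughput independent accumulators.
accumulators_needed(latency, recip_throughput) = ceil(Int, latency / recip_throughput)

accumulators_needed(5, 1.0)  # Zen+ ymm vfmadd: latency ~5, recip. throughput ~1   => 5
accumulators_needed(4, 0.5)  # Skylake fma:     latency ~4, recip. throughput ~0.5 => 8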

The docs contain plots of performance for all lengths from 2 through 256 on my computer (see the "Stable docs" and "Latest docs" links). Contradicting my statement about identical performance, it looks like the latest plots regressed relative to stable; I'll have to look into that. The regular dot product peaked at under 44 GFLOPS, while the single-load versions peaked at about 60 GFLOPS. A single core of this CPU has a theoretical peak of 131.2 GFLOPS (minus overhead from reducing the accumulation vectors, etc.). MKL dgemm regularly exceeded 120 GFLOPS when the memory accesses are all aligned (i.e., when the stride is a multiple of 8).
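For context, here is one way to reconstruct that 131.2 GFLOPS figure (my arithmetic; the ~4.1 GHz AVX-512 clock is an assumption, since it isn't stated in the thread):

# Hypothetical single-core peak: 2 fma units x 8 Float64 lanes per 512-bit
# register x 2 flops per fma, at an assumed 4.1 GHz AVX-512 clock.
fma_units, lanes, flops_per_fma, clock_ghz = 2, 8, 2, 4.1
peak_gflops = fma_units * lanes * flops_per_fma * clock_ghz  # ≈ 131.2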

Meaning the dot product never actually hit more than 50% of the theoretical peak anyway. Even if Ryzen fmas had half the throughput relative to memory ops, would that significantly change the performance characteristics of dot products? I'm not sure. Given that the relative values seem to be about the same, I don't think there's any reason to model them differently. Let me know if you think there are any differences I should try to account for.

chriselrod · Mar 19 '20 15:03