LoopVectorization.jl
Plot against theoretical peak performance
I started with these plots:
https://juliasimd.github.io/LoopVectorization.jl/stable/examples/matrix_vector_ops/
The issue is that I can't tell from the plot if the performance is already the best you can do, or if there is still room for improvement. To answer that question, I like to plot clock cycles (per loop body / array element) on the vertical axis, and also plot the theoretical performance peak into the same graph. Let's do that for the very first benchmark:
julia> function jgemvavx!(𝐲, 𝐀, 𝐱)
           @turbo for i ∈ eachindex(𝐲)
               𝐲ᵢ = zero(eltype(𝐲))
               for j ∈ eachindex(𝐱)
                   𝐲ᵢ += 𝐀[i,j] * 𝐱[j]
               end
               𝐲[i] = 𝐲ᵢ
           end
       end
jgemvavx! (generic function with 1 method)
julia> N = 64
64
julia> y = rand(N);
julia> A = rand(N, N);
julia> x = rand(N);
julia> @benchmark jgemvavx!(y, A, x)
BenchmarkTools.Trial: 10000 samples with 199 evaluations.
Range (min … max):  427.970 ns … 610.970 ns  ┊ GC (min … max): 0.00% … 0.00%
Time  (median):     428.603 ns               ┊ GC (median):    0.00%
Time  (mean ± σ):   430.210 ns ±   9.385 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
428 ns Histogram: log(frequency) by time 463 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> cpu_freq = 3.2e9 # Hz (3.2 GHz)
3.2e9
julia> 427.136e-9 * cpu_freq / N^2
0.3337
So for N=64, using double precision, it seems the body of the loop takes 0.333 clock cycles.
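For reference, here is that conversion as a tiny helper (my own snippet; the fixed 3.2 GHz clock is an assumption about the M1's sustained frequency):
cycles_per_elem(t_ns, N; freq = 3.2e9) = t_ns * 1e-9 * freq / N^2   # cycles per loop-body evaluation
cycles_per_elem(427.970, 64)   # ≈ 0.334 for the minimum time above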
I use an Apple M1. I installed the macOS Julia; I assume that is Intel-based? Somehow it works for me.
Regarding the theoretical peak, the loop body is just:
𝐲ᵢ += 𝐀[i,j] * 𝐱[j]
Everything else, I think, gets amortized. So it is two memory reads from L1 cache, each taking 0.1665 clock cycles per double (ldr q0, [x1] has a reciprocal throughput of 0.333), and one fma, which takes 0.125 (fmla.2d v0, v0, v0 takes 0.25). The bottleneck in this case is the memory reads, for a total of 0.333 clock cycles. The fma runs at the same time, so we do not count it.
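To make the bookkeeping explicit, the same model written as a throwaway function (my own sketch, using the reciprocal throughputs quoted above):
peak_cycles(nloads, nfmas; load_rt = 0.1665, fma_rt = 0.125) =
    max(nloads * load_rt, nfmas * fma_rt)   # cycles per loop body: whichever unit saturates first
peak_cycles(2, 1)   # naive body: two loads + one fma -> 0.333, load bound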
So according to my analysis, the code runs at 100% of the theoretical peak. Which sounds too good to be true. I am worried I made some mistake in my analysis somewhere. But I am posting what I have so far.
Using the same analysis, for N=128 I get
julia> 1.700e-6 * cpu_freq / N^2
0.33203125
For N=256:
julia> 10.083e-6 * cpu_freq / N^2
0.492333984375
The L1 cache is 320 KB, the L2 cache is 12 MB. No L3 cache. The N=128 matrix takes:
julia> 128^2 * 8 / 1024
128.0
128 KB, so it fits in L1. The N=256 matrix takes:
julia> 256^2 * 8 / 1024
512.0
So it doesn't completely fit.
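As a quick check of the working-set sizes involved (x and y are negligible next to A):
for N in (64, 128, 256)
    println("N = $N: A takes $(N^2 * 8 / 1024) KiB, x and y another $(2N * 8 / 1024) KiB")
end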
My understanding is that this is double precision:
julia> typeof(A)
Matrix{Float64} (alias for Array{Float64, 2})
So I can't see any mistakes in my analysis. How can I see the assembly code that is actually being run? I tried:
julia> @code_native jgemvavx!(y, A, x)
.section __TEXT,__text,regular,pure_instructions
; ┌ @ REPL[52]:1 within `jgemvavx!'
pushq %r14
pushq %rbx
subq $56, %rsp
movq %rsi, %rbx
xorps %xmm0, %xmm0
movaps %xmm0, 16(%rsp)
movaps %xmm0, (%rsp)
movq $0, 32(%rsp)
movq %rsi, 48(%rsp)
movabsq $jl_get_ptls_states_fast, %rax
callq *%rax
movq %rax, %r14
movq $12, (%rsp)
movq (%rax), %rax
...
movq %rax, 32(%rsp)
movq %rbx, 24(%rsp)
movq %r10, 16(%rsp)
movabsq $"macro expansion;", %rax
callq *%rax
movq 8(%rsp), %rax
movq %rax, (%r14)
movabsq $jl_system_image_data, %rax
addq $56, %rsp
popq %rbx
popq %r14
retq
But it prints x86 assembly. And it doesn't print the actual function, just the driver it seems.
I found one worrying case:
julia> N=130
130
julia> y = rand(N); A = rand(N,N); x = rand(N);
julia> @benchmark jgemvavx!(y, A, x)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
Range (min … max):  1.633 μs … 3.646 μs   ┊ GC (min … max): 0.00% … 0.00%
Time  (median):     1.663 μs              ┊ GC (median):    0.00%
Time  (mean ± σ):   1.680 μs ± 82.119 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
1.63 μs Histogram: log(frequency) by time 2.09 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> 1.633e-6 * cpu_freq / N^2
0.309207100591716
It gets faster than the peak. That shouldn't be possible. Although I seem to be getting similar behavior with gfortran below, so perhaps the CPU frequency got higher temporarily.
I also tested GFortran:
program test_gemv
implicit none
integer, parameter :: N = 64, Niter = 100000, dp = kind(0.d0)
real(dp) :: A(N,N), x(N), y(N)
real(dp) :: t1, t2, cpu_freq
integer :: m
A = 1
x = 1
call cpu_time(t1)
do m = 1, Niter
call gemv(y, A, x)
end do
call cpu_time(t2)
print *, y(1)
print *, (t2-t1)/Niter * 1e9_dp, "ns"
cpu_freq = 3.2e9_dp
print *, (t2-t1)/Niter / N**2 * cpu_freq, "clock cycles"
contains
subroutine gemv(y, A, x)
real(dp), intent(out) :: y(:)
real(dp), intent(in) :: A(:,:), x(:)
integer :: i, j, yi
do i = 1, size(y)
yi = 0
do j = 1, size(x)
yi = yi + A(j,i) * x(j)
end do
y(i) = yi
end do
end subroutine
end program
But it prints:
$ gfortran -O3 -ffast-math -funroll-loops a.f90
$ ./a.out
...
10.541718750000001 clock cycles
Which is way too slow, so I am not quite sure what is going on. I tried the intrinsic matmul:
--- a/a.f90
+++ b/a.f90
@@ -24,14 +23,7 @@ contains
subroutine gemv(y, A, x)
real(dp), intent(out) :: y(:)
real(dp), intent(in) :: A(:,:), x(:)
- integer :: i, j, yi
- do i = 1, size(y)
- yi = 0
- do j = 1, size(x)
- yi = yi + A(j,i) * x(j)
- end do
- y(i) = yi
- end do
+ y = matmul(A, x)
end subroutine
end program
And that gives:
N       clock cycles
32      0.316
64      0.298
128     0.315
256     0.417
512     0.416
1024    0.442
2048    1.140
I don't trust these numbers yet. To make this reliable, I have to hook it into my benchmark driver, where I compile the kernel as a separate unit and then just link it into the driver. That way I know for sure the compiler didn't apply some optimization across the repetition loop.
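A minimal sketch of that setup (all file and symbol names here are hypothetical, and it assumes the kernel is exported with an explicit bind(c) interface so it can be ccall'ed directly):
# Compile the kernel alone into a shared library, then time it from Julia,
# so gfortran cannot fold or hoist anything across the repetition loop.
using BenchmarkTools
run(`gfortran -O3 -ffast-math -funroll-loops -shared -fPIC gemv_kernel.f90 -o libgemv.dylib`)
N = 64
A = rand(N, N); x = rand(N); y = zeros(N)
t = @belapsed ccall((:gemv_kernel, "./libgemv.dylib"), Cvoid,
                    (Ref{Cint}, Ptr{Float64}, Ptr{Float64}, Ptr{Float64}),
                    $(Ref(Cint(N))), $A, $x, $y)
println(t * 3.2e9 / N^2, " clock cycles per loop body")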
Found the mistake:
--- a/a.f90
+++ b/a.f90
@@ -24,7 +24,8 @@ contains
subroutine gemv(y, A, x)
real(dp), intent(out) :: y(:)
real(dp), intent(in) :: A(:,:), x(:)
- integer :: i, j, yi
+ integer :: i, j
+ real(dp) :: yi
do i = 1, size(y)
yi = 0
do j = 1, size(x)
Now I am getting 0.499 clock cycles for N=64.
I use an Apple M1. I installed the macOS Julia; I assume that is Intel-based? Somehow it works for me.
Yes. You can build Julia 1.7 and 1.8 from source for ARM on macOS, but there are a few issues. By far the most important issue is this. The SVD mentioned in the issue title is just one example. It is pervasive; segmentation faults are common. Many packages, including for example LoopVectorization.jl and Distributions.jl, cannot complete their tests on Mac ARM without segfaulting.
I installed an AArch64 Linux VM on my M1 Mac Mini, and LoopVectorization's tests passed following this commit.
Keno suspects the problem exists with Linux ARM as well (i.e. that the large code model is inappropriate for ARM generally), but that macOS reorganizes memory much more aggressively, resulting in the symptoms manifesting themselves much more frequently. I haven't used Julia in the Linux VM enough to see a segfault, and Julia is tested on Linux AArch64.
Keno bought a really nice 16 inch MacBook Pro earlier this week, so maybe he'll get the small code model to work.
But this is all so far out of my expertise (I only have the roughest idea of what large vs. small code model even means) that I can't comment much.
If you're using Rosetta, you'll (a) be missing out on FMA instructions (NEON has them) and (b) only have 16 named vector registers (NEON has 32), because Rosetta acts like a Westmere CPU.
You'll of course notice that the assembly you did see was all x64.
I'm still working on a reply w/ respect to the actual gemv.
Thanks @chriselrod, that explains everything regarding the current status of the tooling. I realize the M1 might not be ideal yet for Julia. I really like it for benchmarking otherwise, as it gives reliable performance, the CPU always runs at 3.2 GHz, etc. I also like it for development, as it compiles really quickly.
On the Apple M1, this is what I get, Rosetta:
julia> @benchmark jgemvavx!($y, $A, $x)
BenchmarkTools.Trial: 10000 samples with 199 evaluations.
Range (min … max):  414.990 ns … 465.241 ns  ┊ GC (min … max): 0.00% … 0.00%
Time  (median):     415.829 ns               ┊ GC (median):    0.00%
Time  (mean ± σ):   416.694 ns ±   3.688 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
415 ns Histogram: frequency by time 430 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> versioninfo()
Julia Version 1.6.1
Commit 6aaedecc44 (2021-04-23 05:59 UTC)
Platform Info:
OS: macOS (x86_64-apple-darwin18.7.0)
CPU: Apple M1
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-11.0.1 (ORCJIT, westmere)
Note the use of $. This causes the benchmark to treat the arguments as local variables instead of globals.
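If you want to see the difference it makes, you can time the same call both ways (just a quick comparison; for a kernel this small, the global-variable lookup overhead is measurable):
using BenchmarkTools
b_global = @benchmark jgemvavx!(y, A, x)      # y, A, x looked up as untyped globals
b_interp = @benchmark jgemvavx!($y, $A, $x)   # values interpolated, treated as locals
minimum(b_global).time, minimum(b_interp).time   # minimum times, in ns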
Running Julia natively might be unreliable, as described above, but it's reliable enough for a few simple benchmarks:
julia> @benchmark jgemvavx!($y, $A, $x)
BenchmarkTools.Trial: 10000 samples with 308 evaluations.
Range (min … max):  272.997 ns … 357.143 ns  ┊ GC (min … max): 0.00% … 0.00%
Time  (median):     273.539 ns               ┊ GC (median):    0.00%
Time  (mean ± σ):   274.313 ns ±   3.552 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
273 ns Histogram: log(frequency) by time 289 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> cpu_freq = 3.2e9 # Hz (3.2 GHz)
3.2e9
julia> 273e-9 * cpu_freq / N^2
0.21328125
julia> versioninfo()
Julia Version 1.8.0-DEV.901
Commit 47255f9fe8* (2021-11-06 19:56 UTC)
Platform Info:
OS: macOS (arm64-apple-darwin21.1.0)
CPU: Apple M1
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-12.0.1 (ORCJIT, cyclone)
Environment:
JULIA_NUM_THREADS = 4
So 0.213 cycles/iteration.
The AArch64 assembly (I'll describe how to get it in my next response) should explain how this is possible:
L60:
ldr d16, [x3, x10, lsl #3]
ldp q17, q18, [x11]
add x10, x10, #1 ; =1
cmp x10, x1
fmla v0.2d, v17.2d, v16.d[0]
ldp q19, q17, [x11, #32]
fmla v1.2d, v18.2d, v16.d[0]
fmla v2.2d, v19.2d, v16.d[0]
ldp q18, q19, [x11, #64]
fmla v3.2d, v17.2d, v16.d[0]
fmla v4.2d, v18.2d, v16.d[0]
ldp q17, q18, [x11, #96]
fmla v5.2d, v19.2d, v16.d[0]
add x11, x11, x5
fmla v6.2d, v17.2d, v16.d[0]
fmla v7.2d, v18.2d, v16.d[0]
b.lt L60
One iteration of the loop featured:
- 1x load with immediate offset (1 load)
- 4x load pairs (8 loads)
- 8x fma
The 8 fmas are of course NEON fmas, meaning they operate on 128-bit registers. Thus these 8 fmas perform 16 loop iterations.
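As a quick sanity check on those counts (plain arithmetic, read off the loop above):
fmas      = 8            # fmla.2d instructions, each covering 2 Float64s
elements  = 2 * fmas     # 16 elements of 𝐲 advanced per trip through the loop
a_doubles = 4 * 2 * 2    # 4 ldp × 2 q registers × 2 doubles = 16 doubles of 𝐀
x_doubles = 1            # the lone ldr of 𝐱[j], reused via the .d[0] lane
(a_doubles + x_doubles) / elements   # ≈ 1.06 doubles loaded per element, vs. 2 in the scalar body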
Cool, I can reproduce your numbers using Rosetta:
julia> N=64
64
julia> y = rand(N); A = rand(N,N); x = rand(N);
julia> @benchmark jgemvavx!(y, A, x)
BenchmarkTools.Trial: 10000 samples with 199 evaluations.
Range (min … max):  427.764 ns … 693.467 ns  ┊ GC (min … max): 0.00% … 0.00%
Time  (median):     436.347 ns               ┊ GC (median):    0.00%
Time  (mean ± σ):   437.712 ns ±   9.054 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
428 ns Histogram: log(frequency) by time 460 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark jgemvavx!($y, $A, $x)
BenchmarkTools.Trial: 10000 samples with 199 evaluations.
Range (min … max):  417.714 ns … 680.905 ns  ┊ GC (min … max): 0.00% … 0.00%
Time  (median):     418.342 ns               ┊ GC (median):    0.00%
Time  (mean ± σ):   421.108 ns ±   8.790 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
418 ns Histogram: log(frequency) by time 442 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
I can't get it below 418 ns, but that's close enough to your 415 ns.
I am analyzing your ARM assembly loop now.
Thanks @chriselrod, that explains everything regarding the current status of the tooling. I realize the M1 might not be ideal yet for Julia. I really like it for benchmarking otherwise, as it gives reliable performance, the CPU always runs at 3.2 GHz, etc. I also like it for development, as it compiles really quickly.
Native Julia on the M1 is reliable enough for light benchmarking (and if you can set up an AArch64 Linux VM, or happen to already have one, that seems fine in practice). Building Julia from source is fairly easy if you already have compiler tools. Most binary dependencies should be downloaded (I believe even on Apple M1, but it's been a while -- maybe it compiled them some time ago and I've forgotten), and from there compilation is very fast, as you note. Something like 50% faster than any of my x64 machines. Of course, if you go that route, you'd need at least Julia 1.7 for things to work at all. I'd just use master.
EDIT: FWIW, my M1 is on a Mini. Maybe it is better cooled/able to clock slightly higher? Not that cooling is normally an issue for the M1.
I use the M1 MacBook Pro. The fan never turns on. The 8 core M1 laptop is about as fast for C++ compilation as my beefy 48 core Intel machine.
Does the assembly code you provided correspond to this inner loop?
for j ∈ eachindex(𝐱)
    𝐲ᵢ += 𝐀[i,j] * 𝐱[j]
end
I see 8 fmas, each operates on two doubles, so it is unrolled 16 times. Each fma takes 0.25 cycles. So the cost of just one loop body is 0.125 for the single-double fma operation. I see 4 ldp, each loads two doubles, so it loads 8 doubles. Shouldn't there be 16 loads? I assume those are loads for x[j]. What about A[i,j] (I assume that is the ldr load)? I would expect another 16 single-double loads.
In Julia, the last index is contiguous?
In Julia, the last index is contiguous?
The first index, like Fortran.
The issue is that I can't tell from the plot if the performance is already the best you can do, or if there is still room for improvement
It's hard to say. If we only look at the execution units, the CPU I ran on is capable of about 120 GFLOPS/core, depending on the clock speeds I have it set at. For these benchmarks, I think it was 3.7 or 3.8 GHz, which would put it in the area of 118-122 peak theoretical GFLOP if only looking at the FMA units. The matrix-multiply plots' peaks reached roughly these values, so that's probably about right.
Of course, only gemm comes close to being limited by execution units. Everything else is limited by memory, so focusing only on GFLOPS is perhaps an unrealistic target of "what we wish we could achieve."
I do like the view of cycles/loop iteration though.
The L1 cache is 320 KB
The data cache however is "only" 192 KiB. Compared to the 32 KiB of most Intel/AMD CPUs. Intel Ice Lake and newer have 48 KiB.
So I can't see any mistakes in my analysis. How can I see the assembly code that is actually being run? I tried:
I use Cthulhu.jl. LoopVectorization currently works by compiling a function called _turbo_!. This function often isn't inlined.
Cthulhu lets you descend into a chain of function calls.
julia> using LoopVectorization, Cthulhu
julia> function jgemvavx!(𝐲, 𝐀, 𝐱)
           @turbo for i ∈ eachindex(𝐲)
               𝐲ᵢ = zero(eltype(𝐲))
               for j ∈ eachindex(𝐱)
                   𝐲ᵢ += 𝐀[i,j] * 𝐱[j]
               end
               𝐲[i] = 𝐲ᵢ
           end
       end
jgemvavx! (generic function with 1 method)
julia> N = 64
64
julia> y = rand(N);
julia> A = rand(N, N);
julia> x = rand(N);
julia> @descend jgemvavx!(y, A, x)
You should now see something a little like:
36 â %48 = $(Expr(:foreigncall, :(:jl_array_ptr), Ptr{Float64}, svec(Any), 0, :(:ccall), Core.Argument(4)))::Ptr{Float64}
â %49 = $(Expr(:foreigncall, :(:jl_array_ptr), Ptr{Float64}, svec(Any), 0, :(:ccall), Core.Argument(2)))::Ptr{Float64}
â %50 = $(Expr(:gc_preserve_begin, Core.Argument(3), Core.Argument(4), Core.Argument(2)))
â invoke LoopVectorization._turbo_!($(QuoteNode(Val{(false, 0, 0, 0, false, 8, 64, 32, 64, 32768, 1048576, 20185088, 0x0000000000000001)}()))::Val{(false, 0, 0, 0, false, 8, 64, 32, 64, 32768, 1048576, 20185088, 0x0000000000000001)}, $(QuoteNode(Val{(:numericconstant, Symbol("###zero###5###"), LoopVectorization.OperationStruct(0x00000000000000000000000000000001, 0x00000000000000000000000000000000, 0x00000000000000000000000000000002, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.constant, 0x0001, 0x00), :LoopVectorization, :getindex, LoopVectorization.OperationStruct(0x00000000000000000000000000000012, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.memload, 0x0002, 0x01), :LoopVectorization, :getindex, LoopVectorization.OperationStruct(0x00000000000000000000000000000002, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.memload, 0x0003, 0x02), :LoopVectorization, :vfmadd_fast, LoopVectorization.OperationStruct(0x00000000000000000000000000000012, 0x00000000000000000000000000000002, 0x00000000000000000000000000000000, 0x00000000000000000000000200030001, 0x00000000000000000000000000000000, LoopVectorization.compute, 0x0001, 0x00), :LoopVectorization, :identity, LoopVectorization.OperationStruct(0x00000000000000000000000000000001, 0x00000000000000000000000000000002, 0x00000000000000000000000000000000, 0x00000000000000000000000000000004, 0x00000000000000000000000000000000, LoopVectorization.compute, 0x0001, 0x00), :LoopVectorization, :setindex!, LoopVectorization.OperationStruct(0x00000000000000000000000000000001, 0x00000000000000000000000000000002, 0x00000000000000000000000000000000, 0x00000000000000000000000000000005, 0x00000000000000000000000000000000, LoopVectorization.memstore, 0x0004, 0x03))}()))::Val{(:numericconstant, Symbol("###zero###5###"), LoopVectorization.OperationStruct(0x00000000000000000000000000000001, 0x00000000000000000000000000000000, 0x00000000000000000000000000000002, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.constant, 0x0001, 0x00), :LoopVectorization, :getindex, LoopVectorization.OperationStruct(0x00000000000000000000000000000012, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.memload, 0x0002, 0x01), :LoopVectorization, :getindex, LoopVectorization.OperationStruct(0x00000000000000000000000000000002, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.memload, 0x0003, 0x02), :LoopVectorization, :vfmadd_fast, LoopVectorization.OperationStruct(0x00000000000000000000000000000012, 0x00000000000000000000000000000002, 0x00000000000000000000000000000000, 0x00000000000000000000000200030001, 0x00000000000000000000000000000000, LoopVectorization.compute, 0x0001, 0x00), :LoopVectorization, :identity, LoopVectorization.OperationStruct(0x00000000000000000000000000000001, 0x00000000000000000000000000000002, 0x00000000000000000000000000000000, 0x00000000000000000000000000000004, 0x00000000000000000000000000000000, LoopVectorization.compute, 0x0001, 0x00), :LoopVectorization, :setindex!, 
LoopVectorization.OperationStruct(0x00000000000000000000000000000001, 0x00000000000000000000000000000002, 0x00000000000000000000000000000000, 0x00000000000000000000000000000005, 0x00000000000000000000000000000000, LoopVectorization.memstore, 0x0004, 0x03))}, $(QuoteNode(Val{(LoopVectorization.ArrayRefStruct{:ð, Symbol("##vptr##_ð")}(0x00000000000000000000000000000101, 0x00000000000000000000000000000102, 0x00000000000000000000000000000000, 0x00000000000000000000000000000101), LoopVectorization.ArrayRefStruct{:ðą, Symbol("##vptr##_ðą")}(0x00000000000000000000000000000001, 0x00000000000000000000000000000002, 0x00000000000000000000000000000000, 0x00000000000000000000000000000001), LoopVectorization.ArrayRefStruct{:ðē, Symbol("##vptr##_ðē")}(0x00000000000000000000000000000001, 0x00000000000000000000000000000001, 0x00000000000000000000000000000000, 0x00000000000000000000000000000001))}()))::Val{(LoopVectorization.ArrayRefStruct{:ð, Symbol("##vptr##_ð")}(0x00000000000000000000000000000101, 0x00000000000000000000000000000102, 0x00000000000000000000000000000000, 0x00000000000000000000000000000101), LoopVectorization.ArrayRefStruct{:ðą, Symbol("##vptr##_ðą")}(0x00000000000000000000000000000001, 0x00000000000000000000000000000002, 0x00000000000000000000000000000000, 0x00000000000000000000000000000001), LoopVectorization.ArrayRefStruct{:ðē, Symbol("##vptr##_ðē")}(0x00000000000000000000000000000001, 0x00000000000000000000000000000001, 0x00000000000000000000000000000000, 0x00000000000000000000000000000001))}, $(QuoteNode(Val{(0, (), (), (), (), ((1, LoopVectorization.IntOrFloat),), ())}()))::Val{(0, (), (), (), (), ((1, LoopVectorization.IntOrFloat),), ())}, $(QuoteNode(Val{(:i, :j)}()))::Val{(:i, :j)}, $(QuoteNode(Val{Tuple{Tuple{CloseOpenIntervals.CloseOpen{Static.StaticInt{0}, Int64}, CloseOpenIntervals.CloseOpen{Static.StaticInt{0}, Int64}}, Tuple{LayoutPointers.GroupedStridedPointers{Tuple{Ptr{Float64}, Ptr{Float64}, Ptr{Float64}}, (1, 1, 1), (0, 0, 0), ((1, 2), (1,), (1,)), ((1, 2), (3,), (4,)), Tuple{Static.StaticInt{8}, Int64, Static.StaticInt{8}, Static.StaticInt{8}}, NTuple{4, Static.StaticInt{0}}}}}}()))::Val{Tuple{Tuple{CloseOpenIntervals.CloseOpen{Static.StaticInt{0}, Int64}, CloseOpenIntervals.CloseOpen{Static.StaticInt{0}, Int64}}, Tuple{LayoutPointers.GroupedStridedPointers{Tuple{Ptr{Float64}, Ptr{Float64}, Ptr{Float64}}, (1, 1, 1), (0, 0, 0), ((1, 2), (1,), (1,)), ((1, 2), (3,), (4,)), Tuple{Static.StaticInt{8}, Int64, Static.StaticInt{8}, Static.StaticInt{8}}, NTuple{4, Static.StaticInt{0}}}}}}, %4::Int64, %22::Int64, %37::Ptr{Float64}, %48::Ptr{Float64}, %49::Ptr{Float64}, %41::Int64)::Any
â $(Expr(:gc_preserve_end, :(%50))) â
ââââ return nothing â
Select a call to descend into or ↩ to ascend. [q]uit. [b]ookmark.
Toggles: [o]ptimize, [w]arn, [h]ide type-stable statements, [d]ebuginfo, [r]emarks, [i]nlining costs, [t]ype annotations, [s]yntax highlight for Source/LLVM/Native.
Show: [S]ource code, [A]ST, [T]yped code, [L]LVM IR, [N]ative code
Actions: [E]dit source code, [R]evise and redisplay
Advanced: dump [P]arams cache.
 • %51 = invoke _turbo_!(::Val{…},::Val{…},::Val{…},::Val{…},::Val{…},::Val{…},::Int64{…},::Int64{…},::Ptr{…},::Ptr{…},::Ptr{…},::Int64{…})::Any
   ↩
If there were multiple non-inlined functions, you could navigate the dot currently at %51 up and down to select whichever one you chose. Currently, the only two navigation options are %51 to descend into _turbo_! and ↩ to ascend.
There are also a lot of other options listed under toggles, show, and actions.
E.g., you can turn Julia-level optimizations on/off with o, highlight type instabilities with w, etc. L and N let us look at the LLVM and native code.
So by pressing [enter] with the dot on %51 = invoke _turbo_!, and hitting N, I get something like the following (I actually had to hit [d] twice to toggle off the debug information, and then N again to return to the native code view):
L96:
vbroadcastsd zmm8, qword ptr [rcx + 8*rbx]
vfmadd231pd zmm0, zmm8, zmmword ptr [rax] # zmm0 = (zmm8 * mem) + zmm0
vfmadd231pd zmm1, zmm8, zmmword ptr [rax + 64] # zmm1 = (zmm8 * mem) + zmm1
vfmadd231pd zmm2, zmm8, zmmword ptr [rax + 128] # zmm2 = (zmm8 * mem) + zmm2
vfmadd231pd zmm3, zmm8, zmmword ptr [rax + 192] # zmm3 = (zmm8 * mem) + zmm3
vfmadd231pd zmm4, zmm8, zmmword ptr [rax + 256] # zmm4 = (zmm8 * mem) + zmm4
vfmadd231pd zmm5, zmm8, zmmword ptr [rax + 320] # zmm5 = (zmm8 * mem) + zmm5
vfmadd231pd zmm6, zmm8, zmmword ptr [rax + 384] # zmm6 = (zmm8 * mem) + zmm6
vfmadd231pd zmm7, zmm8, zmmword ptr [rax + 448] # zmm7 = (zmm8 * mem) + zmm7
inc rbx
add rax, r9
cmp rbx, rsi
jl L96
This is similar to the ARM assembly above: load from x once, and then do 8 loads from A, using them each for an fma.
This is the approach it takes: unroll (and SIMD) the i loop aggressively, because you can reuse a load from x for every iteration of the i loop.
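A rough plain-Julia sketch of that strategy (just an illustration of the idea, not what LV actually emits; it assumes length(y) is a multiple of 8):
function gemv_blocked!(y, A, x)
    @inbounds for i in 1:8:length(y)
        acc = ntuple(_ -> 0.0, Val(8))           # 8 running sums, ideally kept in SIMD registers
        for j in eachindex(x)
            xj = x[j]                            # one load of x, reused 8 times
            acc = ntuple(k -> muladd(A[i+k-1, j], xj, acc[k]), Val(8))
        end
        for k in 1:8
            y[i+k-1] = acc[k]
        end
    end
    return y
end
The eight accumulators play the same role as the eight zmm (or NEON v) registers in the assembly above.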
FWIW, that gets me
julia> @benchmark jgemvavx!($y, $A, $x)
BenchmarkTools.Trial: 10000 samples with 852 evaluations.
Range (min … max):  141.597 ns … 214.688 ns  ┊ GC (min … max): 0.00% … 0.00%
Time  (median):     142.251 ns               ┊ GC (median):    0.00%
Time  (mean ± σ):   142.467 ns ±   1.163 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
142 ns Histogram: frequency by time 145 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> cpu_freq = 3.6e9; # set in the bios
julia> 141.597e-9 * cpu_freq / N^2
0.12445048828125
The zmm register is 512-bit, so it can hold 512/64 = 8 doubles. What is the reciprocal throughput of the vfmadd231pd instruction on your platform? It seems that is 1 cycle for Ice Lake. So that is 0.125 per double, as fast as the M1. However, your instruction also reads from memory, and I am not sure what the cost is there. Assuming the theoretical peak is 0.125 cycles: you are measuring 0.1245 clock cycles per double, so it is running essentially at 100% of peak?
I use the M1 MacBook Pro. The fan never turns on. The 8 core M1 laptop is about as fast for C++ compilation as my beefy 48 core Intel machine.
This would definitely be true for building Julia. Almost all of the parallelizable parts of building Julia (e.g., LLVM, OpenBLAS) have been replaced with binary downloads. This leaves the C/C++ within Julia itself, and then the majority remaining is single threaded.
Does the assembly code you provided correspond to this inner loop?
Yes, although LV will often rearrange loops.
I see 8 fmas, each operates on two doubles, so it is unrolled 16 times. Each fma takes 0.25 cycles. So the cost of just one loop body is 0.125 for the single double fma operation. I see 4 ldp, each loads two doubles, so it loads 8 doubles.
I'm pretty sure the ldps are loading into q registers, which are 128 bits on ARMv8, not 64 like in x64. An x64 word is 16 bits, vs 32 bits on ARM, so the double/quadwords of ARM are also twice as large. Thus each ldp is loading two 128-bit quadwords, which are then reinterpreted as vectors of 2x Float64 by the fma.
The ldps are loading from the matrix A.
The ldr must be acting like an x86 broadcast here, loading from x.
So basically: broadcast one element from x, and then reuse it for 16 elements of A, before moving on to the next iteration.
What is the reciprocal throughput of the vfmadd231pd instruction on your platform? It seems that is 1 cycle for IceLake.
0.5 cycles. My platform is Cascade Lake, but I also have a Tiger Lake, where it is 1 cycle (like Ice Lake). So I think it is only running at about 1/2 of peak.
However your instruction also reads from memory, I am not sure what the cost is there.
https://uops.info/html-instr/VFMADD231PD_ZMM_ZMM_M512.html
Theoretically 0.5 ("calculated from port usage"), but measured at 0.55. However, this is thanks to parallelism of the micro-ops; this one will compete with other loads.
uiCA is a great tool for Intel architectures. It estimates that the inner loop should take 4.5 cycles on Cascade Lake, or 8 on Tiger Lake.
These correspond to about 0.07 and 0.125 cycles/iteration, meaning I am well behind its theoretical prediction here.
I also tested GFortran:
Here is the Fortran code I used for the benchmarks. Simple improvements/corrections are welcome, but the point of course is seeing how well the compilers transform loops, so improvements shouldn't apply the transformations by hand, but rather encourage the compiler to do them.
This file is the code I used to compile them (you can see the flags there, which I did try to tune), as well as the wrappers to call and benchmark them from Julia. Specifically, I used
if (Sys.ARCH === :x86_64)
run(`gfortran -Ofast -march=native -funroll-loops -mprefer-vector-width=$(8*REGISTER_SIZE) -fvariable-expansion-in-unroller --param max-variable-expansions-in-unroller=4 -shared -fPIC $ffile -o $LIBFTEST`)
else
run(`gfortran -Ofast -march=native -funroll-loops -fvariable-expansion-in-unroller --param max-variable-expansions-in-unroller=4 -shared -fPIC $ffile -o $LIBFTEST`)
end
Which is probably already getting fancier with the flags than most, but if there's more that can be done to help, I'm open to it.
Flags like -floop-nest-optimize seemed to make performance worse, if they had any impact at all.
Intel flags:
ifort -fast -qopt-zmm-usage=high -qoverride-limits -shared -fPIC $ffile -o $LIBIFTEST
Options like -qopt-zmm-usage=high (and the mprefer-vector-width from Fortran) of course reflect that I was running these benchmarks on a system with AVX512.
Ok, so the vectorization goes roughly like this if I understand it correctly. You start from:
real(dp) :: yi
do i = 1, size(y)
    yi = 0
    do j = 1, size(x)
        yi = yi + A(i,j) * x(j)
    end do
    y(i) = yi
end do
Then you vectorize the outer loop:
real(dp) :: yi(16)
do i = 1, size(y), 16
    do k = 1, 16
        yi(k) = 0
        do j = 1, size(x)
            yi(k) = yi(k) + A(i+k-1,j) * x(j)
        end do
        y(i+k-1) = yi(k)
    end do
end do
Then you move the k loop into the inner loop:
real(dp) :: yi(16)
do i = 1, size(y), 16
    yi = 0
    do j = 1, size(x)
        do k = 1, 16
            yi(k) = yi(k) + A(i+k-1,j) * x(j)
        end do
    end do
    y(i:i+15) = yi
end do
I haven't tested it; hopefully I didn't make a mistake. Now the inner loop is:
yi(k) = yi(k) + A(i+k-1,j) * x(j)
And the cost is 1 fma and 1 memory read (R). For now we assume everything else gets amortized (if not, we'll have to add it to the peak).
On the M1, the speeds are (in clock cycles per double):
- fma: 0.125
- R: 0.1665
So the bottleneck is the memory read, and the theoretical performance peak should be 0.1665. You are getting 0.21328125, so that is 78% of peak. That is very impressive.
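In the same max(load, fma) accounting as before (my numbers, per element of y):
peak = max(1 * 0.1665, 1 * 0.125)   # one load of A plus one fma per element; x(j) is reused
peak / 0.21328125                    # ≈ 0.78, i.e. about 78% of this load-bound peak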
Btw, the last manually unrolled version runs at 0.305 clock cycles with gfortran, essentially as fast as the intrinsic matmul. Here is the assembly code for the inner loop:
L5:
mov x15, x16
ldr d31, [x17, 56]
ldp d3, d26, [x17, -8]
ldp d30, d29, [x17, 8]
ldr d8, [x15], 8
ldp d28, d27, [x17, 24]
fmadd d2, d8, d26, d2
fmadd d23, d8, d29, d23
ldp d26, d29, [x17, 64]
fmadd d25, d3, d8, d25
ldp d0, d3, [x17, 40]
fmadd d24, d8, d30, d24
fmadd d30, d8, d26, d17
fmadd d22, d8, d28, d22
ldp d26, d17, [x17, 104]
fmadd d21, d8, d27, d21
fmadd d20, d8, d0, d20
ldr d28, [x17, 80]
ldr d0, [x17, 88]
fmadd d31, d8, d31, d18
ldr d27, [x17, 96]
fmadd d29, d8, d29, d16
fmadd d28, d8, d28, d7
fmadd d19, d8, d3, d19
fmadd d26, d8, d26, d4
fmadd d3, d8, d0, d6
fmadd d27, d8, d27, d5
fmadd d1, d8, d17, d1
ldr d0, [x16, 8]
add x16, x15, 8
ldr d5, [x17, 520]
ldr d7, [x17, 536]
ldr d8, [x17, 576]
fmadd d24, d0, d5, d24
ldr d4, [x17, 528]
fmadd d22, d0, d7, d22
ldr d5, [x17, 560]
fmadd d17, d0, d8, d30
ldr d18, [x17, 568]
ldr d16, [x17, 584]
fmadd d23, d0, d4, d23
ldr d7, [x17, 592]
fmadd d19, d0, d5, d19
ldr d8, [x17, 608]
fmadd d18, d0, d18, d31
fmadd d16, d0, d16, d29
ldr d6, [x17, 504]
fmadd d7, d0, d7, d28
ldr d31, [x17, 512]
fmadd d5, d0, d8, d27
ldr d30, [x17, 544]
ldr d29, [x17, 552]
fmadd d25, d6, d0, d25
ldr d28, [x17, 600]
fmadd d2, d0, d31, d2
ldr d4, [x17, 616]
fmadd d21, d0, d30, d21
ldr d27, [x17, 624]
fmadd d20, d0, d29, d20
fmadd d6, d0, d28, d3
add x17, x17, 1024
fmadd d4, d0, d4, d26
fmadd d1, d0, d27, d1
cmp x16, x19
bne L5
I also tried to remove the -funroll-loops flag from gfortran; now I get:
L5:
ldp d28, d27, [x0, -8]
ldp d26, d25, [x0, 8]
ldr d0, [x2], 8
fmadd d24, d28, d0, d24
fmadd d23, d0, d27, d23
fmadd d22, d0, d26, d22
fmadd d21, d0, d25, d21
ldp d28, d27, [x0, 24]
ldp d26, d25, [x0, 40]
fmadd d20, d0, d28, d20
fmadd d19, d0, d27, d19
ldp d28, d27, [x0, 56]
fmadd d18, d0, d26, d18
fmadd d17, d0, d25, d17
ldp d26, d25, [x0, 72]
fmadd d16, d0, d28, d16
fmadd d7, d0, d27, d7
ldp d28, d27, [x0, 88]
fmadd d6, d0, d26, d6
fmadd d5, d0, d25, d5
ldp d26, d25, [x0, 104]
fmadd d4, d0, d28, d4
fmadd d3, d0, d27, d3
add x0, x0, 512
fmadd d2, d0, d26, d2
fmadd d1, d0, d25, d1
cmp x2, x19
bne L5
This runs at 0.2946 clock cycles.
These are not vector instructions, are they? I wonder if these operate on one double at a time.
Tiger Lake makes it very easy to get a high percentage of peak, because it is very bottlenecked by lack of (FMA) execution resources:
julia> using LinuxPerf
julia> foreachf(f::F, N::Int, arg1::A, args::Vararg{Any,K}) where {F,A,K} = foreach(_ -> f(arg1, args...), 1:N)
foreachf (generic function with 1 method)
julia> @pstats "(cpu-cycles,task-clock),(instructions,branch-instructions,branch-misses), (L1-dcache-load-misses, L1-dcache-loads, cache-misses, cache-references)" begin
foreachf(jgemvavx!, 1_000_000, y, A, x)
end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌ cpu-cycles               5.20e+08  100.0%  #   4.4 cycles per ns
└ task-clock               1.19e+08  100.0%  # 119.2 ms
┌ instructions             9.21e+08  100.0%  #   1.8 insns per cycle
│ branch-instructions      7.20e+07  100.0%  #   7.8% of instructions
└ branch-misses            1.00e+06  100.0%  #   1.4% of branch instructions
┌ L1-dcache-load-misses    9.60e+03  100.0%  #   0.0% of dcache loads
│ L1-dcache-loads          5.96e+08  100.0%
│ cache-misses             6.85e+02  100.0%  #  79.3% of cache references
└ cache-references         8.64e+02  100.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
julia> 5.2e8 / (1_000_000 * N^2)
0.126953125
I switched to using LinuxPerf.jl here because my laptop's clock speed is erratic, but I could of course also disable the turbo, which should pin it at roughly 2.8 GHz.
However, LinuxPerf also lets us calculate cycles/iteration more directly: we can just run the function a million times, and then compare the number of clock cycles with N^2.
LinuxPerf is of course Linux-only. It's probably x86-only too (given it needs to run performance counters? I'd have to check if it supports ARM).
Interestingly, we hit 0.1269, meaning we're at around 98% of the theoretical peak!
Tiger Lake is a big upgrade over Skylake-X/Cascade lake cores in most respects, except that it only has a single fma unit. Thus, if Skylake-X/Cascade Lake can reach or exceed 1 FMA/cycle, the only reason Tiger Lake couldn't is because it is missing the extra unit. Being so starved on execution, anything bottlenecked by SIMD FMA makes it easy to hit the theoretical peak on Ice/Tiger/Rocket lake.
Ok, so the vectorization goes roughly like this if I understand it correctly. You start from: And the cost is 1 fma, and 1 memory read (R). For now we assume everything else gets amortized (if not, we'll have to add it to the peak).
Yes, that's exactly right. The yi(k) and x(j) are of course hoisted out.
We can't totally amortize the x(j), because we're only getting 16 reuses here, but it goes a long way toward ameliorating the 2-loads-per-fma situation we have with the code as written.
These are not vector instructions, are they? I wonder if these operate on one double at a time.
You're right.
Try -march=armv8.5-a, although -march=armv8-a should be enough.
Interesting. Yes, different platforms have different pros and cons, but plotting the clock cycles and also the peak into the same graph makes it easy to see how well each compiler is doing. The flops are not very informative, especially since many of these benchmarks (depending on the platform!) are limited by memory reads.
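For what it's worth, this is roughly the plot I have in mind (a sketch; it assumes vectors sizes, measured times ts in seconds, a clock freq in Hz, and a peak in cycles/iteration have already been computed):
using Plots
cycles = ts .* freq ./ sizes .^ 2                    # cycles per loop-body evaluation
plot(sizes, cycles; xscale = :log10, xlabel = "N",
     ylabel = "clock cycles / iteration", label = "measured")
hline!([peak]; label = "theoretical peak", linestyle = :dash)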
I also tried lfortran --fast a.f90 on the manually unrolled code. I get 1.328 clock cycles. ;) The assembly is a mess; I think this is the relevant part:
LBB7_5:
ldr w10, [sp, #176]
add w8, w24, w30
add w23, w2, w9
ldr d0, [x28, w8, sxtw #3]
add w8, w23, #1
ldr d1, [x27, w8, sxtw #3]
sub w8, w13, w10
sbfiz x8, x8, #3, #32
ldr d2, [x11, x8]
fmul d1, d1, d0
add w10, w23, #2
add w30, w30, #1
fadd d1, d2, d1
str d1, [x11, x8]
ldr w8, [sp, #176]
ldr d1, [x27, w10, sxtw #3]
add w10, w23, #3
cmp w30, w25
sub w8, w14, w8
sbfiz x8, x8, #3, #32
ldr d2, [x11, x8]
fmul d1, d1, d0
add w9, w9, w26
fadd d1, d2, d1
str d1, [x11, x8]
ldr w8, [sp, #176]
ldr d1, [x27, w10, sxtw #3]
add w10, w23, #4
sub w8, w15, w8
sbfiz x8, x8, #3, #32
ldr d2, [x11, x8]
fmul d1, d1, d0
fadd d1, d2, d1
...
It's not using fmas. That's because I didn't figure out how to force the --fast-math flag to LLVM from C++... One can use a workaround:
lfortran --show-llvm a.f90 > a.ll
clang -O3 -ffast-math -funroll-loops -c a.ll -o a.o
lfortran -o a.out a.o
Still 1.32 cycles and no fmas. Not sure what is going on.
Try -march=armv8.5-a, although -march=armv8-a should be enough.
Ah, that's how you do it. I was always doing:
$ clang -O3 -march=native -ffast-math -funroll-loops -c a.ll -o a.o
clang-11: error: the clang compiler does not support '-march=native'
But this works:
clang -O3 -march=armv8.5-a -ffast-math -funroll-loops -c a.ll -o a.o
Still no fma. I also tried the same with gfortran, but no change either.
I think LFortran might generate LLVM IR for the indexing operations in a way that is hard for LLVM to optimize.
I always thought both MKL and OpenBLAS have state-of-the-art matrix-vector multiply. I can't quite find their lines in the plot, but they don't seem to be doing well at all. Why is that?
My understanding is that Goto wrote it by hand in assembly for the Pentium 4. I don't know if this is close to the original code:
- https://github.com/xianyi/OpenBLAS/blob/develop/kernel/x86_64/dgemv_n.S
It looks really complicated. Do you know what vectorization strategy he used? Due to the lack of fma, he had to very carefully hide the latency of the mul and then the add; the older CPUs were not as good at reordering instructions and executing them at the same time. However, the Pentium 4 already had xmm registers, so I would think the strategy might be similar to today's.
Update: I found some details here: https://dl.acm.org/doi/10.1145/1356052.1356053, although mostly for gemm.
Now let's try C++ on the M1 for N=64.
First the original benchmark, as written, takes 1.77428 cycles:
void gemv4(double* y, double* A, double* x){
    for (long m = 0; m < N; m++){
        y[m] = 0.0;
    }
    for (long i = 0; i < N; i++){
        for (long j = 0; j < N; j++){
            y[i] += A[i + j*N] * x[j];
        }
    }
}
The manually unrolled loop is 0.24548 cycles:
void gemv5(double* y, double* A, double* x){
    for (long m = 0; m < N; m++){
        y[m] = 0.0;
    }
    double yi[16];
    for (long i = 0; i < N; i+=16) {
        for (long k = 0; k < 16; k++) yi[k] = 0.0;
        for (long j = 0; j < N; j++) {
            for (long k = 0; k < 16; k++) {
                yi[k] += A[i+k + j*N] * x[j];
            }
        }
        for (long k = 0; k < 16; k++) y[i+k] = yi[k];
    }
}
If you transpose A, it takes 0.242369 clock cycles:
void gemv(double* y, double* A, double* x){
    for (long m = 0; m < N; m++){
        y[m] = 0.0;
    }
    for (long k = 0; k < N; k++){
        for (long m = 0; m < N; m++){
            y[m] += A[m + k*N] * x[k];
        }
    }
}
That's your code from: https://github.com/JuliaSIMD/LoopVectorization.jl/blob/134a95c84aacd74cfffae484918fd179e93fb9d9/benchmark/looptests.c#L154
I used the native Clang:
clang++ -std=c++17 -O3 -march=armv8.5-a -ffast-math d.cpp
The assembly for the gemv5 version:
LBB4_1: ; =>This Inner Loop Header: Depth=1
ldr d16, [x2, x8]
ldp q17, q18, [x9, #-64]
fmla.2d v0, v17, v16[0]
fmla.2d v1, v18, v16[0]
ldp q17, q18, [x9, #-32]
fmla.2d v2, v17, v16[0]
fmla.2d v3, v18, v16[0]
ldp q17, q18, [x9]
fmla.2d v4, v17, v16[0]
fmla.2d v5, v18, v16[0]
ldp q17, q18, [x9, #32]
fmla.2d v6, v17, v16[0]
fmla.2d v7, v18, v16[0]
add x8, x8, #8 ; =8
add x9, x9, #512 ; =512
cmp x8, #512 ; =512
b.ne LBB4_1
(I hope this is the right one.) It seems equivalent to the assembly from LV you posted above.