
Matrix Multiplication benchmark analysis

certik opened this issue 2 years ago · 9 comments

See #355 for the background. I am now applying the same methodology to the matrix multiplication benchmark:

  • https://juliasimd.github.io/LoopVectorization.jl/stable/examples/matrix_multiplication/

I modified the kernel to not zero C, to simplify things a bit:

julia> N = 64
64

julia> A = rand(N,N); B = rand(N,N); C = rand(N,N);

julia> function A_mul_B2!(C, A, B)
           @turbo for n ∈ indices((C,B), 2), m ∈ indices((C,A), 1)
               for k ∈ indices((A,B), (2,1))
                   C[m,n] += A[m,k] * B[k,n]
               end
           end
       end
A_mul_B2! (generic function with 1 method)

julia> @benchmark A_mul_B2!($C, $A, $B)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  23.708 μs …  47.333 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     24.250 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   24.294 μs ± 717.909 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

        ▂  ▁                        ▂  ▆   █  ▆  ▄  ▃  ▁       ▂
  ▃▁▁█▁▁█▁▁█▁▁▇▁▁▅▁▁▁▄▁▁▄▁▁▃▁▁▁▁▁▁▁▁█▁▁█▁▁▁█▁▁█▁▁█▁▁█▁▁█▁▁▆▁▁▄ █
  23.7 μs       Histogram: log(frequency) by time      24.5 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> cpu_freq = 3.2e9 # 3.2 GHz, in Hz
3.2e9

julia> 23.708e-6 * cpu_freq / N^3
0.289404296875
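
For reference, a small helper (a sketch, not from the original post) for the normalization used throughout this thread, i.e. clock cycles per inner-loop iteration (one fma each):

# Convert a measured runtime into clock cycles per fma for an N×N×N matmul.
# `freq` is an assumption: the nominal 3.2 GHz of the M1 performance cores used above.
cycles_per_fma(time_s, N; freq = 3.2e9) = time_s * freq / N^3

cycles_per_fma(23.708e-6, 64)  # ≈ 0.289, matching the value computed above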

In C++ I tested two kernels:

void gemm_mnk(double* C, double* A, double* B, long M, long K, long N){
  for (long m = 0; m < M; m++){
    for (long n = 0; n < N; n++){
      for (long k = 0; k < K; k++){
        C[m + n*M] += A[m + k*M] * B[k + n*K];
      }
    }
  }
  return;
}

void gemm_nkm(double* C, double* A, double* B, long M, long K, long N){
  for (long n = 0; n < N; n++){
    for (long k = 0; k < K; k++){
      for (long m = 0; m < M; m++){
        C[m + n*M] += A[m + k*M] * B[k + n*K];
      }
    }
  }
  return;
}

For N=64, I am getting (in clock cycles per fma, the same normalization as above):

gemm_mnk: 1.78319
gemm_nkm: 0.21441

Aren't these two kernels equivalent? They seem to operate on the same matrix layout and return exactly the same answer; they differ only in loop ordering.

The inner loop from gemm_nkm as generated by Clang seems to be:

LBB7_5:                                 ;   Parent Loop BB7_3 Depth=1
                                        ;     Parent Loop BB7_4 Depth=2
                                        ; =>    This Inner Loop Header: Depth=3
	ldr	d0, [x13, x14]
	ldp	q1, q2, [x4, #-256]
	fmla.2d	v10, v1, v0[0]
	fmla.2d	v9, v2, v0[0]
	ldp	q1, q2, [x4, #-224]
	fmla.2d	v8, v1, v0[0]
	fmla.2d	v31, v2, v0[0]
	ldp	q1, q2, [x4, #-192]
	fmla.2d	v30, v1, v0[0]
	fmla.2d	v29, v2, v0[0]
	ldp	q1, q2, [x4, #-160]
	fmla.2d	v28, v1, v0[0]
	fmla.2d	v27, v2, v0[0]
	ldp	q1, q2, [x4, #-128]
	fmla.2d	v26, v1, v0[0]
	fmla.2d	v25, v2, v0[0]
	ldp	q1, q2, [x4, #-96]
	fmla.2d	v24, v1, v0[0]
	fmla.2d	v23, v2, v0[0]
	ldp	q1, q2, [x4, #-64]
	fmla.2d	v22, v1, v0[0]
	fmla.2d	v21, v2, v0[0]
	ldp	q1, q2, [x4, #-32]
	fmla.2d	v20, v1, v0[0]
	fmla.2d	v19, v2, v0[0]
	ldp	q1, q2, [x4]
	fmla.2d	v18, v1, v0[0]
	fmla.2d	v17, v2, v0[0]
	ldp	q1, q2, [x4, #32]
	fmla.2d	v16, v1, v0[0]
	fmla.2d	v7, v2, v0[0]
	ldp	q1, q2, [x4, #64]
	fmla.2d	v6, v1, v0[0]
	fmla.2d	v5, v2, v0[0]
	ldp	q1, q2, [x4, #96]
	fmla.2d	v4, v1, v0[0]
	ldp	q1, q3, [sp, #96]               ; 32-byte Folded Reload
	fmla.2d	v1, v2, v0[0]
	str	q1, [sp, #96]                   ; 16-byte Folded Spill
	ldp	q1, q2, [x4, #128]
	fmla.2d	v3, v1, v0[0]
	ldr	q1, [sp, #128]                  ; 16-byte Folded Reload
	fmla.2d	v1, v2, v0[0]
	stp	q3, q1, [sp, #112]              ; 32-byte Folded Spill
	ldp	q1, q2, [x4, #160]
	fmla.2d	v12, v1, v0[0]
	fmla.2d	v11, v2, v0[0]
	ldp	q1, q2, [x4, #192]
	fmla.2d	v13, v1, v0[0]
	fmla.2d	v15, v2, v0[0]
	ldp	q1, q2, [x4, #224]
	fmla.2d	v14, v1, v0[0]
	ldr	q1, [sp, #144]                  ; 16-byte Folded Reload
	fmla.2d	v1, v2, v0[0]
	str	q1, [sp, #144]                  ; 16-byte Folded Spill
	add	x14, x14, #8                    ; =8
	add	x4, x4, #512                    ; =512
	cmp	x14, #512                       ; =512
	b.ne	LBB7_5

certik · Nov 07 '21 19:11

Theoretical Peak Performance

The Clang results hint that gemm_nkm is probably written in exactly the optimal order, and vectorizes over m:

   for (long n = 0; n < N; n++){
    for (long k = 0; k < K; k++){
      for (long m = 0; m < M; m++){
        C[m + n*M] += A[m + k*M] * B[k + n*K];
      }
    }
  }

In this inner loop, vectorized over m, we have one memory write (C[m + n*M]), one memory read (A[m + k*M]), and one fma; everything else is constant across the loop. On the M1, a memory write costs 0.25 clock cycles per double, a memory read 0.1666 cycles per double, and an fma 0.125 cycles per double. So if we only had the inner loop, the bottleneck would be the memory writes. However, the write to C[m + n*M] can be kept in a register and reused across the loop over k. The same holds for the memory reads: each array is indexed by only two of the three loops, while the fma is executed in all three.

Another way to look at it: the complexity is O(N^3), so we have N^3 fmas, but only N^2 memory writes (C) and 2*N^2 memory reads (A and B) are required.

So the bottleneck is actually only fma for large enough N.

The theoretical peak performance for the loop body on M1 is thus 0.125 clock cycles per double.
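
To make that accounting concrete, here is a small sketch using the per-double costs assumed above; the function names are only for illustration:

# Cost model in clock cycles per double, using the M1 numbers quoted above.
STORE = 0.25
LOAD  = 1 / 6      # ≈ 0.1666
FMA   = 0.125

# If every element of the N^3 iteration space did one C store, one A load, and
# one fma, the stores would dominate:
naive_bound(N) = N^3 * max(STORE, LOAD, FMA)           # 0.25 * N^3 cycles

# But C can stay in registers across k, and each array only has N^2 elements,
# so only the fma count is O(N^3):
reuse_bound(N) = N^3 * FMA + N^2 * (STORE + 2 * LOAD)

reuse_bound(64) / 64^3     # ≈ 0.134 cycles per double; → 0.125 as N grows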

certik · Nov 07 '21 19:11

> The theoretical peak performance for the loop body on M1 is thus 0.125 clock cycles per double.

I get fairly close to this on the M1:

julia> @benchmark A_mul_B2!($C, $A, $B)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  11.083 μs …  22.625 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     11.167 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   11.189 μs ± 306.488 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                         ▁█                                     
  ▂▁▁▁▁▁▁▁▁▁▁▁▅▁▁▁▁▁▁▁▁▁▁██▁▁▁▁▁▁▁▁▁▁▅▄▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▂ ▂
  11.1 μs         Histogram: frequency by time         11.3 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> cpu_freq = 3.2e9 # 3.2 GHz, in Hz
3.2e9

julia> 11.083e-6 * cpu_freq / N^3
0.13529052734375

julia> versioninfo()
Julia Version 1.8.0-DEV.901
Commit 47255f9fe8* (2021-11-06 19:56 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin21.1.0)
  CPU: Apple M1

Rosetta cannot use FMA instructions, because the Westmere-level x86 CPU it presents does not have them, so it is limited to at best half of the achievable performance.

Also, with LV, you can write the loops in any order you'd like, and it should interchange them as necessary.
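
For illustration, the same kernel written in the m-n-k order that was slow for Clang's gemm_mnk might look like this (a sketch; the name A_mul_B2_mnk! is made up for this example):

using LoopVectorization

# Same kernel as A_mul_B2!, but with m outermost and n in the middle, mirroring
# the gemm_mnk ordering; @turbo is free to interchange the loops itself.
function A_mul_B2_mnk!(C, A, B)
    @turbo for m ∈ indices((C, A), 1), n ∈ indices((C, B), 2)
        for k ∈ indices((A, B), (2, 1))
            C[m, n] += A[m, k] * B[k, n]
        end
    end
end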

chriselrod · Nov 07 '21 20:11

The Clang-generated assembly looks pretty good, but it's still slow. 0.135 is decent; that is 92% of peak.

I thought I got 100% of the peak on Intel hardware about 5 years ago. However, the matrices might have been larger. For smaller matrices (and 64, I think, is too small) I don't get the peak.

Are you able to get 100% of the peak for large enough matrices in Julia, whether M1 or Intel? I don't care which platform.

certik · Nov 07 '21 20:11

Here is the inner loop LV generates for Neon:

L328:
	ld1r	{ v10.2d }, [x20], #8
	ldp	q0, q11, [x15]
	ldp	q12, q13, [x15, #32]
	ldr	d14, [x1, x7]
	add	x15, x15, x6
	fmla	v29.2d, v10.2d, v0.2d
	fmla	v24.2d, v10.2d, v11.2d
	fmla	v17.2d, v10.2d, v12.2d
	fmla	v2.2d, v10.2d, v13.2d
	ldr	d10, [x1, x9]
	fmla	v30.2d, v0.2d, v14.d[0]
	fmla	v25.2d, v11.2d, v14.d[0]
	fmla	v18.2d, v12.2d, v14.d[0]
	fmla	v4.2d, v13.2d, v14.d[0]
	ldr	d14, [x1, x10]
	fmla	v31.2d, v0.2d, v10.d[0]
	fmla	v26.2d, v11.2d, v10.d[0]
	fmla	v19.2d, v12.2d, v10.d[0]
	fmla	v5.2d, v13.2d, v10.d[0]
	ldr	d10, [x1, x12]
	fmla	v8.2d, v0.2d, v14.d[0]
	fmla	v27.2d, v11.2d, v14.d[0]
	fmla	v20.2d, v12.2d, v14.d[0]
	fmla	v6.2d, v13.2d, v14.d[0]
	ldr	d14, [x1, x14]
	fmla	v9.2d, v0.2d, v10.d[0]
	fmla	v28.2d, v11.2d, v10.d[0]
	fmla	v21.2d, v12.2d, v10.d[0]
	fmla	v7.2d, v13.2d, v10.d[0]
	cmp	x15, x11
	fmla	v23.2d, v0.2d, v14.d[0]
	fmla	v22.2d, v11.2d, v14.d[0]
	fmla	v16.2d, v12.2d, v14.d[0]
	fmla	v3.2d, v13.2d, v14.d[0]
	mov	x1, x20
	b.ls	L328

It has 24 fmas, 10 loads, and 0 stores.
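
As a quick check (a back-of-the-envelope sketch using the M1 fma throughput quoted earlier, 0.125 cycles per double, i.e. 4 two-wide NEON fma instructions per cycle):

# 24 fma instructions per iteration, at 4 NEON fma instructions per cycle
# (0.125 cycles/double × 2 doubles/register), gives a floor of 6 cycles:
cycles_floor = 24 / 4

# Each iteration performs 24 × 2 = 48 scalar fmas, so the per-double floor is
cycles_floor / 48      # = 0.125, i.e. exactly the theoretical peak above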

chriselrod · Nov 07 '21 20:11

> The Clang-generated assembly looks pretty good, but it's still slow. 0.135 is decent; that is 92% of peak.

> I thought I got 100% of the peak on Intel hardware about 5 years ago. However, the matrices might have been larger. For smaller matrices (and 64, I think, is too small) I don't get the peak.

> Are you able to get 100% of the peak for large enough matrices in Julia, whether M1 or Intel? I don't care which platform.

Getting close to the peak is very easy on Tiger Lake, e.g. using 72x72 matrices:

julia> @pstats "(cpu-cycles,task-clock),(instructions,branch-instructions,branch-misses), (L1-dcache-load-misses, L1-dcache-loads, cache-misses, cache-references)" begin
         foreachf(AmulB!, 100_000, C0, A, B)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌ cpu-cycles               4.70e+09  100.0%  #  4.2 cycles per ns
└ task-clock               1.11e+09  100.0%  #  1.1 s
┌ instructions             8.13e+09  100.0%  #  1.7 insns per cycle
│ branch-instructions      1.78e+08  100.0%  #  2.2% of instructions
└ branch-misses            2.50e+06  100.0%  #  1.4% of branch instructions
┌ L1-dcache-load-misses    5.48e+08  100.0%  # 21.0% of dcache loads
│ L1-dcache-loads          2.61e+09  100.0%
│ cache-misses             1.73e+03  100.0%  # 72.7% of cache references
└ cache-references         2.38e+03  100.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

julia> 4.7e9 / (100_000 * 72^3)
0.12592163923182442

Or using C[m,n] += instead:

julia> @pstats "(cpu-cycles,task-clock),(instructions,branch-instructions,branch-misses), (L1-dcache-load-misses, L1-dcache-loads, cache-misses, cache-references)" begin
         foreachf(AmulBplusC!, 100_000, C0, A, B)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌ cpu-cycles               4.74e+09  100.0%  #  4.2 cycles per ns
└ task-clock               1.12e+09  100.0%  #  1.1 s
┌ instructions             8.19e+09  100.0%  #  1.7 insns per cycle
│ branch-instructions      1.78e+08  100.0%  #  2.2% of instructions
└ branch-misses            2.50e+06  100.0%  #  1.4% of branch instructions
┌ L1-dcache-load-misses    5.51e+08  100.0%  # 20.7% of dcache loads
│ L1-dcache-loads          2.67e+09  100.0%
│ cache-misses             1.79e+03  100.0%  # 68.8% of cache references
└ cache-references         2.60e+03  100.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

julia> 4.74e9 / (100_000 * 72^3)
0.12699331275720166

As before, Tiger Lake is very starved for FMA units, so it's easy to get bottlenecked there and thus achieve nearly peak FMA-unit utilization. But this is about as close as I can get.

The assembly:

L480:
	vbroadcastsd	zmm29, qword ptr [r8]
	vmovupd	zmm30, zmmword ptr [rdx]
	vmovupd	zmm31, zmmword ptr [rdx + 64]
	vmovupd	zmm28, zmmword ptr [rdx + 128]
	prefetcht0	byte ptr [rdx + rbx]
	prefetcht0	byte ptr [rdx + rbx + 64]
	prefetcht0	byte ptr [rdx + rbx + 128]
	vfmadd231pd	zmm19, zmm30, zmm29     # zmm19 = (zmm30 * zmm29) + zmm19
	vfmadd231pd	zmm10, zmm31, zmm29     # zmm10 = (zmm31 * zmm29) + zmm10
	vbroadcastsd	zmm0, qword ptr [r8 + r14]
	vfmadd231pd	zmm1, zmm28, zmm29      # zmm1 = (zmm28 * zmm29) + zmm1
	vfmadd231pd	zmm20, zmm30, zmm0      # zmm20 = (zmm30 * zmm0) + zmm20
	vfmadd231pd	zmm11, zmm31, zmm0      # zmm11 = (zmm31 * zmm0) + zmm11
	vfmadd231pd	zmm2, zmm28, zmm0       # zmm2 = (zmm28 * zmm0) + zmm2
	vbroadcastsd	zmm0, qword ptr [r8 + 2*r14]
	vfmadd231pd	zmm21, zmm30, zmm0      # zmm21 = (zmm30 * zmm0) + zmm21
	vfmadd231pd	zmm12, zmm31, zmm0      # zmm12 = (zmm31 * zmm0) + zmm12
	vfmadd231pd	zmm3, zmm28, zmm0       # zmm3 = (zmm28 * zmm0) + zmm3
	vbroadcastsd	zmm0, qword ptr [r8 + r13]
	vfmadd231pd	zmm22, zmm30, zmm0      # zmm22 = (zmm30 * zmm0) + zmm22
	vfmadd231pd	zmm13, zmm31, zmm0      # zmm13 = (zmm31 * zmm0) + zmm13
	vfmadd231pd	zmm4, zmm28, zmm0       # zmm4 = (zmm28 * zmm0) + zmm4
	vbroadcastsd	zmm0, qword ptr [r8 + 4*r14]
	vfmadd231pd	zmm23, zmm30, zmm0      # zmm23 = (zmm30 * zmm0) + zmm23
	vfmadd231pd	zmm14, zmm31, zmm0      # zmm14 = (zmm31 * zmm0) + zmm14
	vfmadd231pd	zmm5, zmm28, zmm0       # zmm5 = (zmm28 * zmm0) + zmm5
	vbroadcastsd	zmm0, qword ptr [r8 + r15]
	vfmadd231pd	zmm24, zmm30, zmm0      # zmm24 = (zmm30 * zmm0) + zmm24
	vfmadd231pd	zmm15, zmm31, zmm0      # zmm15 = (zmm31 * zmm0) + zmm15
	vfmadd231pd	zmm6, zmm28, zmm0       # zmm6 = (zmm28 * zmm0) + zmm6
	vbroadcastsd	zmm0, qword ptr [r8 + 2*r13]
	vfmadd231pd	zmm25, zmm30, zmm0      # zmm25 = (zmm30 * zmm0) + zmm25
	vfmadd231pd	zmm16, zmm31, zmm0      # zmm16 = (zmm31 * zmm0) + zmm16
	vbroadcastsd	zmm29, qword ptr [r8 + rax]
	vfmadd231pd	zmm7, zmm28, zmm0       # zmm7 = (zmm28 * zmm0) + zmm7
	vfmadd231pd	zmm26, zmm30, zmm29     # zmm26 = (zmm30 * zmm29) + zmm26
	vfmadd231pd	zmm17, zmm31, zmm29     # zmm17 = (zmm31 * zmm29) + zmm17
	vfmadd231pd	zmm8, zmm28, zmm29      # zmm8 = (zmm28 * zmm29) + zmm8
	vbroadcastsd	zmm0, qword ptr [r8 + 8*r14]
	vfmadd231pd	zmm27, zmm0, zmm30      # zmm27 = (zmm0 * zmm30) + zmm27
	vfmadd231pd	zmm18, zmm0, zmm31      # zmm18 = (zmm0 * zmm31) + zmm18
	vfmadd231pd	zmm9, zmm28, zmm0       # zmm9 = (zmm28 * zmm0) + zmm9
	add	rdx, r10
	add	r8, 8
	cmp	rdx, rsi
	jbe	L480

This is 27 fmas, 12 loads (+3 prefetches), and 0 stores.

chriselrod · Nov 07 '21 20:11

Doubling the FMA units on Cascade Lake (and sacrificing a lot of out-of-order capability, a bit of cache size, etc.):

julia> @pstats "cpu-cycles,(instructions,branch-instructions,branch-misses),(task-clock,context-switches,cpu-migrations,page-faults),(L1-dcache-load-misses,L1-dcache-loads,L1-icache-load-misses),(dTLB-load-misses,dTLB-loads)" begin
         foreachf(AmulBplusC!, 100_000, C0, A, B)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
╶ cpu-cycles               2.51e+09   49.9%  #  3.6 cycles per ns
┌ instructions             8.24e+09   75.0%  #  3.3 insns per cycle
│ branch-instructions      1.88e+08   75.0%  #  2.3% of instructions
└ branch-misses            1.01e+06   75.0%  #  0.5% of branch instructions
┌ task-clock               7.01e+08  100.0%  # 701.0 ms
│ context-switches         0.00e+00  100.0%
│ cpu-migrations           0.00e+00  100.0%
└ page-faults              0.00e+00  100.0%
┌ L1-dcache-load-misses    6.90e+08   25.0%  # 25.7% of dcache loads
│ L1-dcache-loads          2.69e+09   25.0%
└ L1-icache-load-misses    2.43e+04   25.0%
┌ dTLB-load-misses         0.00e+00   25.0%  #  0.0% of dTLB loads
└ dTLB-loads               2.69e+09   25.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

julia> 2.51e9 / (100_000 * 72^3)
0.06724751371742113

We're at about 93% instead of the 98%+ on Tiger Lake.
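
Spelling the two percentages out (a sketch; it assumes one 512-bit FMA port on Tiger Lake and two on Cascade Lake, which is what doubling the FMA units refers to):

# Peak in cycles per scalar fma: 1 / (fma_ports × 8 doubles per zmm register).
peak(ports) = 1 / (ports * 8)

peak(1)                 # 0.125  — Tiger Lake
peak(2)                 # 0.0625 — Cascade Lake

peak(1) / 0.12699       # ≈ 0.984, the "98%+" Tiger Lake figure
peak(2) / 0.06725       # ≈ 0.929, the ~93% Cascade Lake figure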

chriselrod · Nov 07 '21 20:11

> The Clang results hint that gemm_nkm is probably written in exactly the optimal order, and vectorizes over m:

The innermost loop in the two assembly examples I posted above was in fact the k loop, while both the m and n loops were unrolled. LV will "canonicalize" to produce this code regardless of the initial loop order.

The approach is to:

  1. Hoist the C loads and stores out of the loop.
  2. Unroll m and also vectorize it.
  3. Unroll n (its loads from B become scalar broadcasts rather than vector loads).

For each factor by which n is unrolled, you can reuse a load from A[m,k]. For each factor by which m is unrolled, you can reuse a load from B[k,n]. Hence, unrolling both of these loops is a great means of maximizing load-reuse, and thus increasing the fma-to-load ratio.

In both assembly examples above, the m unrolling happens first (with all loads from A), and then the unrolled code "iteratively" loads/broadcasts scalars from B.

We are of course limited by the number of architecturally named vector registers (the physical register file actually numbers in the hundreds on each of these architectures) in how much unrolling we can do without spilling registers, which would of course defeat the point of reducing the number of loads and stores.
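
A rough sketch of that accounting, for hypothetical unroll factors Um (along m, vectorized) and Un (along n); the function names are made up, but the counts line up with the two inner loops shown above:

# Registers: Um*Un accumulators for C, Um vectors loaded from A, plus ~1 scratch
# register for the current broadcast from B (ignoring a couple of temporaries).
registers_needed(Um, Un) = Um * Un + Um + 1

# Loads per iteration: Um vector loads from A plus Un broadcasts from B,
# against Um*Un fmas.
fma_per_load(Um, Un) = (Um * Un) / (Um + Un)

registers_needed(4, 6), fma_per_load(4, 6)   # (29, 2.4)  — the 24-fma / 10-load NEON loop
registers_needed(3, 9), fma_per_load(3, 9)   # (31, 2.25) — the 27-fma / 12-load AVX-512 loop
# Both fit within the 32 architectural vector registers.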

chriselrod · Nov 07 '21 21:11

> I thought I got 100% of the peak on Intel hardware about 5 years ago. However, the matrices might have been larger. For smaller matrices (and 64, I think, is too small) I don't get the peak.

> Are you able to get 100% of the peak for large enough matrices in Julia, whether M1 or Intel? I don't care which platform.

LV currently doesn't do cache-level tiling, which is necessary for good performance outside of the L2 cache.

The best I seem to be able to do (Tiger Lake) is around the 144-square mark:

julia> M = K = N = 144; A = rand(M,K); B = rand(K,N); C1 = A*B; C0 = similar(C1);

julia> C0 .= 0;

julia> @pstats "(cpu-cycles,task-clock),(instructions,branch-instructions,branch-misses), (L1-dcache-load-misses, L1-dcache-loads, cache-misses, cache-references)" begin
         foreachf(AmulBplusC!, 10_000, C0, A, B)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌ cpu-cycles               3.76e+09  100.0%  #  4.2 cycles per ns
└ task-clock               8.91e+08  100.0%  # 891.5 ms
┌ instructions             6.45e+09  100.0%  #  1.7 insns per cycle
│ branch-instructions      1.40e+08  100.0%  #  2.2% of instructions
└ branch-misses            9.73e+05  100.0%  #  0.7% of branch instructions
┌ L1-dcache-load-misses    4.92e+08  100.0%  # 23.4% of dcache loads
│ L1-dcache-loads          2.10e+09  100.0%
│ cache-misses             5.42e+03  100.0%  # 22.6% of cache references
└ cache-references         2.40e+04  100.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

julia> 3.76e9 / (10_000 * 144^3)
0.12592163923182442

julia> 0.125/ans
0.9926808510638298

While the amount of memory occupied by the matrices is O(N^2), the total memory traffic required (without cache tiling) is O(N^3), so bandwidth becomes the dominating factor for performance at large sizes.
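
A small sketch of why the traffic grows as O(N^3) without cache-level tiling, in terms of the same hypothetical unroll factors as above:

# With only register blocking, each element of A is re-read N/Un times (once per
# unrolled block of the n loop) and each element of B N/Um times, so total loads are:
loads_without_tiling(N, Um, Un) = N^3 * (1 / Um + 1 / Un)

# e.g. N = 144 with the (3, 9) unrolling above:
loads_without_tiling(144, 3, 9) / 144^2      # ≈ 64 loads per matrix element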

chriselrod · Nov 07 '21 21:11

Thanks for the analysis. I mean, at the end of the day, over 90% of peak is excellent, and you are getting it everywhere.

For example, on the M1, Clang gets about 60% of peak if I help it (gemm_nkm) but only about 7% of peak without help (gemm_mnk). My own Julia+LV run, going through Rosetta, gets 43% no matter how I write the loops. And on your machine, running natively, it reaches 92% of peak.

I would say that if LV consistently delivers over 90% of peak, that's very good. It means that if somebody comes along and creates a library that runs at 100% of peak, it will only be 1.1x faster (10%).

It'd be nice to visualize this to make it clear. If you plot the clock counts, the theoretical peak forms a line at the bottom, everything else sits above it, and you can see how far each benchmark is from the peak.
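
As a rough text-only sketch of such a comparison, using the cycles-per-fma numbers quoted in this thread (the 0.125 M1 peak is the baseline; the percentages match the ones above):

peak = 0.125                         # M1 theoretical peak, cycles per fma
results = [
    "Clang gemm_mnk"     => 1.78319,
    "Clang gemm_nkm"     => 0.21441,
    "Julia+LV (Rosetta)" => 0.28940,
    "Julia+LV (native)"  => 0.13529,
]
for (name, cycles) in results
    pct = 100 * peak / cycles
    bar = repeat('#', round(Int, pct / 5))   # one '#' per 5% of peak
    println(rpad(name, 22), lpad(cycles, 9), "  ", rpad(bar, 20), " ",
            round(pct; digits = 1), "% of peak")
end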

certik · Nov 08 '21 01:11