LoopVectorization.jl
Plot against theoretical peak performance
I started with these plots:
https://juliasimd.github.io/LoopVectorization.jl/stable/examples/matrix_vector_ops/
The issue is that I can't tell from the plot if the performance is already the best you can do, or if there is still room for improvement. To answer that question, I like to plot clock cycles (per loop body / array element) on the vertical axis, and also plot the theoretical performance peak into the same graph. Let's do that for the very first benchmark:
julia> function jgemvavx!(𝐲, 𝐀, 𝐱)
           @turbo for i ∈ eachindex(𝐲)
               𝐲ᵢ = zero(eltype(𝐲))
               for j ∈ eachindex(𝐱)
                   𝐲ᵢ += 𝐀[i,j] * 𝐱[j]
               end
               𝐲[i] = 𝐲ᵢ
           end
       end
jgemvavx! (generic function with 1 method)
julia> N = 64
64
julia> y = rand(N);
julia> A = rand(N, N);
julia> x = rand(N);
julia> @benchmark jgemvavx!(y, A, x)
BenchmarkTools.Trial: 10000 samples with 199 evaluations.
Range (min … max):  427.970 ns … 610.970 ns  ┊ GC (min … max): 0.00% … 0.00%
Time  (median):     428.603 ns               ┊ GC (median):    0.00%
Time  (mean ± σ):   430.210 ns ±   9.385 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
428 ns Histogram: log(frequency) by time 463 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> cpu_freq = 3.2e9 # Hz (3.2 GHz)
3.2e9
julia> 427.136e-9 * cpu_freq / N^2
0.3337
So for N=64, using double precision, it seems the body of the loop takes 0.333 clock cycles.
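For reference, here is that conversion as a tiny helper (my own snippet; the fixed 3.2 GHz clock is an assumption about the M1's sustained frequency):
cycles_per_elem(t_ns, N; freq = 3.2e9) = t_ns * 1e-9 * freq / N^2   # cycles per loop-body evaluation
cycles_per_elem(427.970, 64)   # ≈ 0.334 for the minimum time above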
I use an Apple M1. I installed the macOS Julia; I assume that is Intel-based? Somehow it works for me.
Regarding the theoretical peak, the loop body is just:
𝐲ᵢ += 𝐀[i,j] * 𝐱[j]
Everything else, I think, gets amortized. So it is two memory reads from L1 cache, each taking 0.1665 clock cycles per double (ldr q0, [x1] has a reciprocal throughput of 0.333), and one fma, which takes 0.125 (fmla.2d v0, v0, v0 takes 0.25). The bottleneck in this case is the memory reads, for a total of 0.333 clock cycles. The fma runs at the same time, so we do not count it.
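To make the bookkeeping explicit, the same model written as a throwaway function (my own sketch, using the reciprocal throughputs quoted above):
peak_cycles(nloads, nfmas; load_rt = 0.1665, fma_rt = 0.125) =
    max(nloads * load_rt, nfmas * fma_rt)   # cycles per loop body: whichever unit saturates first
peak_cycles(2, 1)   # naive body: two loads + one fma -> 0.333, load bound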
So according to my analysis, the code runs at 100% of the theoretical peak. Which sounds too good to be true. I am worried I made some mistake in my analysis somewhere. But I am posting what I have so far.
Using the same analysis, for N=128 I get
julia> 1.700e-6 * cpu_freq / N^2
0.33203125
For N=256:
julia> 10.083e-6 * cpu_freq / N^2
0.492333984375
The L1 cache is 320 KB, the L2 cache is 12 MB. No L3 cache. The N=128 matrix takes:
julia> 128^2 * 8 / 1024
128.0
128 KB, so it fits in L1. The N=256 matrix takes:
julia> 256^2 * 8 / 1024
512.0
So it doesn't completely fit.
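As a quick check of the working-set sizes involved (x and y are negligible next to A):
for N in (64, 128, 256)
    println("N = $N: A takes $(N^2 * 8 / 1024) KiB, x and y another $(2N * 8 / 1024) KiB")
end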
My understanding is that this is double precision:
julia> typeof(A)
Matrix{Float64} (alias for Array{Float64, 2})
So I can't see any mistakes in my analysis. How can I see the assembly code that is actually being run? I tried:
julia> @code_native jgemvavx!(y, A, x)
.section __TEXT,__text,regular,pure_instructions
; ┌ @ REPL[52]:1 within `jgemvavx!'
pushq %r14
pushq %rbx
subq $56, %rsp
movq %rsi, %rbx
xorps %xmm0, %xmm0
movaps %xmm0, 16(%rsp)
movaps %xmm0, (%rsp)
movq $0, 32(%rsp)
movq %rsi, 48(%rsp)
movabsq $jl_get_ptls_states_fast, %rax
callq *%rax
movq %rax, %r14
movq $12, (%rsp)
movq (%rax), %rax
...
movq %rax, 32(%rsp)
movq %rbx, 24(%rsp)
movq %r10, 16(%rsp)
movabsq $"macro expansion;", %rax
callq *%rax
movq 8(%rsp), %rax
movq %rax, (%r14)
movabsq $jl_system_image_data, %rax
addq $56, %rsp
popq %rbx
popq %r14
retq
But it prints x86 assembly. And it doesn't print the actual function, just the driver it seems.
I found one worrying case:
julia> N=130
130
julia> y = rand(N); A = rand(N,N); x = rand(N);
julia> @benchmark jgemvavx!(y, A, x)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
Range (min … max):  1.633 μs … 3.646 μs   ┊ GC (min … max): 0.00% … 0.00%
Time  (median):     1.663 μs              ┊ GC (median):    0.00%
Time  (mean ± σ):   1.680 μs ± 82.119 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
1.63 μs Histogram: log(frequency) by time 2.09 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> 1.633e-6 * cpu_freq / N^2
0.309207100591716
It gets faster than the peak. That shouldn't be possible. Although I seem to be getting similar behavior with gfortran below, so perhaps the CPU frequency got higher temporarily.
I also tested GFortran:
program test_gemv
implicit none
integer, parameter :: N = 64, Niter = 100000, dp = kind(0.d0)
real(dp) :: A(N,N), x(N), y(N)
real(dp) :: t1, t2, cpu_freq
integer :: m
A = 1
x = 1
call cpu_time(t1)
do m = 1, Niter
call gemv(y, A, x)
end do
call cpu_time(t2)
print *, y(1)
print *, (t2-t1)/Niter * 1e9_dp, "ns"
cpu_freq = 3.2e9_dp
print *, (t2-t1)/Niter / N**2 * cpu_freq, "clock cycles"
contains
subroutine gemv(y, A, x)
real(dp), intent(out) :: y(:)
real(dp), intent(in) :: A(:,:), x(:)
integer :: i, j, yi
do i = 1, size(y)
yi = 0
do j = 1, size(x)
yi = yi + A(j,i) * x(j)
end do
y(i) = yi
end do
end subroutine
end program
But it prints:
$ gfortran -O3 -ffast-math -funroll-loops a.f90
$ ./a.out
...
10.541718750000001 clock cycles
Which is way too slow, so I am not quite sure what is going on. I tried the intrinsic matmul:
--- a/a.f90
+++ b/a.f90
@@ -24,14 +23,7 @@ contains
subroutine gemv(y, A, x)
real(dp), intent(out) :: y(:)
real(dp), intent(in) :: A(:,:), x(:)
- integer :: i, j, yi
- do i = 1, size(y)
- yi = 0
- do j = 1, size(x)
- yi = yi + A(j,i) * x(j)
- end do
- y(i) = yi
- end do
+ y = matmul(A, x)
end subroutine
end program
And that gives:
N       clock cycles
32      0.316
64      0.298
128     0.315
256     0.417
512     0.416
1024    0.442
2048    1.140
I don't trust these numbers yet. To make this reliable, I have to hook it into my benchmark driver, where I compile the kernel as a separate unit and then just link it into the driver. That way I know for sure the compiler didn't apply some optimization across the repetition loop.
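A minimal sketch of that setup (all file and symbol names here are hypothetical, and it assumes the kernel is exported with an explicit bind(c) interface so it can be ccall'ed directly):
# Compile the kernel alone into a shared library, then time it from Julia,
# so gfortran cannot fold or hoist anything across the repetition loop.
using BenchmarkTools
run(`gfortran -O3 -ffast-math -funroll-loops -shared -fPIC gemv_kernel.f90 -o libgemv.dylib`)
N = 64
A = rand(N, N); x = rand(N); y = zeros(N)
t = @belapsed ccall((:gemv_kernel, "./libgemv.dylib"), Cvoid,
                    (Ref{Cint}, Ptr{Float64}, Ptr{Float64}, Ptr{Float64}),
                    $(Ref(Cint(N))), $A, $x, $y)
println(t * 3.2e9 / N^2, " clock cycles per loop body")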
Found the mistake:
--- a/a.f90
+++ b/a.f90
@@ -24,7 +24,8 @@ contains
subroutine gemv(y, A, x)
real(dp), intent(out) :: y(:)
real(dp), intent(in) :: A(:,:), x(:)
- integer :: i, j, yi
+ integer :: i, j
+ real(dp) :: yi
do i = 1, size(y)
yi = 0
do j = 1, size(x)
Now I am getting 0.499 clock cycles for N=64.
I use an Apple M1. I installed the macOS Julia; I assume that is Intel-based? Somehow it works for me.
Yes. You can build Julia 1.7 and 1.8 from source for ARM on macOS, but there are a few issues. By far the most important issue is this. The SVD mentioned in the issue title is just one example. It is pervasive; segmentation faults are common. Many packages, including for example LoopVectorization.jl and Distributions.jl, cannot complete their tests on Mac ARM without segfaulting.
I installed an AArch64 Linux VM on my M1 Mac Mini, and LoopVectorization's tests passed following this commit.
Keno suspects the problem exists with Linux ARM as well (i.e. that the large code model is inappropriate for ARM generally), but that macOS reorganizes memory much more aggressively, resulting in the symptoms manifesting themselves much more frequently. I haven't used Julia in the Linux VM enough to see a segfault, and Julia is tested on Linux AArch64.
Keno bought a really nice 16 inch MacBook Pro earlier this week, so maybe he'll get the small code model to work.
But this is all so far out of my expertise (I only have the roughest idea of what large vs. small code model even means) that I can't comment much.
If you're using Rosetta, you'll (a) be missing out on FMA instructions (NEON has them) and (b) only have 16 named vector registers (NEON has 32), because Rosetta acts like a Westmere CPU.
You'll of course notice that the assembly you did see was all x64.
I'm still working on a reply w/ respect to the actual gemv.
Thanks @chriselrod, that explains everything regarding the current status of the tooling. I realize the M1 might not be ideal yet for Julia. I really like it for benchmarking otherwise, as it gives reliable performance, the CPU always runs at 3.2 GHz, etc. I also like it for development, as it compiles really quickly.
On the Apple M1, this is what I get, Rosetta:
julia> @benchmark jgemvavx!($y, $A, $x)
BenchmarkTools.Trial: 10000 samples with 199 evaluations.
Range (min … max):  414.990 ns … 465.241 ns  ┊ GC (min … max): 0.00% … 0.00%
Time  (median):     415.829 ns               ┊ GC (median):    0.00%
Time  (mean ± σ):   416.694 ns ±   3.688 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
415 ns Histogram: frequency by time 430 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> versioninfo()
Julia Version 1.6.1
Commit 6aaedecc44 (2021-04-23 05:59 UTC)
Platform Info:
OS: macOS (x86_64-apple-darwin18.7.0)
CPU: Apple M1
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-11.0.1 (ORCJIT, westmere)
Note the use of $. This causes the benchmark to treat the arguments as local variables instead of globals.
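If you want to see the difference it makes, you can time the same call both ways (just a quick comparison; for a kernel this small, the global-variable lookup overhead is measurable):
using BenchmarkTools
b_global = @benchmark jgemvavx!(y, A, x)      # y, A, x looked up as untyped globals
b_interp = @benchmark jgemvavx!($y, $A, $x)   # values interpolated, treated as locals
minimum(b_global).time, minimum(b_interp).time   # minimum times, in ns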
Running Julia natively might be unreliable, as described above, but it's reliable enough for a few simple benchmarks:
julia> @benchmark jgemvavx!($y, $A, $x)
BenchmarkTools.Trial: 10000 samples with 308 evaluations.
Range (min … max):  272.997 ns … 357.143 ns  ┊ GC (min … max): 0.00% … 0.00%
Time  (median):     273.539 ns               ┊ GC (median):    0.00%
Time  (mean ± σ):   274.313 ns ±   3.552 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
273 ns Histogram: log(frequency) by time 289 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> cpu_freq = 3.2e9 # Hz (3.2 GHz)
3.2e9
julia> 273e-9 * cpu_freq / N^2
0.21328125
julia> versioninfo()
Julia Version 1.8.0-DEV.901
Commit 47255f9fe8* (2021-11-06 19:56 UTC)
Platform Info:
OS: macOS (arm64-apple-darwin21.1.0)
CPU: Apple M1
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-12.0.1 (ORCJIT, cyclone)
Environment:
JULIA_NUM_THREADS = 4
So 0.213 cycles/iteration.
The AArch64 assembly (I'll describe how to get it in my next response) should explain how this is possible:
L60:
ldr d16, [x3, x10, lsl #3]
ldp q17, q18, [x11]
add x10, x10, #1 ; =1
cmp x10, x1
fmla v0.2d, v17.2d, v16.d[0]
ldp q19, q17, [x11, #32]
fmla v1.2d, v18.2d, v16.d[0]
fmla v2.2d, v19.2d, v16.d[0]
ldp q18, q19, [x11, #64]
fmla v3.2d, v17.2d, v16.d[0]
fmla v4.2d, v18.2d, v16.d[0]
ldp q17, q18, [x11, #96]
fmla v5.2d, v19.2d, v16.d[0]
add x11, x11, x5
fmla v6.2d, v17.2d, v16.d[0]
fmla v7.2d, v18.2d, v16.d[0]
b.lt L60
One iteration of the loop featured:
- 1x load with immediate offset (1 load)
- 4x load pairs (8 loads)
- 8x fma
The 8 fmas are of course NEON fmas, meaning they operate on 128-bit registers. Thus these 8 fmas perform 16 loop iterations.
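As a quick sanity check on those counts (plain arithmetic, read off the loop above):
fmas      = 8            # fmla.2d instructions, each covering 2 Float64s
elements  = 2 * fmas     # 16 elements of 𝐲 advanced per trip through the loop
a_doubles = 4 * 2 * 2    # 4 ldp × 2 q registers × 2 doubles = 16 doubles of 𝐀
x_doubles = 1            # the lone ldr of 𝐱[j], reused via the .d[0] lane
(a_doubles + x_doubles) / elements   # ≈ 1.06 doubles loaded per element, vs. 2 in the scalar body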
Cool, I can reproduce your numbers using Rosetta:
julia> N=64
64
julia> y = rand(N); A = rand(N,N); x = rand(N);
julia> @benchmark jgemvavx!(y, A, x)
BenchmarkTools.Trial: 10000 samples with 199 evaluations.
Range (min … max):  427.764 ns … 693.467 ns  ┊ GC (min … max): 0.00% … 0.00%
Time  (median):     436.347 ns               ┊ GC (median):    0.00%
Time  (mean ± σ):   437.712 ns ±   9.054 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
428 ns Histogram: log(frequency) by time 460 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark jgemvavx!($y, $A, $x)
BenchmarkTools.Trial: 10000 samples with 199 evaluations.
Range (min … max):  417.714 ns … 680.905 ns  ┊ GC (min … max): 0.00% … 0.00%
Time  (median):     418.342 ns               ┊ GC (median):    0.00%
Time  (mean ± σ):   421.108 ns ±   8.790 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
418 ns Histogram: log(frequency) by time 442 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
I can't get it below 418 ns, but that's close enough to your 415 ns.
I am analyzing your ARM assembly loop now.
Thanks @chriselrod, that explains everything regarding the current status of the tooling. I realize the M1 might not be ideal yet for Julia. I really like it for benchmarking otherwise, as it gives reliable performance, the CPU always runs at 3.2 GHz, etc. I also like it for development, as it compiles really quickly.
Native Julia on the M1 is reliable enough for light benchmarking (and if you can set up an AArch64 Linux VM, or happen to already have one, that seems fine in practice). Building Julia from source is fairly easy if you already have compiler tools. Most binary dependencies should be downloaded (I believe even on Apple M1, but it's been a while -- maybe it compiled them some time ago and I've forgotten), and from there compilation is very fast, as you note. Something like 50% faster than any of my x64 machines. Of course, if you go that route, you'd need at least Julia 1.7 for things to work at all. I'd just use master.
EDIT: FWIW, my M1 is on a Mini. Maybe it is better cooled/able to clock slightly higher? Not that cooling is normally an issue for the M1.
I use the M1 MacBook Pro. The fan never turns on. The 8 core M1 laptop is about as fast for C++ compilation as my beefy 48 core Intel machine.
Does the assembly code you provided correspond to this inner loop?
for j ∈ eachindex(𝐱)
    𝐲ᵢ += 𝐀[i,j] * 𝐱[j]
end
I see 8 fmas, each operates on two doubles, so it is unrolled 16 times. Each fma takes 0.25 cycles. So the cost of just one loop body is 0.125 for the single-double fma operation. I see 4 ldp, each loads two doubles, so it loads 8 doubles. Shouldn't there be 16 loads? I assume those are loads for x[j]. What about A[i,j] (I assume that is the ldr load)? I would expect another 16 single-double loads.
In Julia, the last index is contiguous?
In Julia, the last index is contiguous?
The first index, like Fortran.
The issue is that I can't tell from the plot if the performance is already the best you can do, or if there is still room for improvement
It's hard to say. If we only look at the execution units, the CPU I ran on is capable of about 120 GFLOPS/core, depending on the clock speeds I have it set at. For these benchmarks, I think it was 3.7 or 3.8 GHz, which would put it in the area of 118-122 peak theoretical GFLOP if only looking at the FMA units. The matrix-multiply plots' peaks reached roughly these values, so that's probably about right.
Of course, only gemm comes close to being limited by execution units. Everything else is limited by memory, so focusing only on GFLOPS is perhaps an unrealistic target of "what we wish we could achieve."
I do like the view of cycles/loop iteration though.
The L1 cache is 320 KB
The data cache however is "only" 192 KiB. Compared to the 32 KiB of most Intel/AMD CPUs. Intel Ice Lake and newer have 48 KiB.
So I can't see any mistakes in my analysis. How can I see the assembly code that is actually being run? I tried:
I use Cthulhu.jl. LoopVectorization currently works by compiling a function called _turbo_!. This function often isn't inlined.
Cthulhu lets you descend into a chain of function calls.
julia> using LoopVectorization, Cthulhu
julia> function jgemvavx!(𝐲, 𝐀, 𝐱)
           @turbo for i ∈ eachindex(𝐲)
               𝐲ᵢ = zero(eltype(𝐲))
               for j ∈ eachindex(𝐱)
                   𝐲ᵢ += 𝐀[i,j] * 𝐱[j]
               end
               𝐲[i] = 𝐲ᵢ
           end
       end
jgemvavx! (generic function with 1 method)
julia> N = 64
64
julia> y = rand(N);
julia> A = rand(N, N);
julia> x = rand(N);
julia> @descend jgemvavx!(y, A, x)
You should now see something a little like:
36 â %48 = $(Expr(:foreigncall, :(:jl_array_ptr), Ptr{Float64}, svec(Any), 0, :(:ccall), Core.Argument(4)))::Ptr{Float64}
â %49 = $(Expr(:foreigncall, :(:jl_array_ptr), Ptr{Float64}, svec(Any), 0, :(:ccall), Core.Argument(2)))::Ptr{Float64}
â %50 = $(Expr(:gc_preserve_begin, Core.Argument(3), Core.Argument(4), Core.Argument(2)))
â invoke LoopVectorization._turbo_!($(QuoteNode(Val{(false, 0, 0, 0, false, 8, 64, 32, 64, 32768, 1048576, 20185088, 0x0000000000000001)}()))::Val{(false, 0, 0, 0, false, 8, 64, 32, 64, 32768, 1048576, 20185088, 0x0000000000000001)}, $(QuoteNode(Val{(:numericconstant, Symbol("###zero###5###"), LoopVectorization.OperationStruct(0x00000000000000000000000000000001, 0x00000000000000000000000000000000, 0x00000000000000000000000000000002, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.constant, 0x0001, 0x00), :LoopVectorization, :getindex, LoopVectorization.OperationStruct(0x00000000000000000000000000000012, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.memload, 0x0002, 0x01), :LoopVectorization, :getindex, LoopVectorization.OperationStruct(0x00000000000000000000000000000002, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.memload, 0x0003, 0x02), :LoopVectorization, :vfmadd_fast, LoopVectorization.OperationStruct(0x00000000000000000000000000000012, 0x00000000000000000000000000000002, 0x00000000000000000000000000000000, 0x00000000000000000000000200030001, 0x00000000000000000000000000000000, LoopVectorization.compute, 0x0001, 0x00), :LoopVectorization, :identity, LoopVectorization.OperationStruct(0x00000000000000000000000000000001, 0x00000000000000000000000000000002, 0x00000000000000000000000000000000, 0x00000000000000000000000000000004, 0x00000000000000000000000000000000, LoopVectorization.compute, 0x0001, 0x00), :LoopVectorization, :setindex!, LoopVectorization.OperationStruct(0x00000000000000000000000000000001, 0x00000000000000000000000000000002, 0x00000000000000000000000000000000, 0x00000000000000000000000000000005, 0x00000000000000000000000000000000, LoopVectorization.memstore, 0x0004, 0x03))}()))::Val{(:numericconstant, Symbol("###zero###5###"), LoopVectorization.OperationStruct(0x00000000000000000000000000000001, 0x00000000000000000000000000000000, 0x00000000000000000000000000000002, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.constant, 0x0001, 0x00), :LoopVectorization, :getindex, LoopVectorization.OperationStruct(0x00000000000000000000000000000012, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.memload, 0x0002, 0x01), :LoopVectorization, :getindex, LoopVectorization.OperationStruct(0x00000000000000000000000000000002, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, 0x00000000000000000000000000000000, LoopVectorization.memload, 0x0003, 0x02), :LoopVectorization, :vfmadd_fast, LoopVectorization.OperationStruct(0x00000000000000000000000000000012, 0x00000000000000000000000000000002, 0x00000000000000000000000000000000, 0x00000000000000000000000200030001, 0x00000000000000000000000000000000, LoopVectorization.compute, 0x0001, 0x00), :LoopVectorization, :identity, LoopVectorization.OperationStruct(0x00000000000000000000000000000001, 0x00000000000000000000000000000002, 0x00000000000000000000000000000000, 0x00000000000000000000000000000004, 0x00000000000000000000000000000000, LoopVectorization.compute, 0x0001, 0x00), :LoopVectorization, :setindex!, 
LoopVectorization.OperationStruct(0x00000000000000000000000000000001, 0x00000000000000000000000000000002, 0x00000000000000000000000000000000, 0x00000000000000000000000000000005, 0x00000000000000000000000000000000, LoopVectorization.memstore, 0x0004, 0x03))}, $(QuoteNode(Val{(LoopVectorization.ArrayRefStruct{:ð, Symbol("##vptr##_ð")}(0x00000000000000000000000000000101, 0x00000000000000000000000000000102, 0x00000000000000000000000000000000, 0x00000000000000000000000000000101), LoopVectorization.ArrayRefStruct{:ðą, Symbol("##vptr##_ðą")}(0x00000000000000000000000000000001, 0x00000000000000000000000000000002, 0x00000000000000000000000000000000, 0x00000000000000000000000000000001), LoopVectorization.ArrayRefStruct{:ðē, Symbol("##vptr##_ðē")}(0x00000000000000000000000000000001, 0x00000000000000000000000000000001, 0x00000000000000000000000000000000, 0x00000000000000000000000000000001))}()))::Val{(LoopVectorization.ArrayRefStruct{:ð, Symbol("##vptr##_ð")}(0x00000000000000000000000000000101, 0x00000000000000000000000000000102, 0x00000000000000000000000000000000, 0x00000000000000000000000000000101), LoopVectorization.ArrayRefStruct{:ðą, Symbol("##vptr##_ðą")}(0x00000000000000000000000000000001, 0x00000000000000000000000000000002, 0x00000000000000000000000000000000, 0x00000000000000000000000000000001), LoopVectorization.ArrayRefStruct{:ðē, Symbol("##vptr##_ðē")}(0x00000000000000000000000000000001, 0x00000000000000000000000000000001, 0x00000000000000000000000000000000, 0x00000000000000000000000000000001))}, $(QuoteNode(Val{(0, (), (), (), (), ((1, LoopVectorization.IntOrFloat),), ())}()))::Val{(0, (), (), (), (), ((1, LoopVectorization.IntOrFloat),), ())}, $(QuoteNode(Val{(:i, :j)}()))::Val{(:i, :j)}, $(QuoteNode(Val{Tuple{Tuple{CloseOpenIntervals.CloseOpen{Static.StaticInt{0}, Int64}, CloseOpenIntervals.CloseOpen{Static.StaticInt{0}, Int64}}, Tuple{LayoutPointers.GroupedStridedPointers{Tuple{Ptr{Float64}, Ptr{Float64}, Ptr{Float64}}, (1, 1, 1), (0, 0, 0), ((1, 2), (1,), (1,)), ((1, 2), (3,), (4,)), Tuple{Static.StaticInt{8}, Int64, Static.StaticInt{8}, Static.StaticInt{8}}, NTuple{4, Static.StaticInt{0}}}}}}()))::Val{Tuple{Tuple{CloseOpenIntervals.CloseOpen{Static.StaticInt{0}, Int64}, CloseOpenIntervals.CloseOpen{Static.StaticInt{0}, Int64}}, Tuple{LayoutPointers.GroupedStridedPointers{Tuple{Ptr{Float64}, Ptr{Float64}, Ptr{Float64}}, (1, 1, 1), (0, 0, 0), ((1, 2), (1,), (1,)), ((1, 2), (3,), (4,)), Tuple{Static.StaticInt{8}, Int64, Static.StaticInt{8}, Static.StaticInt{8}}, NTuple{4, Static.StaticInt{0}}}}}}, %4::Int64, %22::Int64, %37::Ptr{Float64}, %48::Ptr{Float64}, %49::Ptr{Float64}, %41::Int64)::Any
â $(Expr(:gc_preserve_end, :(%50))) â
ââââ return nothing â
Select a call to descend into or ↩ to ascend. [q]uit. [b]ookmark.
Toggles: [o]ptimize, [w]arn, [h]ide type-stable statements, [d]ebuginfo, [r]emarks, [i]nlining costs, [t]ype annotations, [s]yntax highlight for Source/LLVM/Native.
Show: [S]ource code, [A]ST, [T]yped code, [L]LVM IR, [N]ative code
Actions: [E]dit source code, [R]evise and redisplay
Advanced: dump [P]arams cache.
 • %51 = invoke _turbo_!(::Val{…},::Val{…},::Val{…},::Val{…},::Val{…},::Val{…},::Int64{…},::Int64{…},::Ptr{…},::Ptr{…},::Ptr{…},::Int64{…})::Any
   ↩
If there were multiple non-inlined functions, you could navigate the dot currently at %51 up and down to select whichever one you chose. Currently, the only two navigation options are %51 to descend into _turbo_! and ↩ to ascend.
There are also a lot of other options listed under toggles, show, and actions.
E.g., you can turn Julia-level optimizations on/off with o, highlight type instabilities with w, etc. L and N let us look at the LLVM and native code.
So by pressing [enter] with the dot on %51 = invoke _turbo_!, and hitting N, I get something like the following (I actually had to hit [d] twice to toggle off the debug information, and then N again to return to the native code view):
L96:
vbroadcastsd zmm8, qword ptr [rcx + 8*rbx]
vfmadd231pd zmm0, zmm8, zmmword ptr [rax] # zmm0 = (zmm8 * mem) + zmm0
vfmadd231pd zmm1, zmm8, zmmword ptr [rax + 64] # zmm1 = (zmm8 * mem) + zmm1
vfmadd231pd zmm2, zmm8, zmmword ptr [rax + 128] # zmm2 = (zmm8 * mem) + zmm2
vfmadd231pd zmm3, zmm8, zmmword ptr [rax + 192] # zmm3 = (zmm8 * mem) + zmm3
vfmadd231pd zmm4, zmm8, zmmword ptr [rax + 256] # zmm4 = (zmm8 * mem) + zmm4
vfmadd231pd zmm5, zmm8, zmmword ptr [rax + 320] # zmm5 = (zmm8 * mem) + zmm5
vfmadd231pd zmm6, zmm8, zmmword ptr [rax + 384] # zmm6 = (zmm8 * mem) + zmm6
vfmadd231pd zmm7, zmm8, zmmword ptr [rax + 448] # zmm7 = (zmm8 * mem) + zmm7
inc rbx
add rax, r9
cmp rbx, rsi
jl L96
This is similar to the ARM assembly above: load from x once, and then do 8 loads from A, using them each for an fma.
This is the approach it takes: unroll (and SIMD) the i loop aggressively, because you can reuse a load from x for every iteration of the i loop.
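A rough plain-Julia sketch of that strategy (just an illustration of the idea, not what LV actually emits; it assumes length(y) is a multiple of 8):
function gemv_blocked!(y, A, x)
    @inbounds for i in 1:8:length(y)
        acc = ntuple(_ -> 0.0, Val(8))           # 8 running sums, ideally kept in SIMD registers
        for j in eachindex(x)
            xj = x[j]                            # one load of x, reused 8 times
            acc = ntuple(k -> muladd(A[i+k-1, j], xj, acc[k]), Val(8))
        end
        for k in 1:8
            y[i+k-1] = acc[k]
        end
    end
    return y
end
The eight accumulators play the same role as the eight zmm (or NEON v) registers in the assembly above.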
FWIW, that gets me
julia> @benchmark jgemvavx!($y, $A, $x)
BenchmarkTools.Trial: 10000 samples with 852 evaluations.
Range (min … max):  141.597 ns … 214.688 ns  ┊ GC (min … max): 0.00% … 0.00%
Time  (median):     142.251 ns               ┊ GC (median):    0.00%
Time  (mean ± σ):   142.467 ns ±   1.163 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
142 ns Histogram: frequency by time 145 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> cpu_freq = 3.6e9; # set in the bios
julia> 141.597e-9 * cpu_freq / N^2
0.12445048828125
The zmm register is 512-bit, so it can hold 512/64 = 8 doubles. What is the reciprocal throughput of the vfmadd231pd instruction on your platform? It seems that is 1 cycle for Ice Lake. So that is 0.125 per double, as fast as the M1. However, your instruction also reads from memory, and I am not sure what the cost is there. Assuming the theoretical peak is 0.125 cycles: you are measuring 0.1245 clock cycles per double, so it is running essentially at 100% of peak?
I use the M1 MacBook Pro. The fan never turns on. The 8 core M1 laptop is about as fast for C++ compilation as my beefy 48 core Intel machine.
This would definitely be true for building Julia. Almost all of the parallelizable parts of building Julia (e.g., LLVM, OpenBLAS) have been replaced with binary downloads. This leaves the C/C++ within Julia itself, and then the majority remaining is single threaded.
Does the assembly code you provided correspond to this inner loop?
Yes, although LV will often rearrange loops.
I see 8 fmas, each operates on two doubles, so it is unrolled 16 times. Each fma takes 0.25 cycles. So the cost of just one loop body is 0.125 for the single double fma operation. I see 4 ldp, each loads two doubles, so it loads 8 doubles.
I'm pretty sure the ldps are loading into q registers, which are 128 bits on ARMv8, not 64 like in x64. An x64 word is 16 bits, vs 32 bits on ARM, so the double/quadwords of ARM are also twice as large. Thus each ldp is loading two 128-bit quadwords, which are then reinterpreted as vectors of 2x Float64 by the fma.
The ldps are loading from the matrix A.
The ldr must be acting like an x86 broadcast here, loading from x.
So basically: broadcast one element from x, and then reuse it for 16 elements of A, before moving on to the next iteration.
What is the reciprocal throughput of the vfmadd231pd instruction on your platform? It seems that is 1 cycle for IceLake.
0.5 cycles. My platform is Cascade Lake, but I also have a Tiger Lake, where it is 1 cycle (like Ice Lake). So I think it is only running at about 1/2 of peak.
However your instruction also reads from memory, I am not sure what the cost is there.
https://uops.info/html-instr/VFMADD231PD_ZMM_ZMM_M512.html
Theoretically 0.5 ("calculated from port usage"), but measured at 0.55. However, this is thanks to parallelism of the micro-ops; this one will compete with other loads.
uiCA is a great tool for Intel architectures. It estimates that the inner loop should take 4.5 cycles on Cascade Lake, or 8 on Tiger Lake.
These correspond to about 0.07 and 0.125 cycles/iteration, meaning I am well behind its theoretical prediction here.
I also tested GFortran:
Here is the Fortran code I used for the benchmarks. Simple improvements/corrections are welcome, but the point of course is seeing how well the compilers transform loops, so improvements shouldn't apply the transformations by hand, but rather encourage the compiler to do them.
This file is the code I used to compile them (you can see the flags there, which I did try to tune), as well as the wrappers to call and benchmark them from Julia. Specifically, I used
if (Sys.ARCH === :x86_64)
run(`gfortran -Ofast -march=native -funroll-loops -mprefer-vector-width=$(8*REGISTER_SIZE) -fvariable-expansion-in-unroller --param max-variable-expansions-in-unroller=4 -shared -fPIC $ffile -o $LIBFTEST`)
else
run(`gfortran -Ofast -march=native -funroll-loops -fvariable-expansion-in-unroller --param max-variable-expansions-in-unroller=4 -shared -fPIC $ffile -o $LIBFTEST`)
end
Which is probably already getting fancier with the flags than most, but if there's more that can be done to help, I'm open to it.
Flags like -floop-nest-optimize seemed to make performance worse, if they had any impact at all.
Intel flags:
ifort -fast -qopt-zmm-usage=high -qoverride-limits -shared -fPIC $ffile -o $LIBIFTEST
Options like -qopt-zmm-usage=high (and the mprefer-vector-width from Fortran) of course reflect that I was running these benchmarks on a system with AVX512.
Ok, so the vectorization goes roughly like this if I understand it correctly. You start from:
real(dp) :: yi
do i = 1, size(y)
    yi = 0
    do j = 1, size(x)
        yi = yi + A(i,j) * x(j)
    end do
    y(i) = yi
end do
Then you vectorize the outer loop:
real(dp) :: yi(16)
do i = 1, size(y), 16
    do k = 1, 16
        yi(k) = 0
        do j = 1, size(x)
            yi(k) = yi(k) + A(i+k-1,j) * x(j)
        end do
        y(i+k-1) = yi(k)
    end do
end do
Then you move the k loop into the inner loop:
real(dp) :: yi(16)
do i = 1, size(y), 16
    yi = 0
    do j = 1, size(x)
        do k = 1, 16
            yi(k) = yi(k) + A(i+k-1,j) * x(j)
        end do
    end do
    y(i:i+15) = yi
end do
I haven't tested it; hopefully I didn't make a mistake. Now the inner loop is:
yi(k) = yi(k) + A(i+k-1,j) * x(j)
And the cost is 1 fma and 1 memory read (R). For now we assume everything else gets amortized (if not, we'll have to add it to the peak).
On the M1, the speeds are (in clock cycles per double):
- fma: 0.125
- R: 0.1665
So the bottleneck is the memory read, and the theoretical performance peak should be 0.1665. You are getting 0.21328125, so that is 78% of peak. That is very impressive.
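In the same max(load, fma) accounting as before (my numbers, per element of y):
peak = max(1 * 0.1665, 1 * 0.125)   # one load of A plus one fma per element; x(j) is reused
peak / 0.21328125                    # ≈ 0.78, i.e. about 78% of this load-bound peak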
Btw, the last manually unrolled version runs at 0.305 clock cycles with gfortran, essentially as fast as the intrinsic matmul. Here is the assembly code for the inner loop:
L5:
mov x15, x16
ldr d31, [x17, 56]
ldp d3, d26, [x17, -8]
ldp d30, d29, [x17, 8]
ldr d8, [x15], 8
ldp d28, d27, [x17, 24]
fmadd d2, d8, d26, d2
fmadd d23, d8, d29, d23
ldp d26, d29, [x17, 64]
fmadd d25, d3, d8, d25
ldp d0, d3, [x17, 40]
fmadd d24, d8, d30, d24
fmadd d30, d8, d26, d17
fmadd d22, d8, d28, d22
ldp d26, d17, [x17, 104]
fmadd d21, d8, d27, d21
fmadd d20, d8, d0, d20
ldr d28, [x17, 80]
ldr d0, [x17, 88]
fmadd d31, d8, d31, d18
ldr d27, [x17, 96]
fmadd d29, d8, d29, d16
fmadd d28, d8, d28, d7
fmadd d19, d8, d3, d19
fmadd d26, d8, d26, d4
fmadd d3, d8, d0, d6
fmadd d27, d8, d27, d5
fmadd d1, d8, d17, d1
ldr d0, [x16, 8]
add x16, x15, 8
ldr d5, [x17, 520]
ldr d7, [x17, 536]
ldr d8, [x17, 576]
fmadd d24, d0, d5, d24
ldr d4, [x17, 528]
fmadd d22, d0, d7, d22
ldr d5, [x17, 560]
fmadd d17, d0, d8, d30
ldr d18, [x17, 568]
ldr d16, [x17, 584]
fmadd d23, d0, d4, d23
ldr d7, [x17, 592]
fmadd d19, d0, d5, d19
ldr d8, [x17, 608]
fmadd d18, d0, d18, d31
fmadd d16, d0, d16, d29
ldr d6, [x17, 504]
fmadd d7, d0, d7, d28
ldr d31, [x17, 512]
fmadd d5, d0, d8, d27
ldr d30, [x17, 544]
ldr d29, [x17, 552]
fmadd d25, d6, d0, d25
ldr d28, [x17, 600]
fmadd d2, d0, d31, d2
ldr d4, [x17, 616]
fmadd d21, d0, d30, d21
ldr d27, [x17, 624]
fmadd d20, d0, d29, d20
fmadd d6, d0, d28, d3
add x17, x17, 1024
fmadd d4, d0, d4, d26
fmadd d1, d0, d27, d1
cmp x16, x19
bne L5
I also tried to remove the -funroll-loops flag from gfortran; now I get:
L5:
ldp d28, d27, [x0, -8]
ldp d26, d25, [x0, 8]
ldr d0, [x2], 8
fmadd d24, d28, d0, d24
fmadd d23, d0, d27, d23
fmadd d22, d0, d26, d22
fmadd d21, d0, d25, d21
ldp d28, d27, [x0, 24]
ldp d26, d25, [x0, 40]
fmadd d20, d0, d28, d20
fmadd d19, d0, d27, d19
ldp d28, d27, [x0, 56]
fmadd d18, d0, d26, d18
fmadd d17, d0, d25, d17
ldp d26, d25, [x0, 72]
fmadd d16, d0, d28, d16
fmadd d7, d0, d27, d7
ldp d28, d27, [x0, 88]
fmadd d6, d0, d26, d6
fmadd d5, d0, d25, d5
ldp d26, d25, [x0, 104]
fmadd d4, d0, d28, d4
fmadd d3, d0, d27, d3
add x0, x0, 512
fmadd d2, d0, d26, d2
fmadd d1, d0, d25, d1
cmp x2, x19
bne L5
This runs at 0.2946 clock cycles.
These are not vector instructions, are they? I wonder if these operate on one double at a time.
Tiger Lake makes it very easy to get a high percentage of peak, because it is very bottlenecked by lack of (FMA) execution resources:
julia> using LinuxPerf
julia> foreachf(f::F, N::Int, arg1::A, args::Vararg{Any,K}) where {F,A,K} = foreach(_ -> f(arg1, args...), 1:N)
foreachf (generic function with 1 method)
julia> @pstats "(cpu-cycles,task-clock),(instructions,branch-instructions,branch-misses), (L1-dcache-load-misses, L1-dcache-loads, cache-misses, cache-references)" begin
foreachf(jgemvavx!, 1_000_000, y, A, x)
end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌ cpu-cycles               5.20e+08  100.0%  #   4.4 cycles per ns
└ task-clock               1.19e+08  100.0%  # 119.2 ms
┌ instructions             9.21e+08  100.0%  #   1.8 insns per cycle
│ branch-instructions      7.20e+07  100.0%  #   7.8% of instructions
└ branch-misses            1.00e+06  100.0%  #   1.4% of branch instructions
┌ L1-dcache-load-misses    9.60e+03  100.0%  #   0.0% of dcache loads
│ L1-dcache-loads          5.96e+08  100.0%
│ cache-misses             6.85e+02  100.0%  #  79.3% of cache references
└ cache-references         8.64e+02  100.0%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
julia> 5.2e8 / (1_000_000 * N^2)
0.126953125
I switched to using LinuxPerf.jl here because my laptop's clock speed is erratic, but I could of course also disable the turbo, which should pin it at roughly 2.8 GHz.
However, LinuxPerf also lets us calculate cycles/iteration more directly: we can just run the function a million times, and then compare the number of clock cycles with N^2.
LinuxPerf is of course Linux-only. It's probably x86-only too (given it needs to run performance counters? I'd have to check if it supports ARM).
Interestingly, we hit 0.1269, meaning we're at around 98% of the theoretical peak!
Tiger Lake is a big upgrade over Skylake-X/Cascade lake cores in most respects, except that it only has a single fma unit. Thus, if Skylake-X/Cascade Lake can reach or exceed 1 FMA/cycle, the only reason Tiger Lake couldn't is because it is missing the extra unit. Being so starved on execution, anything bottlenecked by SIMD FMA makes it easy to hit the theoretical peak on Ice/Tiger/Rocket lake.
Ok, so the vectorization goes roughly like this if I understand it correctly. You start from: And the cost is 1 fma, and 1 memory read (R). For now we assume everything else gets amortized (if not, we'll have to add it to the peak).
Yes, that's exactly right. The yi(k) and x(j) are of course hoisted out.
We can't totally amortize the x(j), because we're only getting 16 reuses here, but it goes a long way toward ameliorating the 2-loads-per-fma situation we have with the code as written.
These are not vector instructions, are they? I wonder if these operate on one double at a time.
You're right.
Try -march=armv8.5-a, although -march=armv8-a should be enough.
Interesting. Yes, different platforms have different pros and cons, but plotting the clock cycles and also the peak into the same graph makes it easy to see how well each compiler is doing. The flops are not very informative, especially since many of these benchmarks (depending on the platform!) are limited by memory reads.
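For what it's worth, this is roughly the plot I have in mind (a sketch; it assumes vectors sizes, measured times ts in seconds, a clock freq in Hz, and a peak in cycles/iteration have already been computed):
using Plots
cycles = ts .* freq ./ sizes .^ 2                    # cycles per loop-body evaluation
plot(sizes, cycles; xscale = :log10, xlabel = "N",
     ylabel = "clock cycles / iteration", label = "measured")
hline!([peak]; label = "theoretical peak", linestyle = :dash)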
I also tried lfortran --fast a.f90 on the manually unrolled code. I get 1.328 clock cycles. ;) The assembly is a mess; I think this is the relevant part:
LBB7_5:
ldr w10, [sp, #176]
add w8, w24, w30
add w23, w2, w9
ldr d0, [x28, w8, sxtw #3]
add w8, w23, #1
ldr d1, [x27, w8, sxtw #3]
sub w8, w13, w10
sbfiz x8, x8, #3, #32
ldr d2, [x11, x8]
fmul d1, d1, d0
add w10, w23, #2
add w30, w30, #1
fadd d1, d2, d1
str d1, [x11, x8]
ldr w8, [sp, #176]
ldr d1, [x27, w10, sxtw #3]
add w10, w23, #3
cmp w30, w25
sub w8, w14, w8
sbfiz x8, x8, #3, #32
ldr d2, [x11, x8]
fmul d1, d1, d0
add w9, w9, w26
fadd d1, d2, d1
str d1, [x11, x8]
ldr w8, [sp, #176]
ldr d1, [x27, w10, sxtw #3]
add w10, w23, #4
sub w8, w15, w8
sbfiz x8, x8, #3, #32
ldr d2, [x11, x8]
fmul d1, d1, d0
fadd d1, d2, d1
...
It's not using fmas. That's because I didn't figure out how to force the --fast-math flag to LLVM from C++... One can use a workaround:
lfortran --show-llvm a.f90 > a.ll
clang -O3 -ffast-math -funroll-loops -c a.ll -o a.o
lfortran -o a.out a.o
Still 1.32 cycles and no fmas. Not sure what is going on.
Try -march=armv8.5-a, although -march=armv8-a should be enough.
Ah, that's how you do it. I was always doing:
$ clang -O3 -march=native -ffast-math -funroll-loops -c a.ll -o a.o
clang-11: error: the clang compiler does not support '-march=native'
But this works:
clang -O3 -march=armv8.5-a -ffast-math -funroll-loops -c a.ll -o a.o
Still no fma. I also tried the same with gfortran, but no change either.
I think LFortran might generate LLVM IR for the indexing operations in a way that is hard for LLVM to optimize.
I always thought both MKL and OpenBLAS have state-of-the-art matrix-vector multiply. I can't quite find their lines in the plot, but they don't seem to be doing well at all. Why is that?
My understanding is that Goto wrote it by hand in assembly for the Pentium 4. I don't know if this is close to the original code:
- https://github.com/xianyi/OpenBLAS/blob/develop/kernel/x86_64/dgemv_n.S
It looks really complicated. Do you know what vectorization strategy he used? Due to the lack of fma, he had to very carefully hide the latency of the mul and then the add; the older CPUs were not as good at reordering instructions and executing them at the same time. However, the Pentium 4 already had xmm registers, so I would think the strategy might be similar to today's.
Update: I found some details here: https://dl.acm.org/doi/10.1145/1356052.1356053, although mostly for gemm.
Now let's try C++ on the M1 for N=64.
First the original benchmark, as written, takes 1.77428 cycles:
void gemv4(double* y, double* A, double* x){
    for (long m = 0; m < N; m++){
        y[m] = 0.0;
    }
    for (long i = 0; i < N; i++){
        for (long j = 0; j < N; j++){
            y[i] += A[i + j*N] * x[j];
        }
    }
}
The manually unrolled loop is 0.24548 cycles:
void gemv5(double* y, double* A, double* x){
    for (long m = 0; m < N; m++){
        y[m] = 0.0;
    }
    double yi[16];
    for (long i = 0; i < N; i+=16) {
        for (long k = 0; k < 16; k++) yi[k] = 0.0;
        for (long j = 0; j < N; j++) {
            for (long k = 0; k < 16; k++) {
                yi[k] += A[i+k + j*N] * x[j];
            }
        }
        for (long k = 0; k < 16; k++) y[i+k] = yi[k];
    }
}
If you transpose A, it takes 0.242369 clock cycles:
void gemv(double* y, double* A, double* x){
    for (long m = 0; m < N; m++){
        y[m] = 0.0;
    }
    for (long k = 0; k < N; k++){
        for (long m = 0; m < N; m++){
            y[m] += A[m + k*N] * x[k];
        }
    }
}
That's your code from: https://github.com/JuliaSIMD/LoopVectorization.jl/blob/134a95c84aacd74cfffae484918fd179e93fb9d9/benchmark/looptests.c#L154
I used the native Clang:
clang++ -std=c++17 -O3 -march=armv8.5-a -ffast-math d.cpp
The assembly for the gemv5 version:
LBB4_1: ; =>This Inner Loop Header: Depth=1
ldr d16, [x2, x8]
ldp q17, q18, [x9, #-64]
fmla.2d v0, v17, v16[0]
fmla.2d v1, v18, v16[0]
ldp q17, q18, [x9, #-32]
fmla.2d v2, v17, v16[0]
fmla.2d v3, v18, v16[0]
ldp q17, q18, [x9]
fmla.2d v4, v17, v16[0]
fmla.2d v5, v18, v16[0]
ldp q17, q18, [x9, #32]
fmla.2d v6, v17, v16[0]
fmla.2d v7, v18, v16[0]
add x8, x8, #8 ; =8
add x9, x9, #512 ; =512
cmp x8, #512 ; =512
b.ne LBB4_1
(I hope this is the right one.) It seems equivalent to the assembly from LV you posted above.