applegpu icon indicating copy to clipboard operation
applegpu copied to clipboard

SIMD futures on A16/M3 GPU

Open philipturner opened this issue 1 year ago • 0 comments

NDArrayMatrixMultiplyA16 does not contain simd async copy instructions, although the kernel for A14 does. Starting with AGX3 (A15), there are some new instructions used for GEMM and Conv. I haven't checked whether they're accessible from __asm (SIMD futures are not).

; Function Attrs: nounwind memory(write)
declare void @llvm.agx3.store.with.emask.global.i16.v2i16(ptr addrspace(1), <2 x i16>, i16, i16, i16) #7

; Function Attrs: nounwind memory(write)
declare void @llvm.agx3.store.with.emask.global.i32.v2i32(ptr addrspace(1), <2 x i32>, i16, i16, i16) #7

; Function Attrs: nounwind speculatable memory(none)
declare i16 @llvm.agx3.edgecheck(i32, i32, i32) #8

; Function Attrs: nounwind memory(read)
declare <2 x i16> @llvm.agx3.load.with.emask.global.v2i16.i16(ptr addrspace(1), i16, i16, i16) #10

; Function Attrs: nounwind memory(read)
declare <2 x i32> @llvm.agx3.load.with.emask.global.v2i32.i32(ptr addrspace(1), i16, i16, i16) #10

; Function Attrs: nounwind memory(read)
declare <1 x i16> @llvm.agx3.load.with.emask.global.v1i16.i16(ptr addrspace(1), i16, i16, i16) #10

; Function Attrs: nounwind memory(read)
declare <1 x i32> @llvm.agx3.load.with.emask.global.v1i32.i32(ptr addrspace(1), i16, i16, i16) #10

; Function Attrs: nounwind memory(read)
declare <4 x i16> @llvm.agx3.load.with.emask.global.v4i16.i16(ptr addrspace(1), i16, i16, i16) #10

; Function Attrs: nounwind memory(read)
declare <4 x i32> @llvm.agx3.load.with.emask.global.v4i32.i32(ptr addrspace(1), i16, i16, i16) #10

Furthermore, unlike A14/M1, at least A16 can access 65536 bytes of registers from a single SIMD-group. That is more than physically possible.

  %31 = alloca [16 x [16 x %"struct.metal::simdgroup_matrix"]], align 256
  call void @llvm.lifetime.end.p0(i64 65536, ptr nonnull %292) #14

Luckily, SIMD futures run correctly and performantly on A15/A16. I do worry that this MPS kernel is referencing their unreleased A16 ray tracing GPU (or the in-development M3), which might remove support for SIMD futures.

Source: https://gist.github.com/philipturner/939d4ffda26e66f10a142c82d8d498e9

Results (A15)

GEMM dimensions: 256x256x256
2023-06-09 12:16:16.984966-0400 SIMDFuturesA15[32193:1233147] Metal GPU Frame Capture Enabled
2023-06-09 12:16:16.985621-0400 SIMDFuturesA15[32193:1233147] Metal API Validation Enabled

Metal FlashAttention: 'f16'
GFLOPS: 269
Metal Performance Shaders: 'f16 (either ANE or GPU tensor cores)'
GFLOPS: 48
MFA vs MPS Euclidean Distance: 30.969585

GEMM dimensions: 512x512x512

Metal FlashAttention: 'f16'
GFLOPS: 465
Metal Performance Shaders: 'f16 (either ANE or GPU tensor cores)'
GFLOPS: 249
MFA vs MPS Euclidean Distance: 209.93332

GEMM dimensions: 768x768x768

Metal FlashAttention: 'f16'
GFLOPS: 948
Metal Performance Shaders: 'f16 (either ANE or GPU tensor cores)'
GFLOPS: 749
MFA vs MPS Euclidean Distance: 894.5818

GEMM dimensions: 1024x1024x1024

Metal FlashAttention: 'f32'
GFLOPS: 1215
Metal Performance Shaders: 'f32'
GFLOPS: 1353
MFA vs MPS Euclidean Distance: 0.0

Metal FlashAttention: 'f16'
GFLOPS: 1184
Metal Performance Shaders: 'f16'
GFLOPS: 1265
MFA vs MPS Euclidean Distance: 2008.9558

Metal FlashAttention: 'f16'
GFLOPS: 1262
Metal Performance Shaders: 'f16 (either ANE or GPU tensor cores)'
GFLOPS: 1194
MFA vs MPS Euclidean Distance: 2009.3207

GEMM dimensions: 1280x1280x1280

Metal FlashAttention: 'f16'
GFLOPS: 1618
Metal Performance Shaders: 'f16 (either ANE or GPU tensor cores)'
GFLOPS: 1972
MFA vs MPS Euclidean Distance: 5536.6294

GEMM dimensions: 1536x1536x1536

Metal FlashAttention: 'f16'
GFLOPS: 1611
Metal Performance Shaders: 'f16 (either ANE or GPU tensor cores)'
GFLOPS: 2177
MFA vs MPS Euclidean Distance: 10459.943

GEMM dimensions: 1792x1792x1792

Metal FlashAttention: 'f16'
GFLOPS: 1622
Metal Performance Shaders: 'f16 (either ANE or GPU tensor cores)'
GFLOPS: 2355
MFA vs MPS Euclidean Distance: 16553.246

GEMM dimensions: 2048x2048x2048

Metal FlashAttention: 'f32'
GFLOPS: 1397
Metal Performance Shaders: 'f32'
GFLOPS: 1326
MFA vs MPS Euclidean Distance: 0.0

Metal FlashAttention: 'f16'
GFLOPS: 1624
Metal Performance Shaders: 'f16'
GFLOPS: 1303
MFA vs MPS Euclidean Distance: 24126.244

Metal FlashAttention: 'f16'
GFLOPS: 1624
Metal Performance Shaders: 'f16 (either ANE or GPU tensor cores)'
GFLOPS: 2405
MFA vs MPS Euclidean Distance: 24127.777

GEMM dimensions: 4096x4096x4096

Metal FlashAttention: 'f16'
GFLOPS: 1594
Metal Performance Shaders: 'f16 (either ANE or GPU tensor cores)'
GFLOPS: 2941
MFA vs MPS Euclidean Distance: 305298.94
Results (A16)
GEMM dimensions: 256x256x256
2023-06-09 08:53:08.328381-0700 TestUI[48119:10608193] Metal GPU Frame Capture Enabled
2023-06-09 08:53:08.328473-0700 TestUI[48119:10608193] Metal API Validation Enabled
2023-06-09 08:53:09.303047-0700 TestUI[48119:10608193] fopen failed for data file: errno = 2 (No such file or directory)
2023-06-09 08:53:09.303155-0700 TestUI[48119:10608193] Errors found! Invalidating cache...
2023-06-09 08:53:09.346694-0700 TestUI[48119:10608193] fopen failed for data file: errno = 2 (No such file or directory)
2023-06-09 08:53:09.346763-0700 TestUI[48119:10608193] Errors found! Invalidating cache...

Metal FlashAttention: 'f16'
GFLOPS: 632
Metal Performance Shaders: 'f16 (either ANE or GPU tensor cores)'
GFLOPS: 68
MFA vs MPS Euclidean Distance: 30.969585

GEMM dimensions: 512x512x512

Metal FlashAttention: 'f16'
GFLOPS: 378
Metal Performance Shaders: 'f16 (either ANE or GPU tensor cores)'
GFLOPS: 387
MFA vs MPS Euclidean Distance: 209.93332

GEMM dimensions: 768x768x768

Metal FlashAttention: 'f16'
GFLOPS: 957
Metal Performance Shaders: 'f16 (either ANE or GPU tensor cores)'
GFLOPS: 1087
MFA vs MPS Euclidean Distance: 894.5818

GEMM dimensions: 1024x1024x1024

Metal FlashAttention: 'f32'
GFLOPS: 1163
Metal Performance Shaders: 'f32'
GFLOPS: 1070
MFA vs MPS Euclidean Distance: 0.0

Metal FlashAttention: 'f16'
GFLOPS: 1130
Metal Performance Shaders: 'f16'
GFLOPS: 1273
MFA vs MPS Euclidean Distance: 2008.9558

Metal FlashAttention: 'f16'
GFLOPS: 1401
Metal Performance Shaders: 'f16 (either ANE or GPU tensor cores)'
GFLOPS: 1949
MFA vs MPS Euclidean Distance: 2009.3207

GEMM dimensions: 1280x1280x1280

Metal FlashAttention: 'f16'
GFLOPS: 1606
Metal Performance Shaders: 'f16 (either ANE or GPU tensor cores)'
GFLOPS: 2799
MFA vs MPS Euclidean Distance: 5536.6294

GEMM dimensions: 1536x1536x1536

Metal FlashAttention: 'f16'
GFLOPS: 1610
Metal Performance Shaders: 'f16 (either ANE or GPU tensor cores)'
GFLOPS: 3306
MFA vs MPS Euclidean Distance: 10459.943

GEMM dimensions: 1792x1792x1792

Metal FlashAttention: 'f16'
GFLOPS: 1622
Metal Performance Shaders: 'f16 (either ANE or GPU tensor cores)'
GFLOPS: 3159
MFA vs MPS Euclidean Distance: 16553.246

GEMM dimensions: 2048x2048x2048

Metal FlashAttention: 'f32'
GFLOPS: 1474
Metal Performance Shaders: 'f32'
GFLOPS: 1379
MFA vs MPS Euclidean Distance: 0.0

Metal FlashAttention: 'f16'
GFLOPS: 1623
Metal Performance Shaders: 'f16'
GFLOPS: 1305
MFA vs MPS Euclidean Distance: 24126.244

Metal FlashAttention: 'f16'
GFLOPS: 1622
Metal Performance Shaders: 'f16 (either ANE or GPU tensor cores)'
GFLOPS: 3484
MFA vs MPS Euclidean Distance: 24127.777

GEMM dimensions: 4096x4096x4096

philipturner avatar Jun 09 '23 16:06 philipturner