
cpu: aarch64: matmul: enable brgemm matmul bf16

Open michalowski-arm opened this pull request 1 month ago • 4 comments

Description

This enables the brgemm-based bf16 matmul on aarch64 for 2- and 3-dimensional shapes. Below are benchdnn performance numbers for bf16 matmul on c8g with 32 threads (times in ms, lower is better):

| shape | brg:sve_128 | gemm:acl | gemm:jit |
|---|---|---|---|
| 16x16:16x16 | 0.00387 | 0.0251 | 0.00433 |
| 64x64:64x64 | 0.00394 | 0.0386 | 0.0430 |
| 128x128:128x128 | 0.00518 | 0.0384 | 0.316 |
| 512x512:512x512 | 0.0803 | 0.0801 | 10.7 |
| 1024x1024:1024x1024 | 0.625 | 0.393 | 72.9 |
| 4096x4096:4096x4096 | 41.6 | 23.6 | 4408 |

All tests from the nightly test set, run via `ctest -R matmul`, passed on both c7g and c8g instances.
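For reference, numbers like these come from the benchdnn matmul driver in performance mode, e.g. `./benchdnn --matmul --mode=P --dt=bf16:bf16:bf16 1024x1024:1024x1024` (exact flags may vary between oneDNN versions); running with `ONEDNN_VERBOSE=1` shows which implementation (brg, gemm:acl, gemm:jit) was actually dispatched.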

michalowski-arm avatar Nov 12 '25 14:11 michalowski-arm

> 1024x1024:1024x1024: brg = 0.625, acl = 0.393

Any idea why this shape seems to behave so differently from the rest? It makes me wonder how the acl impl would stack up if we split 4096x4096 into 4 subtiles.

Sqvid avatar Nov 14 '25 11:11 Sqvid

> 1024x1024:1024x1024: brg = 0.625, acl = 0.393
>
> Any idea why this shape seems to behave so differently from the rest?

If you mean why brgemm becomes slower than gemm:acl here: it's because brgemm currently uses the bfdot instruction rather than bfmmla. That seems to make the difference from 512x512:512x512 onwards.
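To illustrate the gap (a minimal sketch with ACLE NEON intrinsics, not code from this PR): per 128-bit instruction, bfdot accumulates a 2-element bf16 dot product into each 32-bit lane, while bfmmla accumulates a full 2x2 f32 block from two 2x4 bf16 operands, i.e. twice the multiply-accumulates per instruction.

```cpp
// Illustration only; compile with e.g. -march=armv8.6-a+bf16.
#include <arm_neon.h>

float32x4_t dot_step(float32x4_t acc, bfloat16x8_t a, bfloat16x8_t b) {
    // BFDOT: each 32-bit lane of acc accumulates a 2-element bf16 dot
    // product, i.e. 8 multiply-accumulates per instruction.
    return vbfdotq_f32(acc, a, b);
}

float32x4_t mmla_step(float32x4_t acc, bfloat16x8_t a, bfloat16x8_t b) {
    // BFMMLA: a and b are read as 2x4 bf16 matrices, and acc (viewed as a
    // 2x2 f32 block) accumulates a * b^T, i.e. 16 multiply-accumulates per
    // instruction, twice the work of BFDOT.
    return vbfmmlaq_f32(acc, a, b);
}
```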

michalowski-arm avatar Nov 14 '25 11:11 michalowski-arm

> If you mean why brgemm becomes slower than gemm:acl here: it's because brgemm currently uses the bfdot instruction rather than bfmmla. That seems to make the difference from 512x512:512x512 onwards.

That makes sense. Can we put in a heuristic to dispatch based on size+bf16 then (along with a comment)?
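Something along these lines, perhaps (a hypothetical sketch, not oneDNN code; the name and threshold are illustrative, with the crossover eyeballed from the table above):

```cpp
#include <cstdint>

// Hypothetical size + data-type dispatch guard, illustrative only.
// Idea: the bfdot-based brgemm matmul declines large bf16 problems so a
// bfmmla-based implementation further down the dispatch list is used.
bool brgemm_bf16_profitable(std::int64_t M, std::int64_t N, std::int64_t K,
        bool is_bf16) {
    if (!is_bf16) return true;
    // brg and gemm:acl are roughly tied at 512x512:512x512 and acl wins
    // beyond that; the real threshold would need per-platform tuning.
    const std::int64_t crossover = std::int64_t(512) * 512 * 512;
    return M * N * K <= crossover;
}
```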

Sqvid avatar Nov 14 '25 11:11 Sqvid

> If you mean why brgemm becomes slower than gemm:acl here: it's because brgemm currently uses the bfdot instruction rather than bfmmla. That seems to make the difference from 512x512:512x512 onwards.
>
> That makes sense. Can we put in a heuristic to dispatch based on size+bf16 then (along with a comment)?

We need to tread cautiously with heuristics here, because I think jit_bf16_matmul is better than the brgemm matmul; it just doesn't support many memory formats. So the picture is quite complicated and fast-moving, and it may not be worth the complexity of a heuristic until the dust settles.

jondea avatar Dec 01 '25 14:12 jondea