
cpu: aarch64: matmul: enable brgemm matmul bf16

Open michalowski-arm opened this pull request 1 month ago • 4 comments

Description

This enables the brgemm-based bf16 matmul on aarch64 for 2- and 3-dimensional shapes. Below are benchdnn performance numbers for bf16 matmul on c8g with 32 threads (times in ms, lower is better):

| shape | brg:sve_128 | gemm:acl | gemm:jit |
|---|---|---|---|
| 16x16:16x16 | 0.00387 | 0.0251 | 0.00433 |
| 64x64:64x64 | 0.00394 | 0.0386 | 0.0430 |
| 128x128:128x128 | 0.00518 | 0.0384 | 0.316 |
| 512x512:512x512 | 0.0803 | 0.0801 | 10.7 |
| 1024x1024:1024x1024 | 0.625 | 0.393 | 72.9 |
| 4096x4096:4096x4096 | 41.6 | 23.6 | 4408 |

All tests from the nightly test set, run via `ctest -R matmul`, passed on both c7g and c8g instances.
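For reference, numbers like these come from the benchdnn matmul driver in performance mode, e.g. `./benchdnn --matmul --mode=P --dt=bf16:bf16:bf16 1024x1024:1024x1024` (exact flags may vary between oneDNN versions); running with `ONEDNN_VERBOSE=1` shows which implementation (brg, gemm:acl, gemm:jit) was actually dispatched.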

michalowski-arm avatar Nov 12 '25 14:11 michalowski-arm

> 1024x1024:1024x1024: brg = 0.625, acl = 0.393

Any idea why this shape seems to behave so differently from the rest? It makes me wonder how the acl impl would stack up if we split 4096x4096 into 4 subtiles.

Sqvid avatar Nov 14 '25 11:11 Sqvid

> 1024x1024:1024x1024: brg = 0.625, acl = 0.393
>
> Any idea why this shape seems to behave so differently from the rest?

If you mean why brgemm becomes slower than gemm:acl here: it's because brgemm currently uses the bfdot instruction rather than bfmmla. That seems to make the difference from 512x512:512x512 onwards.
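To illustrate the gap (a minimal sketch with ACLE NEON intrinsics, not code from this PR): per 128-bit instruction, bfdot accumulates a 2-element bf16 dot product into each 32-bit lane, while bfmmla accumulates a full 2x2 f32 block from two 2x4 bf16 operands, i.e. twice the multiply-accumulates per instruction.

```cpp
// Illustration only; compile with e.g. -march=armv8.6-a+bf16.
#include <arm_neon.h>

float32x4_t dot_step(float32x4_t acc, bfloat16x8_t a, bfloat16x8_t b) {
    // BFDOT: each 32-bit lane of acc accumulates a 2-element bf16 dot
    // product, i.e. 8 multiply-accumulates per instruction.
    return vbfdotq_f32(acc, a, b);
}

float32x4_t mmla_step(float32x4_t acc, bfloat16x8_t a, bfloat16x8_t b) {
    // BFMMLA: a and b are read as 2x4 bf16 matrices, and acc (viewed as a
    // 2x2 f32 block) accumulates a * b^T, i.e. 16 multiply-accumulates per
    // instruction, twice the work of BFDOT.
    return vbfmmlaq_f32(acc, a, b);
}
```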

michalowski-arm avatar Nov 14 '25 11:11 michalowski-arm

> If you mean why brgemm becomes slower than gemm:acl here: it's because brgemm currently uses the bfdot instruction rather than bfmmla. That seems to make the difference from 512x512:512x512 onwards.

That makes sense. Can we put in a heuristic to dispatch based on size+bf16 then (along with a comment)?
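Something along these lines, perhaps (a hypothetical sketch, not oneDNN code; the name and threshold are illustrative, with the crossover eyeballed from the table above):

```cpp
#include <cstdint>

// Hypothetical size + data-type dispatch guard, illustrative only.
// Idea: the bfdot-based brgemm matmul declines large bf16 problems so a
// bfmmla-based implementation further down the dispatch list is used.
bool brgemm_bf16_profitable(std::int64_t M, std::int64_t N, std::int64_t K,
        bool is_bf16) {
    if (!is_bf16) return true;
    // brg and gemm:acl are roughly tied at 512x512:512x512 and acl wins
    // beyond that; the real threshold would need per-platform tuning.
    const std::int64_t crossover = std::int64_t(512) * 512 * 512;
    return M * N * K <= crossover;
}
```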

Sqvid avatar Nov 14 '25 11:11 Sqvid

> If you mean why brgemm becomes slower than gemm:acl here: it's because brgemm currently uses the bfdot instruction rather than bfmmla. That seems to make the difference from 512x512:512x512 onwards.
>
> That makes sense. Can we put in a heuristic to dispatch based on size+bf16 then (along with a comment)?

We need to tread cautiously with heuristics here, because I think jit_bf16_matmul is better than the brgemm matmul; it just doesn't support many memory formats. So the picture is quite complicated and fast-moving, and it may not be worth the complexity of a heuristic until the dust settles.

jondea avatar Dec 01 '25 14:12 jondea