cpu: aarch64: matmul: enable brgemm matmul bf16
Description
This enables bf16 brgemm matmul on aarch64 for 2D and 3D shapes. Below are benchdnn timings for bf16 matmul on c8g with 32 threads (in ms, lower is better):
| shape | brg:sve_128 | gemm:acl | gemm:jit |
|---|---|---|---|
| 16x16:16x16 | 0.00387 | 0.0251 | 0.00433 |
| 64x64:64x64 | 0.00394 | 0.0386 | 0.0430 |
| 128x128:128x128 | 0.00518 | 0.0384 | 0.316 |
| 512x512:512x512 | 0.0803 | 0.0801 | 10.7 |
| 1024x1024:1024x1024 | 0.625 | 0.393 | 72.9 |
| 4096x4096:4096x4096 | 41.6 | 23.6 | 4408 |
All tests from the nightly test set, run using `ctest -R matmul`, passed on both c7g and c8g instances.
> 1024x1024:1024x1024: brg = 0.625, acl = 0.393
Any idea why this shape seems to behave so differently from the rest? It makes me wonder how the acl impl would stack up if we split 4096x4096 into 4 subtiles.
If you mean why brgemm becomes slower than gemm:acl here, it's because brgemm currently uses the bfdot instruction instead of bfmmla. That seems to make the difference from 512x512:512x512 onwards.
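To make the crossover concrete, here is a small sketch that computes the brg/acl time ratio per shape from the table in the description. The timings are copied from that table; the helper names (`brg_over_acl`, `first_size_where_acl_wins`) are made up for this illustration and are not part of oneDNN or benchdnn.

```python
# Timings in ms, copied from the table in the PR description (c8g, 32 threads).
# Keys are the square dimension n of the nxn:nxn shape.
timings_ms = {
    16: {"brg": 0.00387, "acl": 0.0251, "jit": 0.00433},
    64: {"brg": 0.00394, "acl": 0.0386, "jit": 0.0430},
    128: {"brg": 0.00518, "acl": 0.0384, "jit": 0.316},
    512: {"brg": 0.0803, "acl": 0.0801, "jit": 10.7},
    1024: {"brg": 0.625, "acl": 0.393, "jit": 72.9},
    4096: {"brg": 41.6, "acl": 23.6, "jit": 4408},
}


def brg_over_acl(n):
    """Ratio of brgemm time to ACL time; values above 1.0 mean ACL is faster."""
    row = timings_ms[n]
    return row["brg"] / row["acl"]


def first_size_where_acl_wins():
    """Smallest measured size at which ACL is at least marginally faster."""
    for n in sorted(timings_ms):
        if brg_over_acl(n) > 1.0:
            return n
    return None


if __name__ == "__main__":
    for n in sorted(timings_ms):
        print(f"{n}x{n}:{n}x{n}  brg/acl = {brg_over_acl(n):.2f}")
    print("ACL first ahead at n =", first_size_where_acl_wins())
```

With these numbers the two implementations are effectively tied at 512 (ratio about 1.00), and ACL is clearly ahead at 1024 (about 1.59) and 4096 (about 1.76). This is a single run on one machine, so the exact crossover should be treated as approximate.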
That makes sense. Can we put in a heuristic to dispatch based on size+bf16 then (along with a comment)?
We need to tread cautiously with heuristics here, because I think jit_bf16_matmul is better than brgemm matmul; it just doesn't support many memory formats. The picture is quite complicated and fast-moving, so it may not be worth the complexity of a heuristic until the dust settles.
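For reference, the kind of size+bf16 dispatch discussed in this thread could be as small as the sketch below. It is an illustration only: `MatmulProblem`, `pick_impl`, and `ACL_CROSSOVER_DIM` are hypothetical names, the threshold is read off the single c8g table in the description, and the real dispatch lives in the C++ implementation list rather than in Python.

```python
from dataclasses import dataclass

# Hypothetical crossover dimension, taken from the c8g table in the description:
# gemm:acl is on par with brgemm at 512x512:512x512 and faster beyond it.
ACL_CROSSOVER_DIM = 512


@dataclass(frozen=True)
class MatmulProblem:
    """Minimal problem descriptor for this sketch (not a oneDNN type)."""
    m: int
    k: int
    n: int
    dtype: str  # e.g. "bf16" or "f32"


def pick_impl(p: MatmulProblem) -> str:
    """Return a label for the implementation this sketch would try first."""
    if p.dtype != "bf16":
        # The heuristic only targets bf16; other data types keep the
        # existing dispatch order.
        return "unchanged"
    # Compare total work (M*K*N) against the cube of the crossover dimension.
    if p.m * p.k * p.n >= ACL_CROSSOVER_DIM ** 3:
        return "gemm:acl"
    return "brg:sve_128"


if __name__ == "__main__":
    print(pick_impl(MatmulProblem(128, 128, 128, "bf16")))
    print(pick_impl(MatmulProblem(1024, 1024, 1024, "bf16")))
```

The sketch deliberately ignores memory formats and jit_bf16_matmul, which are exactly the factors that make a real heuristic harder to get right.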