oneDNN
src: cpu: aarch64: add ACL s8:s8:f32 matmul
Description
This PR adds an s8:s8:f32 matmul implementation using `arm_compute::NEGEMMLowpMatrixMultiplyCore`. For moderately sized problems the new implementation is
- several orders of magnitude faster than `gemm:jit`
- ~3-4x faster than the pure f32 `gemm:acl`

We also bump the minimum ACL version to 24.04, because it is the first version that supports runtime `arm_compute::QuantizationInfo` and contains the necessary s8:s8:f32 kernels. 24.04 isn't released yet, but will be in the coming days.
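To make the data-type triple concrete, here is an illustrative sketch (plain Python, not oneDNN or ACL code) of what an s8:s8:f32 matmul computes: both inputs are signed 8-bit integers with quantization scales, products are accumulated in a wide integer accumulator, and the result is dequantized to f32 once at the end. The function name and per-tensor-scale scheme are assumptions for illustration only.

```python
def s8s8f32_matmul(a, b, scale_a, scale_b):
    """a: MxK int8 values, b: KxN int8 values; returns an MxN list of floats.

    Illustrative only: per-tensor scales, int accumulation, one final
    dequantization step (acc * scale_a * scale_b).
    """
    m, k, n = len(a), len(a[0]), len(b[0])
    out = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0  # wide (e.g. int32) accumulator for s8*s8 products
            for p in range(k):
                acc += a[i][p] * b[p][j]
            out[i][j] = acc * scale_a * scale_b  # dequantize to f32 once
    return out

a = [[1, -2], [3, 4]]
b = [[5, 6], [7, -8]]
print(s8s8f32_matmul(a, b, 0.5, 0.25))
# → [[-1.125, 2.75], [5.375, -1.75]]
```

Keeping the accumulation in integers and applying the scales once per output element is what makes this path cheaper than running the whole GEMM in f32.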
Checklist
General
- [x] Do all unit and benchdnn tests (`make test` and `make test_benchdnn_*`) pass locally for each commit?
- [x] Have you formatted the code using clang-format?
Performance improvements
- [x] Have you submitted performance data that demonstrates performance improvements?
Thanks for the review!
Do we need any additional test cases in benchdnn?
There seems to be good existing coverage in `tests/benchdnn/inputs/matmul/test_matmul_ci`, which I used while developing.
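For reference, a benchdnn invocation along these lines exercises the s8:s8:f32 configuration directly; the exact flag syntax may differ between oneDNN versions, and the problem shape here is arbitrary.

```shell
# Hypothetical spot-check of the int8-input, f32-output matmul path;
# --dt takes src:wei:dst data types in recent benchdnn versions.
./benchdnn --matmul --engine=cpu --dt=s8:s8:f32 64x128:128x64
```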
This one has a conflict. Could you please resolve it?