oneDNN
src: cpu: aarch64: add ACL s8:s8:f32 matmul
Description
This PR adds an s8:s8:f32 matmul implementation using `arm_compute::NEGEMMLowpMatrixMultiplyCore`. For moderately sized problems the new implementation is
- several orders of magnitude faster than `gemm:jit`
- ~3-4x faster than the pure f32 `gemm:acl`

We also bump the minimum ACL version to 24.04, because it is the first version that supports runtime `arm_compute::QuantizationInfo` and contains the necessary s8:s8:f32 kernels. 24.04 isn't released yet, but will be in the coming days.
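To make the data-type triple concrete, here is an illustrative sketch (plain Python, not oneDNN or ACL code) of what an s8:s8:f32 matmul computes: both inputs are signed 8-bit integers with quantization scales, products are accumulated in a wide integer accumulator, and the result is dequantized to f32 once at the end. The function name and per-tensor-scale scheme are assumptions for illustration only.

```python
def s8s8f32_matmul(a, b, scale_a, scale_b):
    """a: MxK int8 values, b: KxN int8 values; returns an MxN list of floats.

    Illustrative only: per-tensor scales, int accumulation, one final
    dequantization step (acc * scale_a * scale_b).
    """
    m, k, n = len(a), len(a[0]), len(b[0])
    out = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0  # wide (e.g. int32) accumulator for s8*s8 products
            for p in range(k):
                acc += a[i][p] * b[p][j]
            out[i][j] = acc * scale_a * scale_b  # dequantize to f32 once
    return out

a = [[1, -2], [3, 4]]
b = [[5, 6], [7, -8]]
print(s8s8f32_matmul(a, b, 0.5, 0.25))
# → [[-1.125, 2.75], [5.375, -1.75]]
```

Keeping the accumulation in integers and applying the scales once per output element is what makes this path cheaper than running the whole GEMM in f32.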
Checklist
General
- [x] Do all unit and benchdnn tests (`make test` and `make test_benchdnn_*`) pass locally for each commit?
- [x] Have you formatted the code using clang-format?
Performance improvements
- [x] Have you submitted performance data that demonstrates performance improvements?
Thanks for the review!
Do we need any additional test cases in benchdnn?
There seems to be good existing coverage in `tests/benchdnn/inputs/matmul/test_matmul_ci`, which I used while developing.
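For reference, a benchdnn invocation along these lines exercises the s8:s8:f32 configuration directly; the exact flag syntax may differ between oneDNN versions, and the problem shape here is arbitrary.

```shell
# Hypothetical spot-check of the int8-input, f32-output matmul path;
# --dt takes src:wei:dst data types in recent benchdnn versions.
./benchdnn --matmul --engine=cpu --dt=s8:s8:f32 64x128:128x64
```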
This one has a conflict. Could you please resolve it?