oneDNN
Fuse sum post op acl matmul
Description
Fuse the sum post op in acl matmul by setting the accumulate flag to true in `arm_compute::GEMMInfo`. This speeds up the post op and avoids allocating a temporary dst-sized tensor.
For example,

```
OMP_NUM_THREADS=16 ./benchdnn --mode=p --matmul --attr-post-ops=sum,sum+relu:0 1024x256:256x2048
```

is ~12% faster and uses 1.2kB less peak memory.
We also appended `_for_sum` to the `use_dst_acc` flag to stop it being confused with the `dst_acc` used for transposing.
We also changed the way we handle fused eltwise (as well as the new sum) to fix segfaults when binary post ops follow fused ops.
This PR includes the commit from #1889 because it builds on top of it.
Checklist
General
- [x] Do all unit and benchdnn tests (`make test` and `make test_benchdnn_*`) pass locally for each commit?
- [x] Have you formatted the code using clang-format?
Performance improvements
- [x] Have you submitted performance data that demonstrates performance improvements?
Please don't merge this yet; we've identified a small issue and a fix is on the way.
Fixed the failure; happy for this to be merged.