[MLAS] AArch64 SQNBitGemm CompInt8 initial multi-row implementation

Open: edgchen1 opened this issue 1 year ago • 1 comment

Description

Update the AArch64 SQNBitGemm CompInt8 kernels to process the output matrix in tiles. For example, dividing the output into 2x2 tiles allows us to compute four elements of the output with one read of two rows of A and two columns of B.
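The 2x2 tiling idea can be sketched in plain scalar C++. This is only an illustration of the data reuse, not the MLAS kernel: it omits the N-bit block quantization, scales, zero points, and NEON intrinsics, and the names `GemmNaive` and `GemmTiled2x2` are hypothetical. It assumes A is stored row-major (M x K), B is stored one column after another (N columns of length K), and M and N are even.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Reference version: each output element re-reads one row of A and one column of B.
// BCols holds B column by column, so column n starts at BCols[n * K] (assumed layout).
std::vector<int32_t> GemmNaive(const std::vector<int8_t>& A,
                               const std::vector<int8_t>& BCols,
                               size_t M, size_t N, size_t K) {
    std::vector<int32_t> C(M * N, 0);
    for (size_t m = 0; m < M; ++m) {
        for (size_t n = 0; n < N; ++n) {
            int32_t acc = 0;
            for (size_t k = 0; k < K; ++k) {
                acc += int32_t(A[m * K + k]) * int32_t(BCols[n * K + k]);
            }
            C[m * N + n] = acc;
        }
    }
    return C;
}

// 2x2 tiled version: one pass over two rows of A and two columns of B
// produces four output elements. Simplification: M and N must be even.
std::vector<int32_t> GemmTiled2x2(const std::vector<int8_t>& A,
                                  const std::vector<int8_t>& BCols,
                                  size_t M, size_t N, size_t K) {
    assert(M % 2 == 0 && N % 2 == 0);
    std::vector<int32_t> C(M * N, 0);
    for (size_t m = 0; m < M; m += 2) {
        const int8_t* a0 = A.data() + m * K;  // row m of A
        const int8_t* a1 = a0 + K;            // row m + 1 of A
        for (size_t n = 0; n < N; n += 2) {
            const int8_t* b0 = BCols.data() + n * K;  // column n of B
            const int8_t* b1 = b0 + K;                // column n + 1 of B
            int32_t c00 = 0, c01 = 0, c10 = 0, c11 = 0;
            for (size_t k = 0; k < K; ++k) {
                // Each loaded value is used twice: 4 loads feed 4 multiply-adds.
                const int32_t a0k = a0[k], a1k = a1[k];
                const int32_t b0k = b0[k], b1k = b1[k];
                c00 += a0k * b0k;
                c01 += a0k * b1k;
                c10 += a1k * b0k;
                c11 += a1k * b1k;
            }
            C[m * N + n] = c00;
            C[m * N + n + 1] = c01;
            C[(m + 1) * N + n] = c10;
            C[(m + 1) * N + n + 1] = c11;
        }
    }
    return C;
}
```

In the naive loop, every multiply-add needs two loads (one from A, one from B). In the tiled loop, four loads feed four multiply-adds, so the number of loads per output element is halved. This reuse is the main motivation for tiling when M > 1.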

Also moved some code into separate files, as it was getting too big for a single file.

Measurements

Baseline: 9eb1c2a7a3 Updated: e35f2b34b1

Microbenchmarks

Run on Azure VM (ARM64 Linux) with compute type: CompInt8, number of threads: 4, M:128/K:4096/N:4096

| blklen | symmetric | baseline time (ns) | updated time (ns) |
|---|---|---|---|
| 16 | 1 | 76617120 | 51631561 |
| 16 | 0 | 83473648 | 58761985 |
| 32 | 1 | 35161580 | 29889143 |
| 32 | 0 | 42832905 | 33211246 |
| 64 | 1 | 35889788 | 33765620 |
| 64 | 0 | 38865041 | 31249219 |

E2E test

Run onnxruntime-genai benchmark with Phi-3 mini using 4 threads.

| machine | baseline prompt processing (tokens/s) | updated prompt processing (tokens/s) |
|---|---|---|
| Samsung Galaxy S21 | 11.95 | 15.86 |
| Surface Pro 9 | 21.90 | 27.01 |
| Azure VM | 16.06 | 18.90 |

Motivation and Context

Improve prompt processing (M>1) performance.

edgchen1, Jun 27 '24 18:06

In the microbenchmark measurements, why is blklen 64 asymmetric faster than symmetric?

edit: in 3d8fe4d symmetric is faster.

edgchen1, Jun 29 '24 02:06