[MLAS] AArch64 SQNBitGemm CompInt8 initial multi-row implementation
Description
Update the AArch64 SQNBitGemm CompInt8 kernels to process the matrix in tiles. For example, dividing the output into 2x2 tiles allows us to compute four output elements with one read of two rows of A and two columns of B.
Some code was also moved around, as it was getting too big for a single file.
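To illustrate the data reuse that tiling buys, here is a minimal scalar sketch of 2x2 output tiling. It is not the actual NEON CompInt8 kernel (which works on block-quantized int8 data with per-block scales); the function name and float types are hypothetical and chosen only to show the access pattern.

```cpp
// Hypothetical scalar sketch of 2x2 output tiling; not the actual MLAS NEON kernel.
// Per k step, two loads from A and two from B feed four accumulators, halving the
// loads per multiply-accumulate compared with computing one output element at a time.
#include <cstddef>

void Gemm2x2Tiled(const float* A, const float* B, float* C,
                  size_t M, size_t N, size_t K) {
    // Assumes M and N are multiples of 2 for brevity; row-major A (MxK), B (KxN), C (MxN).
    for (size_t m = 0; m < M; m += 2) {
        for (size_t n = 0; n < N; n += 2) {
            float c00 = 0.0f, c01 = 0.0f, c10 = 0.0f, c11 = 0.0f;
            for (size_t k = 0; k < K; ++k) {
                // One read of two rows of A and two columns of B ...
                const float a0 = A[(m + 0) * K + k];
                const float a1 = A[(m + 1) * K + k];
                const float b0 = B[k * N + (n + 0)];
                const float b1 = B[k * N + (n + 1)];
                // ... contributes to four output elements.
                c00 += a0 * b0; c01 += a0 * b1;
                c10 += a1 * b0; c11 += a1 * b1;
            }
            C[(m + 0) * N + (n + 0)] = c00; C[(m + 0) * N + (n + 1)] = c01;
            C[(m + 1) * N + (n + 0)] = c10; C[(m + 1) * N + (n + 1)] = c11;
        }
    }
}
```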
Measurements
Baseline: `9eb1c2a7a3`. Updated: `e35f2b34b1`.
Microbenchmarks
Run on an Azure VM (ARM64 Linux) with compute type CompInt8, 4 threads, and M=128, K=4096, N=4096 (a standalone benchmark sketch follows the results table).
| blklen | symmetric | baseline time (ns) | updated time (ns) |
|---|---|---|---|
| 16 | 1 | 76617120 | 51631561 |
| 16 | 0 | 83473648 | 58761985 |
| 32 | 1 | 35161580 | 29889143 |
| 32 | 0 | 42832905 | 33211246 |
| 64 | 1 | 35889788 | 33765620 |
| 64 | 0 | 38865041 | 31249219 |
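For reference, a standalone harness along these lines can reproduce the shape of this measurement (M=128, K=4096, N=4096). This is a hypothetical Google Benchmark sketch around the tiling example above, not the onnxruntime MLAS benchmark binary or its SQNBitGemm registration; the benchmark name and setup are assumptions.

```cpp
// Hypothetical Google Benchmark harness; not the actual onnxruntime MLAS microbenchmark.
// Assumes Gemm2x2Tiled() from the sketch above is visible in this translation unit.
#include <benchmark/benchmark.h>
#include <cstddef>
#include <vector>

static void BM_TiledGemm_M128_K4096_N4096(benchmark::State& state) {
  const size_t M = 128, K = 4096, N = 4096;
  std::vector<float> A(M * K, 0.5f), B(K * N, 0.25f), C(M * N, 0.0f);
  for (auto _ : state) {
    Gemm2x2Tiled(A.data(), B.data(), C.data(), M, N, K);
    benchmark::DoNotOptimize(C.data());  // keep the result live
  }
}
BENCHMARK(BM_TiledGemm_M128_K4096_N4096)->Unit(benchmark::kMillisecond);
BENCHMARK_MAIN();
```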
E2E test
Run the onnxruntime-genai benchmark with Phi-3 mini using 4 threads.
| machine | baseline prompt processing (tokens/s) | updated prompt processing (tokens/s) |
|---|---|---|
| Samsung Galaxy S21 | 11.95 | 15.86 |
| Surface Pro 9 | 21.90 | 27.01 |
| Azure VM | 16.06 | 18.90 |
Motivation and Context
Improve prompt processing (M>1) performance.
In the microbenchmark measurements, why is blklen 64 asymmetric faster than symmetric?
Edit: in 3d8fe4d, symmetric is faster.