[MLAS] AArch64 SQNBitGemm CompInt8 initial multi-row implementation
Description
Update the AArch64 SQNBitGemm CompInt8 kernels to process the matrix in tiles. For example, dividing the output into 2x2 tiles allows us to compute four output elements with one read of two rows of A and two columns of B.
Some code was also moved around, as it was getting too big for a single file.
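To illustrate the data reuse that tiling buys, here is a minimal scalar sketch of 2x2 output tiling. It is not the actual NEON CompInt8 kernel (which works on block-quantized int8 data with per-block scales); the function name and float types are hypothetical and chosen only to show the access pattern.

```cpp
// Hypothetical scalar sketch of 2x2 output tiling; not the actual MLAS NEON kernel.
// Per k step, two loads from A and two from B feed four accumulators, halving the
// loads per multiply-accumulate compared with computing one output element at a time.
#include <cstddef>

void Gemm2x2Tiled(const float* A, const float* B, float* C,
                  size_t M, size_t N, size_t K) {
    // Assumes M and N are multiples of 2 for brevity; row-major A (MxK), B (KxN), C (MxN).
    for (size_t m = 0; m < M; m += 2) {
        for (size_t n = 0; n < N; n += 2) {
            float c00 = 0.0f, c01 = 0.0f, c10 = 0.0f, c11 = 0.0f;
            for (size_t k = 0; k < K; ++k) {
                // One read of two rows of A and two columns of B ...
                const float a0 = A[(m + 0) * K + k];
                const float a1 = A[(m + 1) * K + k];
                const float b0 = B[k * N + (n + 0)];
                const float b1 = B[k * N + (n + 1)];
                // ... contributes to four output elements.
                c00 += a0 * b0; c01 += a0 * b1;
                c10 += a1 * b0; c11 += a1 * b1;
            }
            C[(m + 0) * N + (n + 0)] = c00; C[(m + 0) * N + (n + 1)] = c01;
            C[(m + 1) * N + (n + 0)] = c10; C[(m + 1) * N + (n + 1)] = c11;
        }
    }
}
```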
Measurements
Baseline: `9eb1c2a7a3`. Updated: `e35f2b34b1`.
Microbenchmarks
Run on an Azure VM (ARM64 Linux) with compute type CompInt8, 4 threads, and M=128, K=4096, N=4096 (a standalone benchmark sketch follows the results table).
| blklen | symmetric | baseline time (ns) | updated time (ns) |
|---|---|---|---|
| 16 | 1 | 76617120 | 51631561 |
| 16 | 0 | 83473648 | 58761985 |
| 32 | 1 | 35161580 | 29889143 |
| 32 | 0 | 42832905 | 33211246 |
| 64 | 1 | 35889788 | 33765620 |
| 64 | 0 | 38865041 | 31249219 |
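For reference, a standalone harness along these lines can reproduce the shape of this measurement (M=128, K=4096, N=4096). This is a hypothetical Google Benchmark sketch around the tiling example above, not the onnxruntime MLAS benchmark binary or its SQNBitGemm registration; the benchmark name and setup are assumptions.

```cpp
// Hypothetical Google Benchmark harness; not the actual onnxruntime MLAS microbenchmark.
// Assumes Gemm2x2Tiled() from the sketch above is visible in this translation unit.
#include <benchmark/benchmark.h>
#include <cstddef>
#include <vector>

static void BM_TiledGemm_M128_K4096_N4096(benchmark::State& state) {
  const size_t M = 128, K = 4096, N = 4096;
  std::vector<float> A(M * K, 0.5f), B(K * N, 0.25f), C(M * N, 0.0f);
  for (auto _ : state) {
    Gemm2x2Tiled(A.data(), B.data(), C.data(), M, N, K);
    benchmark::DoNotOptimize(C.data());  // keep the result live
  }
}
BENCHMARK(BM_TiledGemm_M128_K4096_N4096)->Unit(benchmark::kMillisecond);
BENCHMARK_MAIN();
```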
E2E test
Run the onnxruntime-genai benchmark with Phi-3 mini using 4 threads.
| machine | baseline prompt processing (tokens/s) | updated prompt processing (tokens/s) |
|---|---|---|
| Samsung Galaxy S21 | 11.95 | 15.86 |
| Surface Pro 9 | 21.90 | 27.01 |
| Azure VM | 16.06 | 18.90 |
Motivation and Context
Improve prompt processing (M>1) performance.
In the microbenchmark measurements, why is blklen 64 asymmetric faster than symmetric?
Edit: in 3d8fe4d, symmetric is faster.