candle
[AArch64] Quantized MatMul performance improvement
This change implements a new matmul kernel based on the Armv8.6 i8mm instructions. Since it requires a nightly compiler, the more performant version is opt-in via the new arm-nightly-feat feature flag.
I have also added the Armv8.4 dotprod instructions under this flag.
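For illustration only (this is not the kernel added by the change), the sketch below shows how the Armv8.4 SDOT instruction, exposed in Rust as the `vdotq_s32` NEON intrinsic, accumulates 4-element i8 dot products, which is the building block of a quantized matmul inner loop; the i8mm path follows the same idea with the SMMLA intrinsic (`vmmlaq_s32`), producing 2x2 i32 tiles per instruction. Depending on the toolchain, these intrinsics may still be nightly-gated, which is why the feature flag above requires a nightly compiler. The helper names here (`dot_i8`, `sdot_example`) are made up for the example.

```rust
// Minimal sketch of an i8 dot product using the Armv8.4 dotprod (SDOT)
// NEON intrinsic. Only compiles for aarch64 targets.
#[cfg(target_arch = "aarch64")]
mod sdot_example {
    use core::arch::aarch64::*;

    /// Dot product of two i8 slices whose length is a multiple of 16.
    ///
    /// # Safety
    /// Must only be called when the `dotprod` target feature is available.
    #[target_feature(enable = "dotprod")]
    pub unsafe fn dot_i8(a: &[i8], b: &[i8]) -> i32 {
        assert_eq!(a.len(), b.len());
        assert_eq!(a.len() % 16, 0);
        let mut acc = vdupq_n_s32(0);
        for i in (0..a.len()).step_by(16) {
            let va = vld1q_s8(a.as_ptr().add(i)); // 16 x i8
            let vb = vld1q_s8(b.as_ptr().add(i)); // 16 x i8
            // SDOT: each of the 4 i32 lanes accumulates a 4-element i8 dot product.
            acc = vdotq_s32(acc, va, vb);
        }
        vaddvq_s32(acc) // horizontal sum of the 4 partial sums
    }
}

#[cfg(target_arch = "aarch64")]
fn main() {
    if std::arch::is_aarch64_feature_detected!("dotprod") {
        let a = vec![1i8; 64];
        let b = vec![2i8; 64];
        // Safe to call: we just verified dotprod support at runtime.
        let d = unsafe { sdot_example::dot_i8(&a, &b) };
        assert_eq!(d, 128);
    }
}

#[cfg(not(target_arch = "aarch64"))]
fn main() {}
```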
Performance improvement:
- +10% with only dotprod enabled (measured on LLaMa2)
- +40-60% for the i8mm-based matmul (measured on quantized Whisper; different cores produced different results)