Enable SME for sgemm and sbgemm through KleidiAI
Description
Enables Arm® KleidiAI™ SME kernels for MLAS sgemm and sbgemm functions.
Motivation and Context
These kernels provide performance improvements on SME-enabled devices. We see a performance improvement of 1.2x-1.8x on onnxruntime_perf_test for the following Geekbench models on M4:
| Model | Speedup |
| --- | --- |
| deeplabv3_mobilenetv2_f16.onnx | 1.79x |
| bert_tiny_f16.onnx | 1.47x |
| deeplabv3_mobilenetv2_f32.onnx | 1.43x |
| mobilenetv1_ssd_f16.onnx | 1.29x |
| mobilenet_v1_f32.onnx | 1.28x |
| mobilenetv1_ssd_f32.onnx | 1.26x |
| de_efficientnetlitev3_f16.onnx | 1.25x |
| mobilenet_v1_f16.onnx | 1.23x |
Can workflows be approved please?
Would you mind sharing some measurements to give an idea of how much these changes improve performance?
/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline, Windows GPU Doc Gen CI Pipeline,Windows x64 QNN CI Pipeline
Azure Pipelines successfully started running 5 pipeline(s).
I've added performance figures to the PR description.
@MichaelTylerArm your branch has conflicts. Can it be updated? Thanks!
Hi George, Hariharan, apologies for the delay in responding to the above. There have been a few developments since this PR was reviewed. We have a new merge candidate under a proposed MLAS architectural change that was communicated to Microsoft. Following that initial communication there has been additional feedback, with a proposal to create a struct of function pointers that may help alleviate the MlasGemmPackB bloat concern described above. I understand Ronan on our side is looking to open a discussion to firm up this proposal for both ARM and MSFT. We will take the additional comments above on board and work them into our new branch. In the meantime, I propose we close this PR pending the new PR reflecting the agreed changes. Thank you both.
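For illustration, a minimal sketch of what a "struct of function pointers" dispatch could look like, assuming the idea is to route packing and compute through a per-backend table so the shared MlasGemmPackB path does not accumulate per-platform branches. All names here (`SgemmDispatch`, `GetSgemmDispatch`, the reference routines) are hypothetical and are not the actual MLAS API; this is only a sketch of the pattern, not the agreed design.

```cpp
// Hypothetical sketch of per-backend GEMM dispatch via a struct of
// function pointers. Names are illustrative, not the real MLAS interface.
#include <cstddef>
#include <vector>

struct SgemmDispatch {
    size_t (*PackedBSize)(size_t N, size_t K);
    void (*PackB)(float* PackedB, const float* B, size_t ldb, size_t N, size_t K);
    void (*Gemm)(size_t M, size_t N, size_t K,
                 const float* A, size_t lda,
                 const float* PackedB,
                 float* C, size_t ldc);
};

// Naive reference backend standing in for a platform-specific (e.g. SME)
// implementation; a real backend would supply tuned packing and kernels.
static size_t RefPackedBSize(size_t N, size_t K) { return N * K; }

static void RefPackB(float* PackedB, const float* B, size_t ldb, size_t N, size_t K) {
    // Repack B (K x N, row-major, leading dimension ldb) into column-major order.
    for (size_t n = 0; n < N; ++n)
        for (size_t k = 0; k < K; ++k)
            PackedB[n * K + k] = B[k * ldb + n];
}

static void RefGemm(size_t M, size_t N, size_t K,
                    const float* A, size_t lda,
                    const float* PackedB,
                    float* C, size_t ldc) {
    for (size_t m = 0; m < M; ++m)
        for (size_t n = 0; n < N; ++n) {
            float acc = 0.0f;
            for (size_t k = 0; k < K; ++k)
                acc += A[m * lda + k] * PackedB[n * K + k];
            C[m * ldc + n] = acc;
        }
}

// In practice the table would be selected once from runtime feature
// detection (e.g. SME available or not); here we always use the reference.
static const SgemmDispatch kReferenceDispatch{RefPackedBSize, RefPackB, RefGemm};
static const SgemmDispatch* GetSgemmDispatch() { return &kReferenceDispatch; }

int main() {
    const size_t M = 2, N = 3, K = 4;
    std::vector<float> A(M * K, 1.0f), B(K * N, 2.0f), C(M * N, 0.0f);

    const SgemmDispatch* d = GetSgemmDispatch();
    std::vector<float> packedB(d->PackedBSize(N, K));
    d->PackB(packedB.data(), B.data(), /*ldb=*/N, N, K);
    d->Gemm(M, N, K, A.data(), /*lda=*/K, packedB.data(), C.data(), /*ldc=*/N);
    // Each element of C should be K * 1 * 2 = 8.
    return 0;
}
```

The appeal of this pattern is that callers only see the dispatch struct, so adding or removing a backend (SME, NEON, reference) changes the table that is selected at startup rather than the body of the shared packing entry points.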
Assuming this PR is no longer relevant after https://github.com/microsoft/onnxruntime/pull/25187, closing it for now. If it is still relevant, please re-open. Thanks.