Enable SME for sgemm and sbgemm through KleidiAI
Description
Enables Arm® KleidiAI™ SME kernels for MLAS sgemm and sbgemm functions.
Motivation and Context
These kernels provide performance improvements on SME-enabled devices. We see a performance improvement of 1.2x-1.8x on onnxruntime_perf_test for the following Geekbench models on M4:
| Model | Speedup |
| --- | --- |
| deeplabv3_mobilenetv2_f16.onnx | 1.79x |
| bert_tiny_f16.onnx | 1.47x |
| deeplabv3_mobilenetv2_f32.onnx | 1.43x |
| mobilenetv1_ssd_f16.onnx | 1.29x |
| mobilenet_v1_f32.onnx | 1.28x |
| mobilenetv1_ssd_f32.onnx | 1.26x |
| de_efficientnetlitev3_f16.onnx | 1.25x |
| mobilenet_v1_f16.onnx | 1.23x |
Can workflows be approved please?
Would you mind sharing some measurements to give an idea of how much these changes improve performance?
/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline, Windows GPU Doc Gen CI Pipeline,Windows x64 QNN CI Pipeline
Azure Pipelines successfully started running 5 pipeline(s).
I've added performance figures to the PR description.
@MichaelTylerArm your branch has conflicts. Can it be updated? Thanks!
Hi George, Hariharan, apologies for the delay in responding to the above. There have been a few developments since this PR was reviewed. We have a new merge candidate under a proposed MLAS architectural change that was communicated to Microsoft. Following that initial communication there has been additional feedback, with a proposal to create a struct of function pointers that may help alleviate the MlasGemmPackB bloat concern described above. I understand Ronan on our side is looking to open a discussion to firm up this proposal for both ARM and MSFT. We will take the additional comments above on board and work them into our new branch. In the meantime, I propose we close this PR pending the new PR reflecting the agreed changes. Thank you both.
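For illustration, a minimal sketch of what a "struct of function pointers" dispatch could look like, assuming the idea is to route packing and compute through a per-backend table so the shared MlasGemmPackB path does not accumulate per-platform branches. All names here (`SgemmDispatch`, `GetSgemmDispatch`, the reference routines) are hypothetical and are not the actual MLAS API; this is only a sketch of the pattern, not the agreed design.

```cpp
// Hypothetical sketch of per-backend GEMM dispatch via a struct of
// function pointers. Names are illustrative, not the real MLAS interface.
#include <cstddef>
#include <vector>

struct SgemmDispatch {
    size_t (*PackedBSize)(size_t N, size_t K);
    void (*PackB)(float* PackedB, const float* B, size_t ldb, size_t N, size_t K);
    void (*Gemm)(size_t M, size_t N, size_t K,
                 const float* A, size_t lda,
                 const float* PackedB,
                 float* C, size_t ldc);
};

// Naive reference backend standing in for a platform-specific (e.g. SME)
// implementation; a real backend would supply tuned packing and kernels.
static size_t RefPackedBSize(size_t N, size_t K) { return N * K; }

static void RefPackB(float* PackedB, const float* B, size_t ldb, size_t N, size_t K) {
    // Repack B (K x N, row-major, leading dimension ldb) into column-major order.
    for (size_t n = 0; n < N; ++n)
        for (size_t k = 0; k < K; ++k)
            PackedB[n * K + k] = B[k * ldb + n];
}

static void RefGemm(size_t M, size_t N, size_t K,
                    const float* A, size_t lda,
                    const float* PackedB,
                    float* C, size_t ldc) {
    for (size_t m = 0; m < M; ++m)
        for (size_t n = 0; n < N; ++n) {
            float acc = 0.0f;
            for (size_t k = 0; k < K; ++k)
                acc += A[m * lda + k] * PackedB[n * K + k];
            C[m * ldc + n] = acc;
        }
}

// In practice the table would be selected once from runtime feature
// detection (e.g. SME available or not); here we always use the reference.
static const SgemmDispatch kReferenceDispatch{RefPackedBSize, RefPackB, RefGemm};
static const SgemmDispatch* GetSgemmDispatch() { return &kReferenceDispatch; }

int main() {
    const size_t M = 2, N = 3, K = 4;
    std::vector<float> A(M * K, 1.0f), B(K * N, 2.0f), C(M * N, 0.0f);

    const SgemmDispatch* d = GetSgemmDispatch();
    std::vector<float> packedB(d->PackedBSize(N, K));
    d->PackB(packedB.data(), B.data(), /*ldb=*/N, N, K);
    d->Gemm(M, N, K, A.data(), /*lda=*/K, packedB.data(), C.data(), /*ldc=*/N);
    // Each element of C should be K * 1 * 2 = 8.
    return 0;
}
```

The appeal of this pattern is that callers only see the dispatch struct, so adding or removing a backend (SME, NEON, reference) changes the table that is selected at startup rather than the body of the shared packing entry points.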
Assuming this PR is no longer relevant after https://github.com/microsoft/onnxruntime/pull/25187, closing it for now. If it is still relevant, please re-open. Thanks.