
Adding support for SME1 GEMM FP32 kernel

vgundlur opened this pull request 9 months ago • 7 comments

Adds SME1 support for the FP32 GEMM kernel.

vgundlur avatar Feb 18 '25 09:02 vgundlur

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

google-cla[bot] avatar Feb 18 '25 09:02 google-cla[bot]

Could someone please advise on the next steps for this PR?

vgundlur avatar Feb 24 '25 07:02 vgundlur

We have this SME2 kernel already: https://github.com/google/XNNPACK/blob/master/src/pf32-gemm/pf32-gemm-32x32-minmax-neonsme2.c

If the only difference is multi-vector load/store instructions, we'd rather avoid having two almost identical kernels coming from two very different sources with different support arrangements.

Can you please look into figuring out a way to reconcile these two codepaths? Maybe send your kernel as a PR to KleidiAI, and then we can use it the way we pull in the above kernel?

dsharlet avatar Mar 27 '25 20:03 dsharlet

This is just a wrapper?
`xnn_pf32_gemm_minmax_ukernel_32x32__neonsme` that calls `xnn_pf32_gemm_minmax__asm_aarch64_neonsme`?

```c
// Wraps the xnn_pf32_gemm_minmax__asm_aarch64_neonsme
// GEMM microkernel with a name that is compatible with our tooling.
void xnn_pf32_gemm_minmax_ukernel_32x32__neonsme(
    size_t m, size_t n, size_t k,
    const void* lhs_packed, const void* rhs_packed,
    float* dst, size_t dst_stride_row, size_t dst_stride_col,
    union xnn_f32_minmax_params
        minmax_params[XNN_RESTRICT XNN_MIN_ELEMENTS(1)]) {
  xnn_pf32_gemm_minmax__asm_aarch64_neonsme(
      lhs_packed, rhs_packed, dst, k / sizeof(float),
      &minmax_params->scalar.max, &minmax_params->scalar.min,
      m, n, NULL, 0, dst_stride_row);
}
```

I suspect xnn_pf32_gemm_minmax__asm_aarch64_neonsme requires KleidiAI, so this will fail to build with KleidiAI disabled.

fbarchard avatar Apr 02 '25 02:04 fbarchard

Hi @fbarchard ,

Yes, xnn_pf32_gemm_minmax_ukernel_32x32__neonsme is a wrapper that calls xnn_pf32_gemm_minmax__asm_aarch64_neonsme. Also, xnn_pf32_gemm_minmax__asm_aarch64_neonsme does not require KleidiAI, as it is available in source form within src/pf32-gemm/gen/pf32-gemm-32x32-minmax-asm-aarch64-neonsme.S.

However, we saw a few build failures when KleidiAI was disabled; fixes for those failures have been added.

vgundlur avatar Apr 04 '25 06:04 vgundlur

@vgundlur can you please address the feedback I raised in this comment?

We have this SME2 kernel already: https://github.com/google/XNNPACK/blob/master/src/pf32-gemm/pf32-gemm-32x32-minmax-neonsme2.c

If the only difference is multi-vector load/store instructions, we'd rather avoid having two almost identical kernels coming from two very different sources with different support arrangements.

Can you please look into figuring out a way to reconcile these two codepaths? Maybe send your kernel as a PR to KleidiAI, and then we can use it the way we pull in the above kernel?

AFAICT these kernels are nearly identical; what makes them different? Why should we have both of these kernels? And if we really need both, can we do it in a way that doesn't copy-paste a large block of assembly source code? (It might look like XNNPACK has a lot of copy/pasted kernels, but these are generated from the same source code wherever possible.)

dsharlet avatar Jun 03 '25 07:06 dsharlet

@dsharlet Thank you for commenting on the PR, and sorry for the delay in responding to your query. Agreed that the two kernels are identical except for some fixes for the unsupported instructions.

We need both versions of these kernels: one for platforms supporting SME version 1 and one for version 2. KleidiAI supports SME2, and we are pushing for SME1 support. We did try contributing directly to KleidiAI, but there are restrictions on contributing kernels to that repository, so we were unable to push there. Please see the details here: https://gitlab.arm.com/kleidi/kleidiai/-/blob/main/CONTRIBUTING.md

We will try including the original KleidiAI implementation in our file and replacing the unsupported instructions via a macro that substitutes supported ones. We hope this will be acceptable; if not, please suggest alternative approaches.

vgundlur avatar Jun 04 '25 07:06 vgundlur

@dsharlet, Arm has pushed an SME1 GEMM kernel to KleidiAI; we have pulled that kernel in and updated our PR. Please help review the change and share your comments.

vgundlur avatar Jun 25 '25 09:06 vgundlur

Resolved the conflict caused by a merge into master 11 hours ago.

vgundlur avatar Jul 30 '25 09:07 vgundlur

@dsharlet, could you please help move this PR forward?

vgundlur avatar Aug 01 '25 06:08 vgundlur

This is failing tests in the same way that was fixed for SME2 kernels by this change: https://github.com/google/XNNPACK/pull/8750/files#diff-e7b59d3d757d18b4fff21fe85867107abe17cf9b324f474b79fbb39ca380ec6c

I think you just need to adjust that new logic in generate-gemm-test.py to work for both SME and SME2.
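The adjustment described above might look something like the following sketch. The helper name and the suffix-matching approach here are hypothetical (the real logic lives in generate-gemm-test.py and may be structured differently); the point is only that a check written for the literal `neonsme2` suffix needs to also recognize plain `neonsme`.

```python
import re

# Hypothetical helper: decide whether a kernel name targets an SME
# variant, and which one. The actual logic in generate-gemm-test.py may
# differ; this only illustrates extending a "neonsme2"-style check so it
# covers both SME and SME2 kernel names.
def sme_isa(kernel_name):
    match = re.search(r"neonsme(2?)\b", kernel_name)
    if match is None:
        return None  # not an SME kernel
    return "sme2" if match.group(1) == "2" else "sme"

# Example kernel names taken from this thread:
print(sme_isa("xnn_pf32_gemm_minmax_ukernel_32x32__neonsme"))  # sme
print(sme_isa("pf32-gemm-32x32-minmax-neonsme2"))              # sme2
print(sme_isa("f32_gemm_minmax_ukernel_6x8__neonfma"))         # None
```

With a single helper like this, the test generator can gate SME-specific handling on "is any SME variant" while still distinguishing SME1 from SME2 where the emitted checks differ.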

dsharlet avatar Aug 02 '25 04:08 dsharlet

@dsharlet, we have added support for SME in generate-gemm-test.py. Please review the changes.

vgundlur avatar Aug 04 '25 09:08 vgundlur

@dsharlet, could you please help merge this if there are no further comments?

vgundlur avatar Aug 07 '25 14:08 vgundlur