
Adding support for SME1 GEMM FP32 kernel

vgundlur opened this pull request 9 months ago • 7 comments

Adds SME1 support for the FP32 GEMM kernel.

vgundlur avatar Feb 18 '25 09:02 vgundlur

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

google-cla[bot] avatar Feb 18 '25 09:02 google-cla[bot]

Could someone please advise on the next steps for this PR?

vgundlur avatar Feb 24 '25 07:02 vgundlur

We have this SME2 kernel already: https://github.com/google/XNNPACK/blob/master/src/pf32-gemm/pf32-gemm-32x32-minmax-neonsme2.c

If the only difference is multi-vector load/store instructions, we'd rather avoid having two almost identical kernels coming from two very different sources with different support arrangements.

Can you please look into figuring out a way to reconcile these two codepaths? Maybe send your kernel as a PR to KleidiAI, and then we can use it the way we pull in the above kernel?

dsharlet avatar Mar 27 '25 20:03 dsharlet

This is just a wrapper?
`xnn_pf32_gemm_minmax_ukernel_32x32__neonsme` that calls `xnn_pf32_gemm_minmax__asm_aarch64_neonsme`?

```c
// Wraps the xnn_pf32_gemm_minmax__asm_aarch64_neonsme
// GEMM microkernel with a name that is compatible with our tooling.
void xnn_pf32_gemm_minmax_ukernel_32x32__neonsme(
    size_t m, size_t n, size_t k,
    const void* lhs_packed, const void* rhs_packed,
    float* dst, size_t dst_stride_row, size_t dst_stride_col,
    union xnn_f32_minmax_params
        minmax_params[XNN_RESTRICT XNN_MIN_ELEMENTS(1)]) {
  xnn_pf32_gemm_minmax__asm_aarch64_neonsme(
      lhs_packed, rhs_packed, dst, k / sizeof(float),
      &minmax_params->scalar.max, &minmax_params->scalar.min,
      m, n, NULL, 0, dst_stride_row);
}
```

I suspect xnn_pf32_gemm_minmax__asm_aarch64_neonsme requires KleidiAI, so this will fail to build with KleidiAI disabled.

fbarchard avatar Apr 02 '25 02:04 fbarchard

Hi @fbarchard ,

Yes, xnn_pf32_gemm_minmax_ukernel_32x32__neonsme is a wrapper that calls xnn_pf32_gemm_minmax__asm_aarch64_neonsme. Also, xnn_pf32_gemm_minmax__asm_aarch64_neonsme does not require KleidiAI, as it is available in source form within src/pf32-gemm/gen/pf32-gemm-32x32-minmax-asm-aarch64-neonsme.S.

However, we saw a few build failures when KleidiAI was disabled; fixes for those failures have been added.

vgundlur avatar Apr 04 '25 06:04 vgundlur

@vgundlur can you please address the feedback I raised in this comment?

We have this SME2 kernel already: https://github.com/google/XNNPACK/blob/master/src/pf32-gemm/pf32-gemm-32x32-minmax-neonsme2.c

If the only difference is multi-vector load/store instructions, we'd rather avoid having two almost identical kernels coming from two very different sources with different support arrangements.

Can you please look into figuring out a way to reconcile these two codepaths? Maybe send your kernel as a PR to KleidiAI, and then we can use it the way we pull in the above kernel?

AFAICT these kernels are nearly identical; what makes them different? Why should we have both of these kernels? And if we really need both, can we do it in a way that doesn't copy-paste a large block of assembly source code? (It might look like XNNPACK has a lot of copy/pasted kernels, but these are generated from the same source code wherever possible.)

dsharlet avatar Jun 03 '25 07:06 dsharlet

@dsharlet Thank you for commenting on the PR, and sorry for the delay in responding to your query. Agreed that the two kernels are identical except for some fixes for the unsupported instructions.

We need both versions of these kernels: one for platforms supporting SME version 1 and one for version 2. KleidiAI supports SME2, and we are pushing for SME1 support. We did try contributing directly to KleidiAI, but there are restrictions on contributing kernels to that repository, so we were unable to push there. Please see the details here: https://gitlab.arm.com/kleidi/kleidiai/-/blob/main/CONTRIBUTING.md

We will try including the original KleidiAI implementation in our file and replacing the unsupported instructions via a macro that substitutes supported ones. We hope this will be acceptable; if not, please suggest alternative approaches.

vgundlur avatar Jun 04 '25 07:06 vgundlur

@dsharlet, Arm has pushed an SME1 GEMM kernel to KleidiAI; we have pulled that kernel in and updated our PR. Please help review the change and share your comments.

vgundlur avatar Jun 25 '25 09:06 vgundlur

Resolved the conflict caused by a merge into master 11 hours ago.

vgundlur avatar Jul 30 '25 09:07 vgundlur

@dsharlet, could you please help move this PR forward?

vgundlur avatar Aug 01 '25 06:08 vgundlur

This is failing tests in the same way that was fixed for SME2 kernels by this change: https://github.com/google/XNNPACK/pull/8750/files#diff-e7b59d3d757d18b4fff21fe85867107abe17cf9b324f474b79fbb39ca380ec6c

I think you just need to adjust that new logic in generate-gemm-test.py to work for both SME and SME2.
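The adjustment described above might look something like the following sketch. The helper name and the suffix-matching approach here are hypothetical (the real logic lives in generate-gemm-test.py and may be structured differently); the point is only that a check written for the literal `neonsme2` suffix needs to also recognize plain `neonsme`.

```python
import re

# Hypothetical helper: decide whether a kernel name targets an SME
# variant, and which one. The actual logic in generate-gemm-test.py may
# differ; this only illustrates extending a "neonsme2"-style check so it
# covers both SME and SME2 kernel names.
def sme_isa(kernel_name):
    match = re.search(r"neonsme(2?)\b", kernel_name)
    if match is None:
        return None  # not an SME kernel
    return "sme2" if match.group(1) == "2" else "sme"

# Example kernel names taken from this thread:
print(sme_isa("xnn_pf32_gemm_minmax_ukernel_32x32__neonsme"))  # sme
print(sme_isa("pf32-gemm-32x32-minmax-neonsme2"))              # sme2
print(sme_isa("f32_gemm_minmax_ukernel_6x8__neonfma"))         # None
```

With a single helper like this, the test generator can gate SME-specific handling on "is any SME variant" while still distinguishing SME1 from SME2 where the emitted checks differ.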

dsharlet avatar Aug 02 '25 04:08 dsharlet

@dsharlet, we have added support for SME in generate-gemm-test.py. Please review the changes.

vgundlur avatar Aug 04 '25 09:08 vgundlur

@dsharlet, could you please help merge this if there are no further comments?

vgundlur avatar Aug 07 '25 14:08 vgundlur