
cpu: aarch64: KleidiAI int4 and fp32 kernels integration via BRGeMM oneDNN API

Open · Radu2k opened this pull request 9 months ago • 4 comments

Description

This pull request introduces and enables Arm® KleidiAI™ microkernels on AArch64 through the BRGeMM oneDNN API, in two commits that add the new functionality.

Specifically: cpu: aarch64: enable BRGeMM through oneDNN API for AArch64

  • AArch64 BRGeMM oneDNN API Enablement: Enables the BRGeMM oneDNN API route on AArch64, ensuring that these newly introduced kernels can be leveraged on AArch64 hardware for improved int4/fp32 matrix multiplication performance while preserving compatibility.

cpu: aarch64: integrate KleidiAI through oneDNN API

  • Integration of KleidiAI matmuls: Provides access through the BRGeMM interface to KleidiAI int4 channelwise, int4 groupwise, and fp32 kernels. It allows both full matrix multiplication and tile-based execution driven by vectors of (m_idx, n_idx) pairs, where m_idx and n_idx index tiles along M and N respectively, given pre-packed SRC (LHS) of shape MxK and WEI (RHS) of shape KxN (see the sketch after this list).
  • Integration of KleidiAI packing functions: Expands the oneDNN API Transform functionality, allowing fused int4 quantisation + packing for SRC (LHS) and int4/fp32 WEI (RHS) packing.
  • Documentation and benchdnn updates: Reflects the new KleidiAI integration and enables benchdnn testing of the fp32 KleidiAI kernels.
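
For illustration, a minimal sketch of the tile-based execution model described above, in plain C++ (not the actual oneDNN or KleidiAI entry points): the full M x N output is split into TM x TN tiles, and the caller requests any subset of tiles by their (m_idx, n_idx) coordinates. A plain fp32 reference loop stands in for the pre-packed int4/fp32 kernels.

```cpp
// Illustrative only: a reference computation of selected output tiles of
// C = A (MxK) * B (KxN). Each (m_idx, n_idx) pair selects one TM x TN block
// of the destination. Real kernels would run on pre-packed SRC/WEI buffers.
#include <algorithm>
#include <cstdint>
#include <vector>

struct TileIndex {
    int64_t m_idx; // tile row index along M
    int64_t n_idx; // tile column index along N
};

void run_tiles(const std::vector<float> &A, const std::vector<float> &B,
               std::vector<float> &C, int64_t M, int64_t N, int64_t K,
               int64_t TM, int64_t TN, const std::vector<TileIndex> &tiles) {
    for (const TileIndex &t : tiles) {
        const int64_t m0 = t.m_idx * TM, n0 = t.n_idx * TN;
        for (int64_t m = m0; m < std::min(m0 + TM, M); ++m)
            for (int64_t n = n0; n < std::min(n0 + TN, N); ++n) {
                float acc = 0.f;
                for (int64_t k = 0; k < K; ++k)
                    acc += A[m * K + k] * B[k * N + n];
                C[m * N + n] = acc;
            }
    }
}
```

Passing every tile index reproduces the full matmul; passing a subset computes only those output blocks, which lets the caller interleave the kernel with its own blocking and threading.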

Checklist

General

  • [ ] Do all unit and benchdnn tests (make test and make test_benchdnn_*) pass locally for each commit?
  • [x] Have you formatted the code using clang-format?

Failing tests: The following tests FAILED:

  • 168 - test_graph_unit_dnnl_large_partition_cpu (Failed)
  • 180 - test_graph_unit_dnnl_sdp_decomp_cpu (Failed)
  • 191 - test_benchdnn_modeC_binary_ci_cpu (Subprocess aborted)
  • 192 - test_benchdnn_modeC_binary_different_dt_ci_cpu (Subprocess aborted)
  • 200 - test_benchdnn_modeC_graph_ci_cpu (Subprocess aborted)

Performance improvements

  • [ ] Have you submitted performance data that demonstrates performance improvements?

New features

  • [ ] Have you published an RFC for the new feature?
  • [ ] Was the RFC approved?
  • [ ] Have you added relevant tests?

Radu2k · Mar 06 '25 20:03

Hi team, please check this PR out: https://github.com/oneapi-src/oneDNN/pull/2862. If you could rebase on top of it, structure the implementation underneath it in the same way, and provide your feedback, that would be great.

The change targets easier maintainability.

dzarukin · Mar 12 '25 02:03

Adding a short summary of the offline discussion:

The oneDNN ukernel API philosophy is to allow simple interoperability with custom user code. Hence, data layouts are transparent to the user and can be queried with static methods.
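
As a rough illustration of that philosophy, a sketch based on the experimental dnnl::ukernel API (built with DNNL_EXPERIMENTAL_UKERNEL); the exact names and signatures of get_B_pack_type and transform below follow my reading of the public headers and may differ between oneDNN versions, so treat them as assumptions:

```cpp
// Sketch only: layout transparency in the experimental ukernel API.
// The required packing for B is queried via a static method rather than
// hidden behind an opaque handle; signatures may differ across versions.
#include "oneapi/dnnl/dnnl_ukernel.hpp"

void query_and_pack_B(int64_t K, int64_t N, const float *B, float *B_packed) {
    using namespace dnnl::ukernel;
    using dnnl::memory;

    // Ask what layout the kernel expects for B given the operand data types.
    pack_type bp = brgemm::get_B_pack_type(
            memory::data_type::f32, memory::data_type::f32);

    if (bp != pack_type::no_trans) {
        // If packing is required, a transform routine produces the queried
        // layout; the packed buffer remains a plain user-owned array.
        transform pack_B(K, N, pack_type::no_trans, /*in_ld=*/N,
                /*out_ld=*/N, memory::data_type::f32, memory::data_type::f32);
        pack_B.generate();
        pack_B.execute(B, B_packed);
    }
}
```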

IIUC, core KleidiAI seems to be compatible with those principles except for some functionalities that require opaque layouts. The ones used in the PR are:

  • bias fusion, as it requires packing the bias with the weights (both interleaved in memory).
  • dynamic quantization, as it requires packing data with quantization parameters (interleaved in memory).

For bias fusion, I would recommend calling KleidiAI without bias to avoid the interleaved weights/bias packing, and processing the bias as a post-op instead. This should keep all data layouts transparent to the user. Regarding dynamic quantization in general, we can evaluate the options as part of a separate RFC.
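
To make the bias suggestion concrete, a minimal sketch in plain C++ (not the oneDNN or KleidiAI API): run the matmul without bias, then add the bias as a separate elementwise pass over the M x N fp32 output, so the weight packing stays free of interleaved bias data.

```cpp
// Apply a per-output-channel bias as a standalone post-op over the fp32
// destination, instead of packing the bias together with the weights.
#include <cstdint>

void apply_bias_postop(float *dst, const float *bias, int64_t M, int64_t N,
        int64_t ldc) {
    for (int64_t m = 0; m < M; ++m)
        for (int64_t n = 0; n < N; ++n)
            dst[m * ldc + n] += bias[n]; // one bias value per column of N
}
```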

Hope I captured the discussion properly, @Radu2k @dzarukin please don't hesitate to correct.

mgouicem · Mar 14 '25 13:03

@mgouicem Thanks for the summary. I think that is an accurate recap of the main points. We will discuss internally and see whether a balance can be struck between the flexibility-first design of the BRGeMM API and the performance-first design of the KleidiAI API.

Sqvid · Mar 18 '25 10:03

@Radu2k can this PR be closed as stale?

Sqvid · Oct 15 '25 10:10