
cpu: rv64: jit: add jit gemm kernel to improve matmul performance

Open · zhangjian29 opened this pull request · 5 comments

Description

This PR introduces a specialized jit_gemm_kernel based on xbyak_riscv to improve the performance of the existing rvv_gemm_f32 implementation on RV64.

Key Feature: Specialized RV64 jit_gemm_kernel Using xbyak_riscv

This PR makes the following changes:

  • Enables the xbyak_riscv JIT backend build flag together with the existing RVV intrinsics backend build flag for RV64 when the toolchain supports RVV.

  • Adds an RV64 jit_generator_t wrapper (with an interface similar to the x64 and aarch64 JIT generators) that encapsulates basic JIT code emission using xbyak_riscv.

  • Adds a JIT-optimized f32 GEMM micro-kernel jit_rvv_gemm_kernel as part of rvv_gemm_utils for the most important GEMM configuration:

    • isTransA = false
    • isTransB = false
    • n_unroll = 4
  • The new jit_rvv_gemm_kernel implements:

    • A fixed 8×4 micro-tile: $$C[0:8, 0:4] = \alpha \cdot A[0:8, 0:K] \cdot B[0:K, 0:4] + \beta \cdot C[0:8, 0:4]$$
    • Column-major interpretation consistent with the existing RVV GEMM code:
      • $A(i, k) = A[i + k \cdot \text{lda}]$
      • $B(k, j) = B[k + j \cdot \text{ldb}]$
      • $C(i, j) = C[i + j \cdot \text{ldc}]$
    • RVV vectorization over the M dimension (8 rows) with a 4-way unrolled K loop and a software-pipelined load/FMA schedule; a scalar reference sketch of this micro-tile is shown after this list.
  • In rvv_gemm_f32, for the specialization

    • isTransA = false
    • isTransB = false
    • n_unroll = 4

    we now delegate the inner 8×4 micro-kernel to jit_rvv_gemm_kernel instead of the hand-written RVV intrinsics loop. All higher-level blocking, threading and tail processing logic remains unchanged.
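
For reference, the listing below is a minimal scalar sketch of what the 8×4 micro-tile computes, using the column-major indexing defined above. The function name and signature are hypothetical and are not the PR's actual API; the real jit_rvv_gemm_kernel emits RVV vector loads and fused multiply-adds via xbyak_riscv instead of executing this loop nest.

```cpp
#include <cstdint>

// Hypothetical scalar reference for the fixed 8x4 micro-tile:
//   C[0:8, 0:4] = alpha * A[0:8, 0:K] * B[0:K, 0:4] + beta * C[0:8, 0:4]
// Column-major layout: A(i,k) = A[i + k*lda], B(k,j) = B[k + j*ldb],
// C(i,j) = C[i + j*ldc].
void gemm_microtile_8x4_ref(std::int64_t K, float alpha, const float *A,
        std::int64_t lda, const float *B, std::int64_t ldb, float beta,
        float *C, std::int64_t ldc) {
    float acc[8][4] = {}; // accumulators for the 8x4 tile, zero-initialized

    // The JIT kernel vectorizes the i-loop (M dimension) with RVV and
    // unrolls the k-loop by 4; here everything is kept scalar for clarity.
    for (std::int64_t k = 0; k < K; ++k)
        for (int j = 0; j < 4; ++j)
            for (int i = 0; i < 8; ++i)
                acc[i][j] += A[i + k * lda] * B[k + j * ldb];

    // Apply the alpha/beta update to the 8x4 block of C.
    for (int j = 0; j < 4; ++j)
        for (int i = 0; i < 8; ++i)
            C[i + j * ldc] = alpha * acc[i][j] + beta * C[i + j * ldc];
}
```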

Checklist

General

  • [x] Do all unit and benchdnn tests (make test and make test_benchdnn_*) pass locally for each commit?
  • [x] Have you formatted the code using clang-format?

Performance Improvements

  • [x] Have you submitted performance data that demonstrates performance improvements?

We evaluated the new jit_rvv_gemm_kernel through the rvv_matmul primitive (which has been verified to use the JIT kernel).

All measurements were taken on an SG2044 platform with fixed CPU affinity (taskset -c 32) and identical compilation flags for both builds (GCC 14.2, -O3). We used the following setup; an illustrative benchdnn invocation is shown after the list:

  • Benchmark: benchdnn matmul workloads
  • Data type: f32
  • benchdnn mode: --mode=P
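
For illustration, a single run under these settings could look like the command below; the batch-file path and flag spellings are assumptions based on the stock benchdnn driver and may differ from the exact commands used for the numbers reported here.

```sh
# Hypothetical example: f32 matmul batch in performance mode, pinned to one core.
taskset -c 32 ./benchdnn --matmul --mode=P --dt=f32 \
    --batch=inputs/matmul/shapes_transformer
```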

Results

Measured as the ratio of total runtimes over all workloads, the JIT kernel improves performance by 1.27× over the existing RVV GEMM implementation.

The detailed per‑shape results are shown below.

Runtime Comparisons (Before vs After, and Speedup)

| Batch Shape | Before (ms) | After (ms) | Speedup (Before / After) |
|---|---:|---:|---:|
| shapes_converted_ip_inf_lb_wd | 202.78 | 156.59 | 1.29× |
| shapes_converted_ip_inf_lb_gmnt | 26.41 | 22.17 | 1.19× |
| shapes_converted_ip_inf_lb_googlenet | 255.53 | 196.46 | 1.30× |
| shapes_converted_ip_inf_lb_resnet | 113.29 | 87.90 | 1.29× |
| shapes_transformer | 145.42 | 130.74 | 1.11× |
| shapes_converted_ip_inf_lb_vgg16 | 4912.15 | 2786.26 | 1.76× |
| shapes_converted_ip_inf_lb_ncf | 39.13 | 27.89 | 1.40× |
| shapes_converted_ip_inf_lb_alexnet | 22442.10 | 18524.10 | 1.21× |
| shapes_converted_ip_inf_lb_maskrcnn | 4277.70 | 3432.17 | 1.25× |
| shapes_converted_ip_inf_lb_rnn_t | 4728.64 | 3891.70 | 1.22× |
| shapes_converted_ip_inf_lb_dlrm | 1300.33 | 1007.73 | 1.29× |
| Total | 38443.49 | 30263.71 | 1.27× |
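
The aggregate 1.27× figure corresponds to the ratio of the total runtimes in the table above:

$$\text{speedup} = \frac{38443.49\ \text{ms}}{30263.71\ \text{ms}} \approx 1.27$$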

zhangjian29 · Dec 03 '25 09:12

Additional improvements for convolution workloads

Same platform and test conditions as above. Measured as the ratio of total runtimes, the JIT kernel improves performance by 1.56× over the existing rvv_gemm_convolution implementation.

| Batch Shape | Before (ms) | After (ms) | Speedup (Before / After) |
|---|---:|---:|---:|
| shapes_gemm | 19255.9 | 14326.8 | 1.34× |
| shapes_googlenet_v3 | 23177.3 | 12308.2 | 1.88× |
| shapes_mobilenet | 3432.73 | 1613.88 | 2.13× |
| shapes_resnet_50 | 34015.8 | 22381.4 | 1.52× |
| shapes_vgg_11 | 21579.9 | 14315.9 | 1.51× |
| Total | 101461.63 | 64946.18 | 1.56× |

zhangjian29 · Dec 03 '25 10:12

Hi @zhangfeiv0,

I would appreciate your suggestions on this PR.

zhangjian29 · Dec 10 '25 01:12

I'm curious: why was the benchmark bound to only a single core rather than multiple cores?

zhangfeiv0 · Dec 10 '25 03:12

> I'm curious: why was the benchmark bound to only a single core rather than multiple cores?

There are many users on the test machine, so I selected a single idle CPU core to isolate the run from interference by other user processes. I could also use more cores, e.g. 8, to measure multicore performance.

zhangjian29 · Dec 10 '25 03:12

Multicore Performance Comparisons

  • Using taskset -c 32-39
Matmul workloads:

| Batch Shape | Before This PR (ms) | After This PR (ms) | Speedup (Before / After) |
|---|---:|---:|---:|
| shapes_converted_ip_inf_lb_wd | 25.6739 | 23.0902 | 1.11× |
| shapes_converted_ip_inf_lb_gmnt | 3.41655 | 3.02603 | 1.13× |
| shapes_converted_ip_inf_lb_googlenet | 37.4437 | 30.5433 | 1.23× |
| shapes_converted_ip_inf_lb_resnet | 33.4940 | 14.0115 | 2.39× |
| shapes_transformer | 38.2065 | 19.6884 | 1.94× |
| Total | 138.23465 | 90.35943 | 1.53× |

Convolution workloads:

| Batch Shape | Before This PR (ms) | After This PR (ms) | Speedup (Before / After) |
|---|---:|---:|---:|
| shapes_gemm | 3010.55 | 2829.25 | 1.06× |
| shapes_googlenet_v3 | 3685.88 | 3276.23 | 1.13× |
| shapes_mobilenet | 610.151 | 487.503 | 1.25× |
| shapes_resnet_50 | 3959.14 | 3617.16 | 1.09× |
| shapes_vgg_11 | 3082.23 | 2860.39 | 1.08× |
| Total | 14347.951 | 13070.533 | 1.10× |

zhangjian29 · Dec 10 '25 09:12