cpu: rv64: jit: add jit gemm kernel to improve matmul performance
Description
This PR introduces a specialized JIT GEMM kernel based on `xbyak_riscv` to improve performance over the existing `rvv_gemm_f32` implementation on RV64.
Key Feature: Specialized RV64 `jit_gemm_kernel` Using `xbyak_riscv`
This PR makes the following changes:

- Enables the `xbyak_riscv` JIT backend build flag together with the existing RVV intrinsics backend build flag for RV64 when the toolchain supports RVV.
- Adds an RV64 `jit_generator_t` wrapper (similar in methods to the `x64` and `aarch64` JIT generators) to encapsulate basic JIT code emission using `xbyak_riscv`.
- Adds a JIT-optimized f32 GEMM micro-kernel, `jit_rvv_gemm_kernel`, as part of `rvv_gemm_utils` for the most important GEMM configuration: `isTransA = false`, `isTransB = false`, `n_unroll = 4`.
- The new `jit_rvv_gemm_kernel` implements:
  - A fixed 8×4 micro-tile: $$C[0:8, 0:4] = \alpha \cdot A[0:8, 0:K] \cdot B[0:K, 0:4] + \beta \cdot C[0:8, 0:4]$$
  - Column-major interpretation consistent with the existing RVV GEMM code:
    - $A(i, k) = A[i + k \cdot \text{lda}]$
    - $B(k, j) = B[k + j \cdot \text{ldb}]$
    - $C(i, j) = C[i + j \cdot \text{ldc}]$
  - RVV vectorization over the M dimension (8 rows) with a 4-way unrolled K loop and a software-pipelined load/FMA schedule (see the scalar reference sketch after this list).
- In `rvv_gemm_f32`, for the specialization `isTransA = false`, `isTransB = false`, `n_unroll = 4`, we now delegate the inner 8×4 micro-kernel to `jit_rvv_gemm_kernel` instead of the hand-written RVV intrinsics loop. All higher-level blocking, threading, and tail-processing logic remains unchanged.
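For reference, here is a minimal scalar C++ sketch of the computation the 8×4 micro-tile performs; the function and variable names are illustrative, not taken from this PR. The actual JIT kernel vectorizes the row loop with RVV, unrolls the K loop by 4, and software-pipelines loads against FMAs rather than iterating over scalar elements:

```cpp
#include <cstddef>

// Scalar reference for the 8x4 micro-tile:
//   C[0:8, 0:4] = alpha * A[0:8, 0:K] * B[0:K, 0:4] + beta * C[0:8, 0:4]
// using the column-major indexing of the existing RVV GEMM code.
void gemm_ref_8x4(std::size_t K, float alpha, const float *A, std::size_t lda,
        const float *B, std::size_t ldb, float beta, float *C,
        std::size_t ldc) {
    constexpr std::size_t m_tile = 8, n_tile = 4;
    float acc[m_tile][n_tile] = {}; // per-tile accumulators

    for (std::size_t k = 0; k < K; ++k)
        for (std::size_t j = 0; j < n_tile; ++j)
            for (std::size_t i = 0; i < m_tile; ++i)
                // A(i, k) = A[i + k * lda], B(k, j) = B[k + j * ldb]
                acc[i][j] += A[i + k * lda] * B[k + j * ldb];

    for (std::size_t j = 0; j < n_tile; ++j)
        for (std::size_t i = 0; i < m_tile; ++i)
            // C(i, j) = C[i + j * ldc]
            C[i + j * ldc] = alpha * acc[i][j] + beta * C[i + j * ldc];
}
```

Delegating only this hot micro-kernel to the JIT path keeps the surrounding blocking, threading, and tail handling identical to the intrinsics implementation, which limits the scope of the change while capturing the speedup where it matters most.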
Checklist
General
- [x] Do all unit and benchdnn tests (`make test` and `make test_benchdnn_*`) pass locally for each commit?
- [x] Have you formatted the code using clang-format?
Performance Improvements
- [x] Have you submitted performance data that demonstrates performance improvements?
We evaluated the new `jit_rvv_gemm_kernel` through the `rvv_matmul` primitive (which has been verified to use the JIT kernel).
All measurements were taken on an SG2044 platform with fixed CPU affinity (`taskset -c 32`) and the same compilation flags (gcc 14.2, `-O3`). We used:

- Benchmark: `benchdnn` matmul workloads
- Data type: f32
- benchdnn mode: `--mode=P`
Results
On average, the JIT kernel improves performance by 1.27× over the existing RVV GEMM implementation.
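Here the average is computed as the ratio of total runtimes across all shapes, consistent with the Total row of the table below:

$$\text{speedup}_{\text{avg}} = \frac{\sum_i T_i^{\text{before}}}{\sum_i T_i^{\text{after}}} = \frac{38443.49}{30263.71} \approx 1.27\times$$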
The detailed per‑shape results are shown below.
Runtime Comparisons (Before vs After, and Speedup)
| Batch Shape | Before (ms) | After (ms) | Speedup (Before / After) |
|---|---|---|---|
| shapes_converted_ip_inf_lb_wd | 202.78 | 156.59 | 1.29× |
| shapes_converted_ip_inf_lb_gmnt | 26.41 | 22.17 | 1.19× |
| shapes_converted_ip_inf_lb_googlenet | 255.53 | 196.46 | 1.30× |
| shapes_converted_ip_inf_lb_resnet | 113.29 | 87.90 | 1.29× |
| shapes_transformer | 145.42 | 130.74 | 1.11× |
| shapes_converted_ip_inf_lb_vgg16 | 4912.15 | 2786.26 | 1.76× |
| shapes_converted_ip_inf_lb_ncf | 39.13 | 27.89 | 1.40× |
| shapes_converted_ip_inf_lb_alexnet | 22442.10 | 18524.10 | 1.21× |
| shapes_converted_ip_inf_lb_maskrcnn | 4277.70 | 3432.17 | 1.25× |
| shapes_converted_ip_inf_lb_rnn_t | 4728.64 | 3891.70 | 1.22× |
| shapes_converted_ip_inf_lb_dlrm | 1300.33 | 1007.73 | 1.29× |
| Total | 38443.49 | 30263.71 | 1.27× |
More Improvements on Convolution Computations
Under the same platform and test conditions, the JIT kernel improves performance by 1.56× over the existing `rvv_gemm_convolution` implementation.
| Batch Shape | Before (ms) | After (ms) | Speedup (Before / After) |
|---|---|---|---|
| shapes_gemm | 19255.9 | 14326.8 | 1.34× |
| shapes_googlenet_v3 | 23177.3 | 12308.2 | 1.88× |
| shapes_mobilenet | 3432.73 | 1613.88 | 2.13× |
| shapes_resnet_50 | 34015.8 | 22381.4 | 1.52× |
| shapes_vgg_11 | 21579.9 | 14315.9 | 1.51× |
| Total | 101461.63 | 64946.18 | 1.56× |
Hi @zhangfeiv0,
I would appreciate your suggestions on this PR.
I'm curious: why test with only a single core pinned, rather than multiple cores?
> I'm curious: why test with only a single core pinned, rather than multiple cores?
The test machine is shared by many users, so I selected a single idle CPU core to isolate the benchmark from interference by other user processes. I could also pin more cores, e.g. 8, to test multicore performance.
Multicore Performance Comparisons
- Using `taskset -c 32-39`

Matmul workloads:

| Batch Shape | Before This PR (ms) | After This PR (ms) | Speedup (Before / After) |
|---|---|---|---|
| shapes_converted_ip_inf_lb_wd | 25.6739 | 23.0902 | 1.11× |
| shapes_converted_ip_inf_lb_gmnt | 3.41655 | 3.02603 | 1.13× |
| shapes_converted_ip_inf_lb_googlenet | 37.4437 | 30.5433 | 1.23× |
| shapes_converted_ip_inf_lb_resnet | 33.4940 | 14.0115 | 2.39× |
| shapes_transformer | 38.2065 | 19.6884 | 1.94× |
| Total | 138.23465 | 90.35943 | 1.53× |
Convolution workloads:

| Batch Shape | Before This PR (ms) | After This PR (ms) | Speedup (Before / After) |
|---|---|---|---|
| shapes_gemm | 3010.55 | 2829.25 | 1.06× |
| shapes_googlenet_v3 | 3685.88 | 3276.23 | 1.13× |
| shapes_mobilenet | 610.151 | 487.503 | 1.25× |
| shapes_resnet_50 | 3959.14 | 3617.16 | 1.09× |
| shapes_vgg_11 | 3082.23 | 2860.39 | 1.08× |
| Total | 14347.951 | 13070.533 | 1.10× |