
cpu: rv64: jit: add jit gemm kernel to improve matmul performance

Open · zhangjian29 opened this pull request · 5 comments

Description

This PR introduces a specialized jit_gemm_kernel based on xbyak_riscv to improve the performance of the existing rvv_gemm_f32 implementation on RV64.

Key Feature: Specialized RV64 jit_gemm_kernel Using xbyak_riscv

This PR makes the following changes:

  • Enables the xbyak_riscv JIT backend build flag together with the existing RVV intrinsics backend build flag for RV64 when the toolchain supports RVV.

  • Adds an RV64 jit_generator_t wrapper (with an interface similar to the x64 and aarch64 JIT generators) that encapsulates basic JIT code emission using xbyak_riscv.

  • Adds a JIT-optimized f32 GEMM micro-kernel jit_rvv_gemm_kernel as part of rvv_gemm_utils for the most important GEMM configuration:

    • isTransA = false
    • isTransB = false
    • n_unroll = 4
  • The new jit_rvv_gemm_kernel implements:

    • A fixed 8×4 micro-tile: $$C[0:8, 0:4] = \alpha \cdot A[0:8, 0:K] \cdot B[0:K, 0:4] + \beta \cdot C[0:8, 0:4]$$
    • Column-major interpretation consistent with the existing RVV GEMM code:
      • $A(i, k) = A[i + k \cdot \text{lda}]$
      • $B(k, j) = B[k + j \cdot \text{ldb}]$
      • $C(i, j) = C[i + j \cdot \text{ldc}]$
    • RVV vectorization over the M dimension (8 rows) with a 4-way unrolled K loop and a software-pipelined load/FMA schedule; a scalar reference sketch of this micro-tile is shown after this list.
  • In rvv_gemm_f32, for the specialization

    • isTransA = false
    • isTransB = false
    • n_unroll = 4

    we now delegate the inner 8×4 micro-kernel to jit_rvv_gemm_kernel instead of the hand-written RVV intrinsics loop. All higher-level blocking, threading and tail processing logic remains unchanged.
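
For reference, the listing below is a minimal scalar sketch of what the 8×4 micro-tile computes, using the column-major indexing defined above. The function name and signature are hypothetical and are not the PR's actual API; the real jit_rvv_gemm_kernel emits RVV vector loads and fused multiply-adds via xbyak_riscv instead of executing this loop nest.

```cpp
#include <cstdint>

// Hypothetical scalar reference for the fixed 8x4 micro-tile:
//   C[0:8, 0:4] = alpha * A[0:8, 0:K] * B[0:K, 0:4] + beta * C[0:8, 0:4]
// Column-major layout: A(i,k) = A[i + k*lda], B(k,j) = B[k + j*ldb],
// C(i,j) = C[i + j*ldc].
void gemm_microtile_8x4_ref(std::int64_t K, float alpha, const float *A,
        std::int64_t lda, const float *B, std::int64_t ldb, float beta,
        float *C, std::int64_t ldc) {
    float acc[8][4] = {}; // accumulators for the 8x4 tile, zero-initialized

    // The JIT kernel vectorizes the i-loop (M dimension) with RVV and
    // unrolls the k-loop by 4; here everything is kept scalar for clarity.
    for (std::int64_t k = 0; k < K; ++k)
        for (int j = 0; j < 4; ++j)
            for (int i = 0; i < 8; ++i)
                acc[i][j] += A[i + k * lda] * B[k + j * ldb];

    // Apply the alpha/beta update to the 8x4 block of C.
    for (int j = 0; j < 4; ++j)
        for (int i = 0; i < 8; ++i)
            C[i + j * ldc] = alpha * acc[i][j] + beta * C[i + j * ldc];
}
```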

Checklist

General

  • [x] Do all unit and benchdnn tests (make test and make test_benchdnn_*) pass locally for each commit?
  • [x] Have you formatted the code using clang-format?

Performance Improvements

  • [x] Have you submitted performance data that demonstrates performance improvements?

We evaluated the new jit_rvv_gemm_kernel through the rvv_matmul primitive (which has been verified to use the JIT kernel).

All measurements were taken on an SG2044 platform with fixed CPU affinity (taskset -c 32) and identical compilation flags for both builds (GCC 14.2, -O3). We used the following setup; an illustrative benchdnn invocation is shown after the list:

  • Benchmark: benchdnn matmul workloads
  • Data type: f32
  • benchdnn mode: --mode=P
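
For illustration, a single run under these settings could look like the command below; the batch-file path and flag spellings are assumptions based on the stock benchdnn driver and may differ from the exact commands used for the numbers reported here.

```sh
# Hypothetical example: f32 matmul batch in performance mode, pinned to one core.
taskset -c 32 ./benchdnn --matmul --mode=P --dt=f32 \
    --batch=inputs/matmul/shapes_transformer
```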

Results

Measured as the ratio of total runtimes over all workloads, the JIT kernel improves performance by 1.27× over the existing RVV GEMM implementation.

The detailed per‑shape results are shown below.

Runtime Comparisons (Before vs After, and Speedup)

| Batch Shape | Before (ms) | After (ms) | Speedup (Before / After) |
|---|---:|---:|---:|
| shapes_converted_ip_inf_lb_wd | 202.78 | 156.59 | 1.29× |
| shapes_converted_ip_inf_lb_gmnt | 26.41 | 22.17 | 1.19× |
| shapes_converted_ip_inf_lb_googlenet | 255.53 | 196.46 | 1.30× |
| shapes_converted_ip_inf_lb_resnet | 113.29 | 87.90 | 1.29× |
| shapes_transformer | 145.42 | 130.74 | 1.11× |
| shapes_converted_ip_inf_lb_vgg16 | 4912.15 | 2786.26 | 1.76× |
| shapes_converted_ip_inf_lb_ncf | 39.13 | 27.89 | 1.40× |
| shapes_converted_ip_inf_lb_alexnet | 22442.10 | 18524.10 | 1.21× |
| shapes_converted_ip_inf_lb_maskrcnn | 4277.70 | 3432.17 | 1.25× |
| shapes_converted_ip_inf_lb_rnn_t | 4728.64 | 3891.70 | 1.22× |
| shapes_converted_ip_inf_lb_dlrm | 1300.33 | 1007.73 | 1.29× |
| Total | 38443.49 | 30263.71 | 1.27× |
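
The aggregate 1.27× figure corresponds to the ratio of the total runtimes in the table above:

$$\text{speedup} = \frac{38443.49\ \text{ms}}{30263.71\ \text{ms}} \approx 1.27$$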

zhangjian29 · Dec 03 '25 09:12

Additional improvements for convolution workloads

Same platform and test conditions as above. Measured as the ratio of total runtimes, the JIT kernel improves performance by 1.56× over the existing rvv_gemm_convolution implementation.

| Batch Shape | Before (ms) | After (ms) | Speedup (Before / After) |
|---|---:|---:|---:|
| shapes_gemm | 19255.9 | 14326.8 | 1.34× |
| shapes_googlenet_v3 | 23177.3 | 12308.2 | 1.88× |
| shapes_mobilenet | 3432.73 | 1613.88 | 2.13× |
| shapes_resnet_50 | 34015.8 | 22381.4 | 1.52× |
| shapes_vgg_11 | 21579.9 | 14315.9 | 1.51× |
| Total | 101461.63 | 64946.18 | 1.56× |

zhangjian29 · Dec 03 '25 10:12

Hi @zhangfeiv0,

I would appreciate your suggestions on this PR.

zhangjian29 · Dec 10 '25 01:12

I'm curious: why was the benchmark bound to only a single core rather than multiple cores?

zhangfeiv0 · Dec 10 '25 03:12

> I'm curious: why was the benchmark bound to only a single core rather than multiple cores?

There are many users on the test machine, so I selected a single idle CPU core to isolate the run from interference by other user processes. I could also use more cores, e.g. 8, to measure multicore performance.

zhangjian29 · Dec 10 '25 03:12

Multicore Performance Comparisons

  • Using taskset -c 32-39
Matmul workloads:

| Batch Shape | Before This PR (ms) | After This PR (ms) | Speedup (Before / After) |
|---|---:|---:|---:|
| shapes_converted_ip_inf_lb_wd | 25.6739 | 23.0902 | 1.11× |
| shapes_converted_ip_inf_lb_gmnt | 3.41655 | 3.02603 | 1.13× |
| shapes_converted_ip_inf_lb_googlenet | 37.4437 | 30.5433 | 1.23× |
| shapes_converted_ip_inf_lb_resnet | 33.4940 | 14.0115 | 2.39× |
| shapes_transformer | 38.2065 | 19.6884 | 1.94× |
| Total | 138.23465 | 90.35943 | 1.53× |

Convolution workloads:

| Batch Shape | Before This PR (ms) | After This PR (ms) | Speedup (Before / After) |
|---|---:|---:|---:|
| shapes_gemm | 3010.55 | 2829.25 | 1.06× |
| shapes_googlenet_v3 | 3685.88 | 3276.23 | 1.13× |
| shapes_mobilenet | 610.151 | 487.503 | 1.25× |
| shapes_resnet_50 | 3959.14 | 3617.16 | 1.09× |
| shapes_vgg_11 | 3082.23 | 2860.39 | 1.08× |
| Total | 14347.951 | 13070.533 | 1.10× |

zhangjian29 · Dec 10 '25 09:12