Added RISC-V V extension intrinsics for LLVM
Implementation of the paper "Tensor Program Optimization for the RISC-V Vector Extension Using Probabilistic Programs".
cc @cbalint13, can you help take a look?
@fzi-peccia , can you look at the i386 CI failure?
@fzi-peccia ,
Permit me a change proposal on how to avoid aprofile (which serves ARM only); I don't know whether it will be kept in the future.
Instead, let's use information from the LLVM side and reuse the existing VLEN inference (via target.llvm_get_vector_width); a short sketch follows after the list below.
- Here is how it would look: tvm-rvv-noaprofile.diff.txt, applicable on top of your current branch.
- This will also fix the i386 CI failure caused by the alteration of aprofile (currently ARM-only code).
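For illustration, a minimal sketch of that VLEN inference path (assumption: the registered global target.llvm_get_vector_width takes a Target and returns the vector width in bits; the exact signature and the target string here are only examples, the diff is authoritative):

```python
# Minimal sketch: infer the vector width (VLEN) from the LLVM target instead
# of relying on aprofile. Assumption: the packed func
# "target.llvm_get_vector_width" accepts a tvm.target.Target and returns the
# preferred vector width in bits.
import tvm

def rvv_vector_width_bits():
    target = tvm.target.Target(
        "llvm -mtriple=riscv64-unknown-linux-gnu -mcpu=generic-rv64 -mattr=+v,+zvl256b"
    )
    llvm_get_vector_width = tvm.get_global_func("target.llvm_get_vector_width")
    return llvm_get_vector_width(target)

if __name__ == "__main__":
    print("Inferred vector width (bits):", rvv_vector_width_bits())
```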
I am all-in to see this merged; it is a very good start for future IME tensorization, beyond what LLVM supports (or will support).
LATER UPDATE
- The diff here was re-uploaded as a .txt file (apologies for the late edit).
- The idea is also in line with: https://github.com/apache/tvm/pull/18199
Sorry all, I was on vacation; I will tackle these comments this week.
Hi @cbalint13 . Thank you very much for the feedback and the diff. I implemented the changes you suggested and also rebased on main.
Regarding the mixed dtype cases, the original idea was to support them, and this kernel_dtype is a mistake left over from those days. I have now replaced it with input_dtype. Perhaps for this version we could merge without the mixed-dtype cases and add that feature in the future. What do you think?
@fzi-peccia ,
Tests were done by tuning a resnet18 model. Here are the TVM program and the results after 5000 trials: rvv-resnet18-mstune-rpc-2025Aug22.tar.gz
Tests
In an RPC setup, I used the provided tvm-rvv-tune.py script; roughly, the setup corresponds to the sketch right below.
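A rough sketch of that setup (the attached tvm-rvv-tune.py is the authoritative script; the tracker host/port/key, the target string and the ms.tune_tir entry point are assumptions for illustration only):

```python
# Rough sketch of the RPC tuning setup (assumptions: MetaSchedule's
# RPCConfig/RPCRunner and ms.tune_tir APIs as in current TVM main; the
# tracker settings below are placeholders).
import tvm
from tvm import meta_schedule as ms

target = tvm.target.Target(
    "llvm -mtriple=riscv64-unknown-linux-gnu -mcpu=generic-rv64 -mattr=+v -num-cores=4"
)

runner = ms.runner.RPCRunner(
    rpc_config=ms.runner.RPCConfig(
        tracker_host="0.0.0.0",   # placeholder
        tracker_port=9190,        # placeholder
        tracker_key="riscv64",    # placeholder
        session_timeout_sec=120,
    )
)

# `mod` would be the imported resnet18 module; tuning then runs 5000 trials,
# logging into workdir/ (whose logs are grep'ed below).
# database = ms.tune_tir(mod, target=target, work_dir="workdir",
#                        max_trials_global=5000, runner=runner)
```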
- There were trial proposals for tensorization:
$ cat workdir/logs/*.log | grep Tensorizing | awk '{print $NF}' | sort -u
rvv_float32_multivmul_8_16_m8
rvv_float32_multivmul_8_32_m8
rvv_float32_multivmul_8_4_m8
rvv_float32_multivmul_8_64_m8
rvv_float32_multivmul_8_8_m8
rvv_float32_vmacc_1_16_m8
rvv_float32_vmacc_1_32_m8
rvv_float32_vmacc_1_4_m8
rvv_float32_vmacc_1_64_m8
rvv_float32_vmacc_1_8_m8
rvv_float32_vmul_1_16_m8
rvv_float32_vmul_1_32_m8
rvv_float32_vmul_1_4_m8
rvv_float32_vmul_1_64_m8
rvv_float32_vmul_1_8_m8
- Post analysis of all entries at the IR level (a sketch of this check follows after the output below):
$ ./msch-database-tir-parse.py
Parsed #5000 records
No tensorized schedules found.
This needs investigation.
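Conceptually, the check performed by msch-database-tir-parse.py amounts to something like the following sketch (assuming the tuning records live in a MetaSchedule JSONDatabase under workdir/; the actual script is the authoritative version):

```python
# Conceptual sketch of the database scan: count tuning records whose trace
# applied a tensor intrinsic. Assumption: records are stored in a MetaSchedule
# JSONDatabase with the default file names under workdir/.
from tvm import meta_schedule as ms

db = ms.database.JSONDatabase(
    path_workload="workdir/database_workload.json",
    path_tuning_record="workdir/database_tuning_record.json",
)

records = db.get_all_tuning_records()
tensorized = 0
for rec in records:
    # A schedule that applied a tensor intrinsic carries a Tensorize
    # instruction in its trace.
    if any(inst.kind.name == "Tensorize" for inst in rec.trace.insts):
        tensorized += 1

print(f"Parsed #{len(records)} records")
if tensorized == 0:
    print("No tensorized schedules found.")
else:
    print(f"{tensorized} tensorized schedules found.")
```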
@fzi-peccia ,
> $ ./msch-database-tir-parse.py
> Parsed #5000 records
> No tensorized schedules found.
> This needs investigation.
Based on the #18224 investigation, it seems the RVV intrinsic templates need a double check (see the example fix of the issue there). The code posted here looked from the beginning like an oldish TVM flow, using relay (guessing) as the graph import.
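For reference, the matching mechanism in question is the TensorIntrin desc/impl registration that MetaSchedule pattern-matches against the compute it sees. A hedged sketch with illustrative shapes, names and a scalar placeholder impl (not the PR's actual template):

```python
# Hedged sketch: how a desc/impl pair is registered as a TensorIntrin so the
# tuner can pattern-match it. Shapes and names are illustrative only; a real
# impl would emit llvm.riscv.* intrinsics instead of the scalar loop.
import tvm
from tvm.script import tir as T

@T.prim_func
def rvv_dot4_f32_desc(
    A: T.Buffer((4,), "float32", offset_factor=1),
    B: T.Buffer((4,), "float32", offset_factor=1),
    C: T.Buffer((1,), "float32", offset_factor=1),
) -> None:
    with T.block("root"):
        T.reads(C[0], A[0:4], B[0:4])
        T.writes(C[0])
        for k in T.serial(4):
            with T.block("update"):
                vk = T.axis.reduce(4, k)
                C[0] = C[0] + A[vk] * B[vk]

@T.prim_func
def rvv_dot4_f32_impl(
    A: T.Buffer((4,), "float32", offset_factor=1),
    B: T.Buffer((4,), "float32", offset_factor=1),
    C: T.Buffer((1,), "float32", offset_factor=1),
) -> None:
    with T.block("root"):
        T.reads(C[0], A[0:4], B[0:4])
        T.writes(C[0])
        # Placeholder body; the real implementation would lower to RVV
        # intrinsics (vle/vfmul/vfredusum) as in the dump further below.
        for k in T.serial(4):
            with T.block("update"):
                vk = T.axis.reduce(4, k)
                C[0] = C[0] + A[vk] * B[vk]

tvm.tir.TensorIntrin.register("rvv_dot4_f32_sketch", rvv_dot4_f32_desc, rvv_dot4_f32_impl)
```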
Further, I investigated the correctness of the proposed tensorization kernels.
The proposed multivmul does multiple dot products, which would yield the highest benefit inside RVV.
All tests here need https://github.com/apache/tvm/pull/18232
- The proposed kernels look wrong; the implementation also produces bad numerical results: riscv64-rvv-kernels-pr18182.py.gz
$ ./riscv64-rvv-kernels-pr18182.py 64
Testing rvv_float32_multivmul_8_64_m8
C (output): (8,) [float32]
[1363. 0. 0. 0. 0. 0. 0. 0.]
Output (kernel) [1363. 0. 0. 0. 0. 0. 0. 0.]
Output (numpy) [1363. 1407. 1460. 1388. 1504. 1373. 1268. 1270.]
$ ./riscv64-rvv-kernels-pr18182.py 32
Testing rvv_float32_multivmul_8_32_m8
C (output): (8,) [float32]
[699. 0. 0. 0. 0. 0. 0. 0.]
Output (kernel) [699. 0. 0. 0. 0. 0. 0. 0.]
Output (numpy) [699. 493. 671. 707. 635. 639. 764. 611.]
$ ./riscv64-rvv-kernels-pr18182.py 16
Testing rvv_float32_multivmul_8_16_m8
C (output): (8,) [float32]
[425. 0. 0. 0. 0. 0. 0. 0.]
Output (kernel) [425. 0. 0. 0. 0. 0. 0. 0.]
Output (numpy) [425. 192. 382. 464. 465. 382. 438. 202.]
{...}
- Here is a working reference fp32 kernel leveraging one-shot full RVV occupancy: riscv64-rvv-full-fp32_kern.py.gz
$ ./riscv64-rvv-full-fp32_kern.py
# from tvm.script import ir as I
# from tvm.script import tir as T
@I.ir_module
class Module:
    @T.prim_func
    def main(A_handle: T.handle, B_handle: T.handle, C_handle: T.handle):
        T.func_attr({"global_symbol": "rvv_dot_4f32_4x4f32_2f32"})
        A = T.match_buffer(A_handle, (4,), align=4, offset_factor=1)
        B = T.match_buffer(B_handle, (4, 4), strides=(4, 1), align=4, offset_factor=1)
        C = T.match_buffer(C_handle, (4,), align=4, offset_factor=1)
        with T.block("root"):
            T.reads(A[0:4], B[0:4, 0:4])
            T.writes(C[0:4])
            zero: T.float32xvscalex2 = T.call_llvm_intrin("float32xvscalex2", "llvm.riscv.vfmv.v.f", T.Broadcast(T.float32(0.0), T.vscale() * 2), C[0], T.uint64(1))
            vec_A: T.float32xvscalex4 = T.call_llvm_intrin("float32xvscalex4", "llvm.riscv.vle", T.Broadcast(T.float32(0.0), T.vscale() * 4), T.tvm_access_ptr(T.type_annotation("float32"), A.data, 0, 4, 1), T.int64(4))
            for i in range(4):
                with T.block("reduction"):
                    vi = T.axis.spatial(4, i)
                    T.reads(B[0:4, 0:4])
                    T.writes(C[vi])
                    vec_B: T.float32xvscalex4 = T.call_llvm_intrin("float32xvscalex4", "llvm.riscv.vle", T.Broadcast(T.float32(0.0), T.vscale() * 4), T.tvm_access_ptr(T.type_annotation("float32"), B.data, vi * 4, 4, 1), T.int64(4))
                    product: T.float32xvscalex4 = T.call_llvm_intrin("float32xvscalex4", "llvm.riscv.vfmul", T.Broadcast(T.float32(0.0), T.vscale() * 4), vec_A, vec_B, T.uint64(7), T.uint64(4))
                    reduction_result_vec: T.float32xvscalex2 = T.call_llvm_intrin("float32xvscalex2", "llvm.riscv.vfredusum", T.Broadcast(T.float32(0.0), T.vscale() * 2), product, zero, T.uint64(7), T.uint64(4))
                    C[vi] = T.call_llvm_intrin("float32", "llvm.riscv.vfmv.f.s", reduction_result_vec)
[6. 6. 9. 3.]
[[3. 7. 7. 7.]
[0. 2. 5. 7.]
[3. 9. 5. 7.]
[9. 3. 6. 1.]]
Output (kernel) [144. 78. 138. 129.]
Output (numpy) [144. 78. 138. 129.]
For this working sample, 4 x (4x4) -> 4 lanes in the VLEN=256 @ fp32 case is the maximum for a fully occupied RVV machine.
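As a quick cross-check of the dump above, the kernel computes C[i] = dot(A, B[i, :]); a NumPy reference with the values printed in the run:

```python
# Cross-check of the kernel output above: rvv_dot_4f32_4x4f32_2f32 computes
# C[i] = dot(A, B[i, :]) for i in 0..3 (values copied from the run above).
import numpy as np

A = np.array([6.0, 6.0, 9.0, 3.0], dtype="float32")
B = np.array(
    [[3.0, 7.0, 7.0, 7.0],
     [0.0, 2.0, 5.0, 7.0],
     [3.0, 9.0, 5.0, 7.0],
     [9.0, 3.0, 6.0, 1.0]],
    dtype="float32",
)
C = (A * B).sum(axis=1)
print(C)  # -> [144.  78. 138. 129.], matching both kernel and numpy outputs above
```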
Now, besides the template-matching issues due to the relax flow (exemplified with a working dense/matmul test case), the numerical implementation of the kernels themselves is also wrong, and personally I don't see how they fully exploit the RVV machine (a working test case is provided above as well).
@fzi-peccia ,
I don't know how else to help move this forward; feel free to reuse this working draft. Thank you 🙏