[Bug] [RISC-V RVV] Performance Regression: sum operator slower on RVV than RV
Description
The sum operator shows significant performance degradation when using the RISC‑V Vector (RVV) extension compared to the scalar RV baseline. The acceleration ratio is 0.325, meaning the RVV version is about 3× slower. This is unexpected because vector extensions should improve performance, especially for reduction operations like sum.
Steps to Reproduce
- Generate the sum operator with the following configuration:

```python
params = {
    "op_name": "sum",  # name referenced by export_sum below
    "dtype": "float32",
    "batch": 14,
    "channels": 23,
    "input_height": 67,
    "input_width": 99,
    "axis": 1,
    "keepdims": True,
}
```
- Export the operator to two targets:
  - RV target (scalar, without vector extension):
    `llvm -mtriple=riscv64-linux-gnu -mcpu=generic-rv64 -mabi=lp64d -mattr=+64bit,+m,+a,+f,+d,+c`
  - RVV target (with vector extension):
    `llvm -mtriple=riscv64-linux-gnu -mcpu=generic-rv64 -mabi=lp64d -mattr=+64bit,+m,+a,+f,+d,+c,+v`
- Run performance measurement on both targets.
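Note that the two target strings are identical except for the trailing `+v` feature flag, so the vector extension is the only variable between the two measurements. A quick sanity check of that claim:

```python
# The two LLVM target strings from the reproduction steps above.
rv_target = ("llvm -mtriple=riscv64-linux-gnu -mcpu=generic-rv64 "
             "-mabi=lp64d -mattr=+64bit,+m,+a,+f,+d,+c")
rvv_target = ("llvm -mtriple=riscv64-linux-gnu -mcpu=generic-rv64 "
              "-mabi=lp64d -mattr=+64bit,+m,+a,+f,+d,+c,+v")

# Compare the -mattr feature sets: only "+v" should differ.
rv_attrs = set(rv_target.split("-mattr=")[1].split(","))
rvv_attrs = set(rvv_target.split("-mattr=")[1].split(","))
assert rvv_attrs - rv_attrs == {"+v"}
```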
Operator definition code:

```python
from tvm import relay

def export_sum(params, set_dir=None, platform="rv"):
    data = relay.var(
        "data",
        shape=(params["batch"], params["channels"],
               params["input_height"], params["input_width"]),
        dtype=params["dtype"],
    )
    sum_op = relay.sum(data, axis=params["axis"], keepdims=params["keepdims"])
    export_op(sum_op, params["op_name"], [data], params, set_dir=set_dir)
```
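For reference, the expected result of this operator can be sketched with NumPy, independently of TVM (this is only a model of the computation, not the code under test):

```python
import numpy as np

# Input matches the params above: (batch, channels, height, width).
data = np.ones((14, 23, 67, 99), dtype="float32")

# relay.sum with axis=1, keepdims=True reduces over the channel
# dimension and keeps it as a singleton axis.
out = data.sum(axis=1, keepdims=True)

print(out.shape)   # (14, 1, 67, 99)
print(out[0, 0, 0, 0])  # 23.0, since 23 ones are summed per output element
```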
Performance Data
- RV execution time: 9.301150 ms
- RVV execution time: 28.622800 ms
- Acceleration ratio (RV/RVV): 0.325 (RVV is ~3× slower)
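The reported ratio follows directly from the two measurements (acceleration ratio here means RV time divided by RVV time, so values below 1.0 indicate an RVV slowdown):

```python
rv_ms = 9.301150    # scalar RV execution time (ms)
rvv_ms = 28.622800  # RVV execution time (ms)

# Acceleration ratio: > 1.0 would mean RVV is faster; here it is ~3x slower.
ratio = rv_ms / rvv_ms
print(round(ratio, 3))  # 0.325
```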
Environment Information
- TVM version: 0.19.0
- LLVM version: [Please provide: `llvm-config --version`]
- Hardware: Spacemit K1-X bit-brick board
- CPU: Spacemit X60 (8 cores, 1.6 GHz)
- ISA: rv64imafdcv (with vector extensions)
- Memory: 7.6 GB
- OS: Bianbu 2.2, Linux kernel 6.6.63
Expected Behavior
RVV vectorization should provide a performance improvement over the scalar RV baseline for reduction operations like sum.
Additional Context
- The sum operation reduces along axis=1 on a tensor of shape (14, 23, 67, 99) (≈2.1M elements).
- The performance regression suggests suboptimal vectorization for reduction operations on RVV.
- Other operators (log, relu, bias_add, sqrt, etc.) also show similar regressions, indicating a broader RVV code‑generation or optimization issue.