[Bug] [RISC-V RVV] Performance Degradation: ReLU activation slower with vector extension
Description
The ReLU (rectified linear unit) operator shows significant performance degradation with the RISC‑V Vector (RVV) extension. The acceleration ratio is 0.337, meaning the RVV version is about 3× slower than the scalar implementation. This is unexpected for a simple elementwise operation that should benefit greatly from vectorization.
Steps to Reproduce
- Generate the ReLU operator with the following configuration:

```python
params = {
    "op_name": "relu",  # assumed name: export_relu below reads params["op_name"]
    "dtype": "float32",
    "batch": 14,
    "channels": 23,
    "input_height": 67,
    "input_width": 99,
}
```
- Export the operator to two targets. The two target strings are identical except for the trailing `+v` feature flag (see the sketch after this list):
  - RV target (scalar, without vector extension):

    ```
    llvm -mtriple=riscv64-linux-gnu -mcpu=generic-rv64 -mabi=lp64d -mattr=+64bit,+m,+a,+f,+d,+c
    ```
  - RVV target (with vector extension):

    ```
    llvm -mtriple=riscv64-linux-gnu -mcpu=generic-rv64 -mabi=lp64d -mattr=+64bit,+m,+a,+f,+d,+c,+v
    ```
- Run performance measurement on both targets.
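For reference, a minimal sketch of how the two targets could be constructed; the variable names are illustrative, not taken from the reporter's harness:

```python
import tvm

BASE = ("llvm -mtriple=riscv64-linux-gnu -mcpu=generic-rv64 "
        "-mabi=lp64d -mattr=+64bit,+m,+a,+f,+d,+c")

rv_target = tvm.target.Target(BASE)           # scalar baseline
rvv_target = tvm.target.Target(BASE + ",+v")  # identical, plus the vector extension
```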
Operator definition code (`export_op` is the reporter's helper and is not shown in the issue):

```python
from tvm import relay

def export_relu(params, set_dir=None, platform="rv"):
    # Build a Relay graph containing a single elementwise ReLU.
    data = relay.var("data",
                     shape=(params["batch"], params["channels"],
                            params["input_height"], params["input_width"]),
                     dtype=params["dtype"])
    relu = relay.nn.relu(data)
    # Compile and save the operator for the selected platform/target.
    export_op(relu, params["op_name"], [data], params, set_dir=set_dir)
```
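Since `export_op` is not shown, here is a self-contained sketch of one way to reproduce the build and measurement, assuming TVM 0.19 with Relay and that the script runs natively on the board; it substitutes plain `relay.build` plus the graph executor's `time_evaluator` for the reporter's export/measure pipeline:

```python
import numpy as np
import tvm
from tvm import relay
from tvm.contrib import graph_executor

shape = (14, 23, 67, 99)  # 2,135,826 float32 elements
data = relay.var("data", shape=shape, dtype="float32")
mod = tvm.IRModule.from_expr(relay.nn.relu(data))

base = ("llvm -mtriple=riscv64-linux-gnu -mcpu=generic-rv64 "
        "-mabi=lp64d -mattr=+64bit,+m,+a,+f,+d,+c")
for name, target in (("rv", base), ("rvv", base + ",+v")):
    with tvm.transform.PassContext(opt_level=3):
        lib = relay.build(mod, target=target)
    dev = tvm.cpu(0)
    runtime = graph_executor.GraphModule(lib["default"](dev))
    runtime.set_input("data", np.random.uniform(-1, 1, shape).astype("float32"))
    timer = runtime.module.time_evaluator("run", dev, number=10, repeat=3)
    print(name, "%.3f ms" % (timer().mean * 1e3))
```

Note that on an x86 host these targets would only cross-compile; the timing loop assumes it is executed on the K1‑X board itself.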
Performance Data
- RV execution time: 7.945310 ms
- RVV execution time: 23.579300 ms
- Acceleration ratio (RV time / RVV time): 7.945310 / 23.579300 ≈ 0.337, i.e. the RVV build is ~3× slower
Environment Information
- TVM version: 0.19.0
- LLVM version: not provided (can be obtained with `llvm-config --version`)
- Hardware: Spacemit K1‑X bit‑brick board
- CPU: Spacemit X60 (8 cores, 1.6 GHz)
- ISA: rv64imafdcv (with vector extensions)
- Memory: 7.6 GB
- OS: Bianbu 2.2, Linux kernel 6.6.63
- Operation: elementwise ReLU on ~2.1M elements (14 × 23 × 67 × 99 = 2,135,826)
Expected Behavior
RVV vectorization should provide a performance improvement over the scalar RV baseline for simple elementwise operations like ReLU.
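One way to narrow down whether this is a codegen problem would be to dump the assembly TVM emits for the RVV build and check whether vector instructions are present at all; a hedged sketch, where `lib` is the factory module returned by `relay.build` in the sketch above and the mnemonics are just examples of what to look for:

```python
# `lib` is the result of relay.build(...) for the RVV target (see the sketch above).
asm = lib.get_lib().get_source("asm")

# Presence of vsetvli and vector loads/stores suggests RVV codegen kicked in;
# their absence would point at the vectorizer rather than the hardware.
for mnemonic in ("vsetvli", "vle32.v", "vse32.v", "vfmax.vf"):
    print(mnemonic, "found" if mnemonic in asm else "missing")
```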
Additional Context
- The ReLU operation is applied elementwise to a tensor of ~2.1M elements (14 × 23 × 67 × 99).
- The severe performance regression (3× slower) is particularly surprising for such a simple operation that should be a perfect candidate for vectorization.
- This issue is part of a broader pattern where multiple operators (sum, log, relu, bias_add, sqrt, etc.) show significant performance degradation with RVV, suggesting a potential systemic issue in TVM's RVV code generation or optimization.
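If it helps triage, the same harness can be looped over the other affected operators to quantify the pattern; a minimal sketch covering a few of the unary ops named above (the op selection is illustrative):

```python
import tvm
from tvm import relay

elemwise_ops = {
    "relu": relay.nn.relu,
    "log": relay.log,
    "sqrt": relay.sqrt,
}
for op_name, make_op in elemwise_ops.items():
    data = relay.var("data", shape=(14, 23, 67, 99), dtype="float32")
    mod = tvm.IRModule.from_expr(make_op(data))
    # Build and time `mod` for both targets exactly as in the sketch above,
    # then report rv_time / rvv_time per operator.
```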