[Bug] [RISC-V RVV] Performance Issue: bias_add operator slower with vectorization
Description
The bias_add operator is significantly slower when compiled with the RISC‑V Vector (RVV) extension than without it. The acceleration ratio (RV time / RVV time) is 0.360, i.e. the RVV build is about 2.8× slower than the scalar build. This is unexpected for a channel‑wise addition, which should benefit from vectorization.
Steps to Reproduce
- Generate the bias_add operator with the following configuration:
  params = {
      "dtype": "float32",
      "batch": 14,
      "channels": 23,
      "input_height": 67,
      "input_width": 99
  }
- Export the operator to two targets:
  - RV target (scalar, without vector extension):
    llvm -mtriple=riscv64-linux-gnu -mcpu=generic-rv64 -mabi=lp64d -mattr=+64bit,+m,+a,+f,+d,+c
  - RVV target (with vector extension):
    llvm -mtriple=riscv64-linux-gnu -mcpu=generic-rv64 -mabi=lp64d -mattr=+64bit,+m,+a,+f,+d,+c,+v
- Run performance measurement on both targets.
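The two target strings differ only in the +v attribute. As a small sketch (the helper name make_target and the constant BASE_ATTRS are hypothetical and not part of TVM), both strings can be built from one shared base so that the scalar and vector builds cannot drift apart in any other flag:

```python
# Hypothetical helper, not part of TVM: builds the two LLVM target strings
# used in this issue from one shared base so that they differ only in "+v".
BASE_ATTRS = ["+64bit", "+m", "+a", "+f", "+d", "+c"]


def make_target(vector: bool) -> str:
    """Return the scalar (RV) or vector (RVV) target string."""
    attrs = BASE_ATTRS + (["+v"] if vector else [])
    return (
        "llvm -mtriple=riscv64-linux-gnu -mcpu=generic-rv64 "
        "-mabi=lp64d -mattr=" + ",".join(attrs)
    )


rv_target = make_target(vector=False)
rvv_target = make_target(vector=True)

# The RVV string is the RV string plus the vector extension attribute.
assert rvv_target == rv_target + ",+v"
print(rv_target)
print(rvv_target)
```

Either string can then be handed to whatever the export helper expects as a target, for example tvm.target.Target(rv_target).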
Operator definition code:
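from tvm import relay

def export_bias_add(params, set_dir=None, platform="rv"):
    # NCHW input tensor
    data = relay.var("data",
                     shape=(params["batch"], params["channels"],
                            params["input_height"], params["input_width"]),
                     dtype=params["dtype"])
    # One bias value per channel; bias_add broadcasts along axis=1 by default
    bias = relay.var("bias", shape=(params["channels"],), dtype=params["dtype"])
    bias_add = relay.nn.bias_add(data, bias)
    # export_op is defined elsewhere (not shown in this issue);
    # params must also contain an "op_name" entry
    export_op(bias_add, params["op_name"], [data, bias], params, set_dir=set_dir)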
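For reference, with the default axis=1, relay.nn.bias_add computes out[n, c, h, w] = data[n, c, h, w] + bias[c]. The pure-Python sketch below spells out that loop nest on a flat row-major buffer. The function name bias_add_nchw and the flat layout are illustrative assumptions, not the code TVM generates. The innermost loop runs over a contiguous H*W plane with a loop-invariant bias value, which is the pattern one would expect to vectorize well.

```python
def bias_add_nchw(data, bias, shape):
    """Reference bias_add on a flat, row-major NCHW buffer.

    Illustrative only: mirrors out[n, c, h, w] = data[n, c, h, w] + bias[c].
    """
    n, c, h, w = shape
    assert len(data) == n * c * h * w and len(bias) == c
    out = [0.0] * len(data)
    plane = h * w  # elements per (batch, channel) plane, contiguous in memory
    for b in range(n):
        for ch in range(c):
            base = (b * c + ch) * plane
            bias_val = bias[ch]  # loop-invariant scalar for the whole plane
            for i in range(plane):  # contiguous inner loop
                out[base + i] = data[base + i] + bias_val
    return out


# Tiny example: 1 batch, 2 channels, 1x3 spatial plane.
small = bias_add_nchw([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], [10.0, 20.0], (1, 2, 1, 3))
print(small)  # [11.0, 12.0, 13.0, 24.0, 25.0, 26.0]
```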
Performance Data
- RV execution time: 7.683920 ms
- RVV execution time: 21.363800 ms
- Acceleration ratio (RV/RVV): 0.360 (RVV is ~2.8× slower)
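The ratio and the slowdown factor follow directly from the two measured times. A minimal check (variable names are illustrative):

```python
rv_ms = 7.683920    # scalar (RV) execution time from this report
rvv_ms = 21.363800  # vector (RVV) execution time from this report

ratio = rv_ms / rvv_ms     # acceleration ratio; > 1 would mean RVV is faster
slowdown = rvv_ms / rv_ms  # how many times slower the RVV build is

print(f"acceleration ratio: {ratio:.3f}")  # 0.360
print(f"RVV slowdown: {slowdown:.2f}x")    # 2.78x
```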
Environment Information
- TVM version: 0.19.0
- LLVM version: [Please provide: llvm-config --version]
- Hardware: Spacemit K1‑X bit‑brick board
- CPU: Spacemit X60 (8 cores, 1.6 GHz)
- ISA: rv64imafdcv (with vector extensions)
- Memory: 7.6 GB
- OS: Bianbu 2.2, Linux kernel 6.6.63
- Operation: Channel‑wise bias addition on a tensor of shape (14, 23, 67, 99)
Expected Behavior
RVV vectorization should provide a performance improvement over the scalar RV baseline for broadcast addition operations like bias_add.
Additional Context
- The bias_add operation adds a 1D bias vector to each channel of a 4D tensor (14 × 23 × 67 × 99 = 2,135,826 elements, ≈2.1M).
- The slowdown is severe and resembles the slowdowns seen for other operators (sum, log, relu, etc.).
- This suggests that the current RVV vectorization for broadcast operations may be suboptimal, or there are inefficiencies in memory access patterns or instruction selection.
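As a rough sanity check on the memory-access hypothesis, the measured times can be converted into an effective bandwidth. The sketch assumes each float32 element is read once and written once and ignores the 23-element bias vector. These are simplifying assumptions, not measurements, and the helper name effective_gb_per_s is illustrative.

```python
shape = (14, 23, 67, 99)
elements = 1
for dim in shape:
    elements *= dim

bytes_per_elem = 4  # float32
# Assumption: one read of the input and one write of the output per element;
# the 23-element bias vector is small enough to ignore.
bytes_moved = 2 * elements * bytes_per_elem


def effective_gb_per_s(time_ms: float) -> float:
    """Effective bandwidth in GB/s (1 GB = 1e9 bytes) for a run time in ms."""
    return bytes_moved / (time_ms * 1e-3) / 1e9


rv_bw = effective_gb_per_s(7.683920)
rvv_bw = effective_gb_per_s(21.363800)

print(f"elements: {elements}")   # 2135826
print(f"RV:  {rv_bw:.2f} GB/s")  # about 2.22
print(f"RVV: {rvv_bw:.2f} GB/s") # about 0.80
```

Under these assumptions the scalar build moves roughly 2.2 GB/s and the RVV build roughly 0.8 GB/s. How close either figure is to the board's achievable bandwidth is not known here, so the estimate only frames the problem and does not identify the cause.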