
Proposal: Add RISC-V Vector (RVV) Optimizations to Accelerate Core Algorithms

Open Dayuxiaoshui opened this issue 5 months ago • 13 comments

Hi Density team,

First of all, thank you for creating this fantastic high-performance compression library.

I'm writing to propose adding RISC-V Vector (RVV) optimizations to further enhance the library's already impressive performance on the RISC-V architecture. As RISC-V continues to gain traction in high-performance and embedded systems, native RVV acceleration would be a valuable asset and a natural extension of the library's goals.

Analysis of Potential Hotspots

After analyzing the source code, particularly src/algorithms/chameleon/chameleon.rs, I've identified the core encoding/decoding loop as an excellent candidate for vectorization. The encode_quad function (and its counterparts in the decoder) contains several operations that are highly parallelizable:

  1. Hash Calculation: The operation (quad.wrapping_mul(CHAMELEON_HASH_MULTIPLIER) >> ...) inside encode_quad is a prime target. RVV can perform this multiplication and shift on multiple u32 quads simultaneously.

  2. Dictionary Access: With a batch of calculated hashes, it should be possible to use RVV's indexed-load ("gather") instructions (e.g. vluxei32.v) to load multiple values from the chunk_map dictionary in parallel.

  3. Conditional Logic: The core decision if *dictionary_value != quad can be handled using vector comparison instructions to generate a mask. This mask can then be used to control which lanes write the original quad (a miss) versus the hash (a hit).

  4. Memory Writes: The final writes to the output buffer can be accelerated with ordinary vector stores (vse32.v), or with indexed-store ("scatter") instructions (e.g. vsuxei32.v) for more complex layouts.
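Taken together, steps 1–4 form a classic hash → gather → compare → select pattern. Below is a portable scalar sketch of one batched iteration, standing in for the RVV intrinsics; the batch width, hash constant, and function names are illustrative and not taken from the Density sources:

```rust
const BATCH: usize = 8; // stands in for VLEN/32 lanes

// Illustrative hash; the real multiplier and shift live in chameleon.rs.
fn hash(quad: u32) -> u32 {
    quad.wrapping_mul(0x9E37_79B1) >> 16
}

/// One batched step: hash all lanes (vmul.vx + vsrl.vi), gather the
/// dictionary entries (vluxei32.v), compare (vmseq.vv), then select the
/// hash (hit) or the original quad (miss) per lane via the mask.
fn process_batch(
    quads: &[u32; BATCH],
    chunk_map: &mut [u32],
) -> ([u32; BATCH], [bool; BATCH]) {
    let mut out = [0u32; BATCH];
    let mut hits = [false; BATCH];

    // Step 1: the hash of every lane is independent, hence vectorizable.
    let mut hashes = [0usize; BATCH];
    for lane in 0..BATCH {
        hashes[lane] = hash(quads[lane]) as usize;
    }

    // Steps 2–4: gather, compare, masked select, dictionary update.
    for lane in 0..BATCH {
        let dict = &mut chunk_map[hashes[lane]];
        hits[lane] = *dict == quads[lane];
        out[lane] = if hits[lane] { hashes[lane] as u32 } else { quads[lane] };
        *dict = quads[lane]; // miss: learn the quad (no-op on a hit)
    }
    (out, hits)
}
```

One pitfall this sketch glosses over: two lanes in the same batch can hash to the same chunk_map slot, creating a read-after-write hazard that the scalar loop resolves naturally. A real RVV port has to detect such intra-batch conflicts or tolerate a slightly different dictionary state than the scalar code would produce.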

Proposed Implementation

My idea is to introduce conditionally compiled code paths using #[cfg(all(target_arch = "riscv64", target_feature = "v"))].

Inside such a block, we could implement a new, vectorized version of the block processing loop that would handle a batch of quads (e.g., 4, 8, or more, depending on VLEN) at once using RVV intrinsics. This approach would keep the existing, portable code untouched while offering significant speedups on capable hardware. The same principle could be applied to Cheetah and Lion as well.
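As a sketch of what that conditional compilation could look like (the function names and the placeholder per-quad transform are mine, not from the Density codebase; the RVV variant is stubbed because the core::arch RISC-V vector intrinsics are still nightly-only):

```rust
// Vectorized path, compiled only for RISC-V with the V extension.
#[cfg(all(target_arch = "riscv64", target_feature = "v"))]
fn encode_block(quads: &[u32], out: &mut Vec<u32>) {
    // Here the loop would process `vl` quads per iteration using RVV
    // intrinsics (vmul, vsrl, vluxei32, vmseq, masked stores).
    // Stubbed to the scalar loop until the intrinsics stabilize.
    encode_block_scalar(quads, out);
}

// Portable path, used on every other target; identical to today's code.
#[cfg(not(all(target_arch = "riscv64", target_feature = "v")))]
fn encode_block(quads: &[u32], out: &mut Vec<u32>) {
    encode_block_scalar(quads, out);
}

fn encode_block_scalar(quads: &[u32], out: &mut Vec<u32>) {
    // Placeholder transform standing in for the real encode_quad logic.
    for &q in quads {
        out.push(q.wrapping_mul(0x9E37_79B1) >> 16);
    }
}
```

Because both definitions share one signature and the selection happens at compile time, callers are unaffected and the portable code path stays byte-for-byte identical on non-RVV targets.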

Questions for the Maintainers

Before I proceed with a proof-of-concept, I'd love to get your feedback:

  1. Is this a direction that aligns with the project's roadmap?
  2. Would you be open to accepting a Pull Request for this feature if implemented cleanly and with corresponding benchmarks?
  3. Are there any architectural preferences or potential pitfalls I should be aware of when integrating this kind of platform-specific optimization?

I am willing to work on implementing this feature.

Thank you for your time and consideration.

Best regards.

Dayuxiaoshui · Jul 23 '25 10:07