Proposal: Add RISC-V Vector (RVV) Optimizations to Accelerate Core Algorithms
Hi Density team,
First of all, thank you for creating this fantastic high-performance compression library.
I'm writing to propose the addition of RISC-V Vector (RVV) optimizations to further enhance the library's already impressive performance, specifically on the RISC-V architecture. As RISC-V continues to gain traction in high-performance and embedded systems, providing native RVV-accelerated performance would be a great asset and a natural extension of the library's goals.
## Analysis of Potential Hotspots
After analyzing the source code, particularly `src/algorithms/chameleon/chameleon.rs`, I've identified the core encoding/decoding loop as an excellent candidate for vectorization. The `encode_quad` function (and its counterparts in the decoder) contains several operations that are highly parallelizable:
- **Hash Calculation:** The operation `(quad.wrapping_mul(CHAMELEON_HASH_MULTIPLIER) >> ...)` inside `encode_quad` is a prime target. RVV can perform this multiplication and shift on multiple `u32` quads simultaneously.
- **Dictionary Access:** With a batch of calculated hashes, it should be possible to use RVV's vector-indexed "gather" loads (`vluxei32.v`) to fetch multiple values from the `chunk_map` dictionary in parallel.
- **Conditional Logic:** The core decision `if *dictionary_value != quad` can be handled with a vector comparison (`vmseq.vv`) that generates a mask. The mask can then control which lanes write the original `quad` (a miss) versus the `hash` (a hit).
- **Memory Writes:** The final writes to the output buffer can also be accelerated using vector store instructions, or scatter instructions (`vsuxei32.v`) for more complex scenarios. A scalar sketch of this four-step pipeline follows below.
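To make this concrete, here is a scalar Rust sketch of the batched pipeline, with comments mapping each step to the RVV instruction that would replace it. The batch width, the multiplier and shift values, the flat dictionary layout, and the `u32`-only output are all simplified placeholders, not Density's actual Chameleon internals (as I understand it, the real format emits 16-bit hashes on hits and records hit/miss bits in a signature, which this sketch glosses over):

```rust
const BATCH: usize = 8; // stand-in for one vsetvli-determined vector length
const HASH_MULTIPLIER: u32 = 0x9E37_79B1; // placeholder for CHAMELEON_HASH_MULTIPLIER
const HASH_SHIFT: u32 = 16; // placeholder for the elided `>> ...` shift

fn encode_batch(quads: [u32; BATCH], dictionary: &mut [u32; 1 << 16], output: &mut Vec<u32>) {
    // Step 1 - hash calculation: one vmul.vx followed by one vsrl.vx.
    let mut hashes = [0u32; BATCH];
    for lane in 0..BATCH {
        hashes[lane] = quads[lane].wrapping_mul(HASH_MULTIPLIER) >> HASH_SHIFT;
    }

    // Step 2 - dictionary access: vluxei32.v, a vector-indexed gather
    // (its index operand takes byte offsets, i.e. `hashes` scaled by 4).
    let mut entries = [0u32; BATCH];
    for lane in 0..BATCH {
        entries[lane] = dictionary[hashes[lane] as usize];
    }

    // Step 3 - conditional logic: vmseq.vv produces a per-lane hit mask.
    let mut hit = [false; BATCH];
    for lane in 0..BATCH {
        hit[lane] = entries[lane] == quads[lane];
    }

    // Step 4 - memory writes: mask-controlled stores pick the hash on a
    // hit and the quad on a miss; the dictionary update is a masked
    // scatter (vsuxei32.v). Updating only on a miss is equivalent here,
    // since on a hit the entry already equals the quad.
    for lane in 0..BATCH {
        if hit[lane] {
            output.push(hashes[lane]);
        } else {
            output.push(quads[lane]);
            dictionary[hashes[lane] as usize] = quads[lane];
        }
    }
}
```

One subtlety worth flagging up front: within a batch, two lanes can hash to the same dictionary slot, so a later lane may observe a stale entry that the sequential scalar code would already have updated. The decoder would have to mirror the exact same batched update order to stay in sync, which is probably the trickiest correctness aspect of this design.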
## Proposed Implementation
My idea is to introduce conditionally compiled code paths using `#[cfg(all(target_arch = "riscv64", target_feature = "v"))]`.
Inside such a block, we could implement a new, vectorized version of the block processing loop that would handle a batch of quads (e.g., 4, 8, or more, depending on VLEN) at once using RVV intrinsics. This approach would keep the existing, portable code untouched while offering significant speedups on capable hardware. The same principle could be applied to Cheetah and Lion as well.
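Structurally, the gating could look like the sketch below. Every name in it (`encode_block`, `encode_block_scalar`, the `rvv` module) is an illustrative placeholder rather than Density's actual internal API, and since RVV intrinsics are not yet stabilized in `core::arch`, the vector body would currently require a nightly toolchain or inline assembly:

```rust
#[cfg(all(target_arch = "riscv64", target_feature = "v"))]
mod rvv {
    // The vsetvli-driven batch loop (see the pipeline sketch above) would
    // replace this stub, which simply defers to the portable path.
    pub fn encode_block(input: &[u32], output: &mut Vec<u32>) {
        super::encode_block_scalar(input, output);
    }
}

// Stand-in for the existing portable loop, which stays untouched.
fn encode_block_scalar(input: &[u32], output: &mut Vec<u32>) {
    output.extend_from_slice(input);
}

pub fn encode_block(input: &[u32], output: &mut Vec<u32>) {
    #[cfg(all(target_arch = "riscv64", target_feature = "v"))]
    {
        rvv::encode_block(input, output);
    }
    #[cfg(not(all(target_arch = "riscv64", target_feature = "v")))]
    {
        encode_block_scalar(input, output);
    }
}
```

Compile-time gating keeps the first iteration simple; as far as I know, runtime RISC-V feature detection in `std` is still unstable, so a `target_feature`-based `cfg` seems like the pragmatic starting point.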
## Questions for the Maintainers
Before I proceed with a proof-of-concept, I'd love to get your feedback:
- Is this a direction that aligns with the project's roadmap?
- Would you be open to accepting a Pull Request for this feature if implemented cleanly and with corresponding benchmarks? (A possible criterion harness is sketched after this list.)
- Are there any architectural preferences or potential pitfalls I should be aware of when integrating this kind of platform-specific optimization?
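On the benchmarking point, I would plan to use criterion, roughly along the lines of the following harness. The `encode_block` stub here only makes the snippet compile on its own; a real PR would of course benchmark Density's actual entry points on both the scalar and RVV paths:

```rust
use criterion::{criterion_group, criterion_main, Criterion};
use std::hint::black_box;

// Stub standing in for the placeholder entry point from the earlier sketch.
fn encode_block(input: &[u32], output: &mut Vec<u32>) {
    output.extend_from_slice(input);
}

fn bench_chameleon_encode(c: &mut Criterion) {
    // Values masked to 16 bits so the dictionary sees a mix of hits and misses.
    let input: Vec<u32> = (0u32..(1 << 20))
        .map(|i| i.wrapping_mul(0x9E37_79B1) & 0xFFFF)
        .collect();
    c.bench_function("chameleon_encode_block", |b| {
        b.iter(|| {
            let mut out = Vec::with_capacity(input.len());
            encode_block(black_box(&input), &mut out);
            black_box(out)
        })
    });
}

criterion_group!(benches, bench_chameleon_encode);
criterion_main!(benches);
```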
I am willing to work on implementing this feature.
Thank you for your time and consideration.
Best regards.