
[NEW] RDB Compression via LZ4, Batching, and Batch-Level Dictionary

Open sarthakaggarwal97 opened this issue 8 months ago • 5 comments

This proposal introduces a refined approach to handling key–value pair serialization and compression in Valkey. By integrating LZ4 compression, employing batch processing of key–value pairs, and applying an auxiliary dictionary over data chunks, we aim to achieve improved compression ratios.

Motivation

Historically, serializing key–value data as discrete records led to suboptimal compression efficiency. By batching multiple records and leveraging a dictionary structure to capture common substrings within these batches, we have the opportunity to:

  • Enhance Compression: The combination of LZ4’s rapid compression algorithm with dictionary-based deduplication leads to smaller file sizes.
  • Improve Throughput: Batching reduces per-record overhead, optimizing both write and read performance.
  • Adapt to Data Redundancy: The employed dictionary mechanism is well-suited for environments where key or value similarities are prevalent, thus capitalizing on inherent data redundancy.

Detailed Implementation

1. LZ4 Compression Integration

  • Library Integration: We have incorporated the LZ4 library into the RDB flow.
  • Compression API: The API now compresses raw byte arrays of batched key–value pairs.
  • Error Handling: Additional checks provide a fallback path when compression fails, preserving data integrity.
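
A minimal sketch of this fallback using LZ4's one-shot API; `rdbCompressBatch` is a hypothetical helper, not the POC's actual function:

```c
#include <stdlib.h>
#include "lz4.h"

/* Compress a serialized batch. If LZ4 fails (returns 0) or does not shrink
 * the input, signal the caller to write the batch uncompressed so data
 * integrity is preserved. */
int rdbCompressBatch(const char *src, int src_len, char **out) {
    int bound = LZ4_compressBound(src_len);
    char *dst = malloc(bound);
    if (!dst) return -1;
    int clen = LZ4_compress_default(src, dst, src_len, bound);
    if (clen <= 0 || clen >= src_len) {
        free(dst);   /* fallback: caller emits the raw payload instead */
        return -1;
    }
    *out = dst;      /* caller frees after writing the chunk */
    return clen;
}
```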

2. Batching of Key–Value Pairs

  • Batching Buffer: Instead of serializing each key–value pair independently, records are appended to a temporary in-memory batch buffer.
  • Batch Size: A tunable batch size parameter controls the maximum number of records per batch, balancing memory overhead with compression gains.
  • Serialization: The batched data is pre-processed into a continuous byte sequence, increasing the redundancy that compression algorithms can exploit.
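
A minimal sketch of the batch buffer, assuming a simple length-prefixed framing (the POC's actual layout may differ):

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical batch buffer: records are appended as
 * [key_len][key][val_len][val] until the configured chunk size is reached,
 * then the whole buffer is handed to the compressor and reset. */
typedef struct {
    char *buf;        /* reused across chunks to limit fragmentation */
    size_t used;
    size_t capacity;  /* tunable chunk size, e.g. 64 KB */
} batchBuffer;

/* Returns 1 if the batch is full and should be compressed and flushed first.
 * Records larger than the chunk size would need a separate path (not shown). */
int batchAppend(batchBuffer *b, const char *key, uint32_t klen,
                const char *val, uint32_t vlen) {
    size_t need = sizeof(klen) + klen + sizeof(vlen) + vlen;
    if (b->used + need > b->capacity) return 1;
    memcpy(b->buf + b->used, &klen, sizeof(klen)); b->used += sizeof(klen);
    memcpy(b->buf + b->used, key, klen);           b->used += klen;
    memcpy(b->buf + b->used, &vlen, sizeof(vlen)); b->used += sizeof(vlen);
    memcpy(b->buf + b->used, val, vlen);           b->used += vlen;
    return 0;
}
```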

3. Dictionary on Top of Chunks

  • Dictionary Creation:
    • For each batch, we generate a dictionary that maps frequent substrings or token patterns.
    • This dictionary is used in a two-phase compression: first to substitute repeating patterns, then to feed the result into LZ4. The dictionary itself is managed by the LZ4 library.
  • Chunk Division:
    • As soon as the batched data reaches the configured chunk size, the chunk is compressed and flushed to the RDB file.
    • A dictionary is maintained for each chunk, capturing local redundancy.
  • Compression Flow:
    • The chunk-level dictionary and the batched data are passed to the LZ4 compressor, which transforms the chunk into a more compressible representation (see the sketch below).
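
A sketch of feeding the chunk-level dictionary to LZ4 via its streaming API, which is the dictionary mechanism the library itself manages; `compressChunkWithDict` is illustrative:

```c
#include "lz4.h"

/* Compress one chunk against a per-chunk dictionary. `dict` would hold the
 * frequent patterns derived from the batch. Returns the compressed size,
 * or <= 0 if compression failed. */
int compressChunkWithDict(const char *dict, int dict_len,
                          const char *chunk, int chunk_len,
                          char *dst, int dst_capacity) {
    LZ4_stream_t *stream = LZ4_createStream();
    if (!stream) return -1;
    /* Prime the stream so matches can reference the dictionary. */
    LZ4_loadDict(stream, dict, dict_len);
    int clen = LZ4_compress_fast_continue(stream, chunk, dst,
                                          chunk_len, dst_capacity,
                                          1 /* default acceleration */);
    LZ4_freeStream(stream);
    return clen;
}
```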

4. Low-Level Data Flow Diagram

[Diagram: low-level data flow and the interaction between components]

RDB Decompression Flow

Upon application restart, the system needs to reload key–value data from the compressed RDB file while maintaining high performance and data integrity. The process follows these steps:

1. RDB File Loading

  • File Open & Metadata Parsing: The process begins by opening the RDB file and parsing the header information. This header contains metadata such as chunk boundaries, compression flags, and dictionary references.
  • Integrity Check: Before decompression starts, a checksum or integrity validation step is performed to ensure the data file is not corrupted.
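
For illustration, a per-chunk check could reuse Valkey's existing crc64() helper; where the checksum lives in the POC's format is an assumption here:

```c
#include <stdint.h>
#include "crc64.h"  /* Valkey's existing CRC-64 implementation */

/* Hypothetical per-chunk integrity check: compare the checksum stored in
 * the chunk metadata against a CRC-64 computed over the compressed payload. */
int verifyChunk(const unsigned char *payload, uint64_t len, uint64_t expected) {
    return crc64(0, payload, len) == expected;
}
```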

2. Chunk-wise Data Processing

  • Chunk Identification: The file is logically divided into multiple chunks. Each chunk corresponds to a batch of key–value pairs that were previously compressed.
  • Metadata Extraction: For each chunk, the associated metadata (including dictionary details and compression parameters) is extracted to guide the decompression procedure.
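
A hypothetical chunk header showing the kind of metadata involved; the field names are illustrative, not the POC's actual on-disk format:

```c
#include <stdint.h>

/* Per-chunk metadata read ahead of the compressed payload. */
typedef struct {
    uint32_t compressed_len;   /* bytes of LZ4 payload that follow */
    uint32_t uncompressed_len; /* size needed for the decompress buffer */
    uint32_t dict_len;         /* 0 when no dictionary was used */
    uint8_t  flags;            /* e.g. bit 0: payload is compressed */
} chunkHeader;
```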

3. Decompression Pipeline

  • LZ4 Decompression: The LZ4 engine is invoked to decompress the compressed byte stream, restoring the dictionary-transformed representation of the data.
  • Dictionary Restoration: Using the metadata, the corresponding dictionary for the chunk is loaded. The dictionary is then applied to reverse the earlier substitution process, restoring the original representation of the data.
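
LZ4 exposes a dictionary-aware decompression entry point that mirrors the streaming compression above; a minimal wrapper might look like this:

```c
#include "lz4.h"

/* Reverse of the compression step: decompress one chunk using the dictionary
 * recovered from its metadata. Returns the decompressed size, or < 0 on
 * error (corrupt input or undersized destination buffer). */
int decompressChunkWithDict(const char *src, int src_len,
                            char *dst, int dst_capacity,
                            const char *dict, int dict_len) {
    return LZ4_decompress_safe_usingDict(src, dst, src_len, dst_capacity,
                                         dict, dict_len);
}
```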

4. Batch Deserialization & Data Reconstruction

  • Batch Reassembly: After all chunks have been decompressed and dictionary transformations reversed, they are combined to recreate the original batch of serialized key–value pairs.
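
Assuming the length-prefixed framing sketched earlier, reassembly reduces to a linear walk over the decompressed buffer:

```c
#include <stdint.h>
#include <string.h>

/* Walk a decompressed batch and hand each key-value pair to the loader via
 * `emit`. Mirrors the [key_len][key][val_len][val] framing sketched above.
 * Returns -1 on a truncated or malformed record. */
int batchForEach(const char *buf, size_t len,
                 void (*emit)(const char *key, uint32_t klen,
                              const char *val, uint32_t vlen)) {
    size_t pos = 0;
    while (pos < len) {
        uint32_t klen, vlen;
        if (len - pos < sizeof(klen)) return -1;
        memcpy(&klen, buf + pos, sizeof(klen)); pos += sizeof(klen);
        if (len - pos < klen) return -1;
        const char *key = buf + pos; pos += klen;
        if (len - pos < sizeof(vlen)) return -1;
        memcpy(&vlen, buf + pos, sizeof(vlen)); pos += sizeof(vlen);
        if (len - pos < vlen) return -1;
        emit(key, klen, buf + pos, vlen); pos += vlen;
    }
    return 0;
}
```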

5. Application Initialization

  • In-Memory Loading: Once deserialization is complete, the key–value pairs are loaded into the application's memory structures for immediate use.
  • Validation: Optionally, a secondary check is performed to ensure that the decompression and deserialization yield a consistent, complete, and correct data set.

[Diagram: low-level decompression flow]

Performance Benchmarks

Notes:

  1. We see immense gains with valkey-benchmark because the same 20 KB value is used in every SET operation, a pattern where batching and the dictionary are especially powerful.
  2. There is a tradeoff between compute and memory: the batch buffer gives us a larger window in which to find common patterns for compression.
  3. Most Valkey users should fall into the third benchmark category.
  4. The POC covers string key–value pairs.

[Image: benchmark results]

Related Work

  • Code POC: https://github.com/valkey-io/valkey/compare/unstable...sarthakaggarwal97:valkey:valkey-rdb-batch-dict
  • Related Issues: https://github.com/valkey-io/valkey/issues/1223

sarthakaggarwal97 avatar Apr 16 '25 03:04 sarthakaggarwal97

Thanks for the detailed write-up @sarthakaggarwal97!

To add a bit of context behind the issue: we weren't seeing much benefit from a drop-in replacement of the LZF library with LZ4 (Sarthak shared the results here: https://github.com/valkey-io/valkey/issues/1223#issuecomment-2676743759). So we wanted to go a step further and apply batching logic with it, to understand whether it's worth taking a dependency on the library. With the batching logic, we were able to take advantage of the dictionary structure mentioned above. The upside is that we reduce the RDB size and the time to save/load it. The potential downside is that we need to allocate a static (configurable) buffer to handle the batching of data, so there will be a small bump in memory usage to accommodate that. We're reusing the buffer, so there shouldn't be much memory fragmentation.

So, overall, if we are to proceed with this approach, we need to accept:

  1. Adding the LZ4 library to the project.
  2. Serializing/deserializing data in batches.

Note: LZF provides no comparable dictionary mechanism to leverage.

Now, coming to the benchmark numbers shared above: I think for small-data caching workloads the results aren't that promising. However, with the increasing usage of JSON/Search workloads and our recent introduction of modules for those, we should see larger datasets being stored in Valkey. We could potentially see a lot of benefit with such usage (see data set 2 in the benchmark stats above). So there are some gains to be had with this approach as well.

Would love to hear others' opinions about this in the community. If anyone has a real-world data set to benchmark against, feel free to share it.

hpatro avatar Apr 17 '25 01:04 hpatro

Discussion points in the weekly meeting:

  1. LZ4 dependency concern: one C file and one header file. It should be safe to include it in the repo, but do we want to be responsible for vendoring it? @PingXie is going to look into whether we should vendor it.
  2. Check integration at the RIO layer to simplify the implementation.
  3. The feature should minimize the number of configurations. We will need an RDB opcode bump, which we will get with 9.0. Replicas and primaries should negotiate whether to use the compression.

Approved at a high level to move toward implementation.

madolson avatar May 12 '25 15:05 madolson

Thanks for discussing this.

@madolson @sarthakaggarwal97 Did we discuss the batching mechanism proposed above?

hpatro avatar May 13 '25 15:05 hpatro

@hpatro yes, we discussed the batching mechanism; no major concerns were raised. @madolson asked whether we can use the chunks already formed in the RIO layer (so that modules can also benefit), which is something I will explore a bit. If not, we can incrementally add batching and dictionary support per data type.

sarthakaggarwal97 avatar May 14 '25 22:05 sarthakaggarwal97

LZ4 dependency concern: one C file and one header file. It should be safe to include it in the repo, but do we want to be responsible for vendoring it? @PingXie is going to look into whether we should vendor it.

I am OK with vendoring LZ4. It is a small file. From looking at the past CVEs, it appears to be a very stable component (2 CVEs in the past 6 years: 1 in 2019 and 1 in 2021).

Also, I think we should set up a "release watch" on the LZ4 project so we don't have to "poll" its status (and that of other vendored dependencies).

PingXie avatar May 19 '25 18:05 PingXie