Question: How to better train a dictionary in a specific scenario
Hi, team~ I have a specific scenario: there are many chunks of varying sizes, ranging from 1 KB to 4 MB, most of which are smaller than 64 KB. The total size of the chunks is about 1 GB. I want to use zstd's dictionary mode to compress these chunks, since dictionary mode is better suited to small-data scenarios. How can I train the dictionary to achieve a better compression ratio?
My intuitive idea is to divide the chunks into different buckets, take a representative chunk from each bucket to form a training set, and then use that training set to train the dictionary.
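To make the idea concrete, here is a rough sketch of the kind of bucket rule I have in mind. The size thresholds are placeholders I picked for illustration, not values zstd requires:

```c
#include <stddef.h>

/* Hypothetical size-based bucket rule: the thresholds are arbitrary
 * placeholders, only meant to illustrate grouping chunks of similar
 * size so that a training set can be drawn from each bucket. */
static int bucket_of(size_t chunkSize)
{
    if (chunkSize <  4 * 1024)    return 0;  /* tiny   : < 4 KB   */
    if (chunkSize < 64 * 1024)    return 1;  /* small  : < 64 KB  */
    if (chunkSize < 1024 * 1024)  return 2;  /* medium : < 1 MB   */
    return 3;                                /* large  : >= 1 MB  */
}
```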
Besides, is there any better way?
@terrelln @embg @Cyan4973 and all team members: do you have any better way?
The more specialized dictionaries are, the larger the compression ratio benefit. Using buckets to achieve this goal is possible. You'll need to use the same bucket assignment rule for training and for compression.
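To illustrate (a minimal sketch; helper names such as `bucket_of` and `cdicts` are assumptions, not part of the zstd API): the rule that assigns a chunk to a bucket must be computable from information available at compression time, so the compressor can pick the dictionary trained for that bucket.

```c
#include <zstd.h>

#define NB_BUCKETS 4

/* One ZSTD_CDict per bucket, each built from the dictionary trained on
 * that bucket's samples (dictionary creation/loading not shown here). */
extern ZSTD_CDict* cdicts[NB_BUCKETS];

/* The SAME assignment rule used when grouping the training samples. */
extern int bucket_of(size_t chunkSize);

size_t compress_chunk(ZSTD_CCtx* cctx,
                      void* dst, size_t dstCapacity,
                      const void* chunk, size_t chunkSize)
{
    /* Derive the bucket from the chunk itself (here: its size), so
     * training-time and compression-time assignments stay consistent. */
    int const b = bucket_of(chunkSize);
    return ZSTD_compress_usingCDict(cctx, dst, dstCapacity,
                                    chunk, chunkSize, cdicts[b]);
}
```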
Another question: how do we pick the most representative chunk in each bucket? Or do we manually find the content with the highest repetition rate in each bucket, and then use ZDICT_finalizeDictionary() to build the dictionary? @Cyan4973
Hi, could you please help me solve this issue? @Cyan4973 @terrelln, thanks~
Once you have selected a rule to create buckets "on the fly", i.e. a rule that can be applied identically during training and during compression, you can start training a dictionary for each bucket.
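As a minimal sketch of the training step for one bucket (the names are illustrative; it assumes you have already concatenated that bucket's samples back to back and recorded each sample's size):

```c
#include <stdio.h>
#include <zdict.h>

/* Train one dictionary for one bucket.
 * samplesBuffer : all samples of this bucket, concatenated back to back
 * samplesSizes  : the size of each individual sample, in the same order
 * dictCapacity  : e.g. ~110 KB, matching the CLI's default --maxdict
 * Returns the produced dictionary size, or 0 on failure. */
size_t train_bucket_dictionary(void* dictBuffer, size_t dictCapacity,
                               const void* samplesBuffer,
                               const size_t* samplesSizes, unsigned nbSamples)
{
    size_t const dictSize = ZDICT_trainFromBuffer(dictBuffer, dictCapacity,
                                                  samplesBuffer, samplesSizes,
                                                  nbSamples);
    if (ZDICT_isError(dictSize)) {
        fprintf(stderr, "dictionary training failed: %s\n",
                ZDICT_getErrorName(dictSize));
        return 0;
    }
    return dictSize;
}
```

The CLI equivalent is simply running `zstd --train` on the files of one bucket and saving the result with `-o`.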
Finding which content should end up in the dictionary is hard; that's why there is a training stage.
I wouldn't recommend recreating the training logic on your side; it's quite complex. Sure, it's allowed and possible, since ZDICT_finalizeDictionary() can be used to complete the selection job. But unless you have specific insight and know how to exploit it, let zstd --train find the dictionary content for you; it's much easier.
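For completeness, if you do have such an insight and want to hand-pick the dictionary content yourself, the finishing step looks roughly like this (a sketch with illustrative names; on older zstd releases ZDICT_finalizeDictionary() may only be exposed behind ZDICT_STATIC_LINKING_ONLY):

```c
#include <zdict.h>

/* Turn hand-picked raw content into a usable zstd dictionary:
 * ZDICT_finalizeDictionary() adds the entropy tables and header on top
 * of caller-selected content, tuned against the provided samples.
 * Only worthwhile if you already know which bytes belong in the
 * dictionary; otherwise ZDICT_trainFromBuffer() / `zstd --train`
 * remains the easier path. */
size_t finalize_custom_dictionary(void* dstDict, size_t dstCapacity,
                                  const void* rawContent, size_t rawContentSize,
                                  const void* samplesBuffer,
                                  const size_t* samplesSizes, unsigned nbSamples)
{
    ZDICT_params_t params = { 0 };   /* defaults: level, notification, dictID */
    return ZDICT_finalizeDictionary(dstDict, dstCapacity,
                                    rawContent, rawContentSize,
                                    samplesBuffer, samplesSizes, nbSamples,
                                    params);
}
```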
If your question is about "how to define buckets", then I'm afraid there is no single rule. It all depends on which information you can rely upon or rebuild at compression time. Often, messages and packets come from different sources, and that's enough to define a bucket rule. Sometimes categorization must be decided on the spot, for example based on input size or on some easy-to-generate statistic. In any case, there is no "one size fits all" solution.
Closing as the question is answered. Please create a new issue if you have further questions.