
Improve content store compression using preset dictionaries

Open jan-niestadt opened this issue 2 years ago • 1 comment

zlib supports preset dictionaries, which are a way to improve compression if you know something about the structure of your data ahead of time. See https://www.ietf.org/rfc/rfc1950.txt
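A minimal sketch of the mechanism, using Python's `zlib` module for brevity. The sample dictionary and block are made up for illustration and are not taken from BlackLab; the helper names are hypothetical as well.

```python
import zlib

# Hypothetical sample data: strings we expect to recur, and one small block.
dictionary = b'<s></s><w lemma="" pos="NOUN"></w><w lemma="" pos="DET"></w>'
block = b'<s><w lemma="the" pos="DET">The</w><w lemma="cat" pos="NOUN">cat</w></s>'


def compress_with_dict(data: bytes, zdict: bytes) -> bytes:
    """Compress data with a preset dictionary (RFC 1950 FDICT)."""
    comp = zlib.compressobj(level=9, zdict=zdict)
    return comp.compress(data) + comp.flush()


def decompress_with_dict(data: bytes, zdict: bytes) -> bytes:
    """Decompress data that was compressed with the same preset dictionary."""
    decomp = zlib.decompressobj(zdict=zdict)
    return decomp.decompress(data) + decomp.flush()


plain = zlib.compress(block, 9)
primed = compress_with_dict(block, dictionary)

# The reader must supply the identical dictionary to get the data back.
assert decompress_with_dict(primed, dictionary) == block
print("without dictionary:", len(plain), "bytes")
print("with dictionary:   ", len(primed), "bytes")
```

The stream only records a checksum of the dictionary, not the dictionary itself, so the same bytes have to be available when reading. Java's `java.util.zip.Deflater` and `Inflater` expose the same mechanism through `setDictionary`.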

In our case, using part of the first document stored as the preset dictionary for each block in the content store would probably improve the compression ratio.

jan-niestadt, Sep 05 '22 11:09

(comment in doc/index-formats/integrated.md:) A reasonable approach could be to take a chunk from the middle of the first file added (the middle, to increase the chance that we're inside actual text rather than metadata) and use that as the dictionary for the entire segment. This should ensure that common strings (XML tags, attributes, common words, and so on) are stored more efficiently in each block.
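The idea above could be sketched roughly as follows, again in Python for brevity. The function names, the 4096-byte dictionary size, and the block layout are all assumptions made for illustration; none of them come from BlackLab's content store.

```python
import zlib

# Hypothetical size. Deflate's 32 KiB window is the upper limit on how much
# of a dictionary can actually be referenced.
DICT_SIZE = 4096


def middle_chunk(first_doc: bytes, size: int = DICT_SIZE) -> bytes:
    """Return up to `size` bytes centred on the middle of the document."""
    start = max(0, (len(first_doc) - size) // 2)
    return first_doc[start:start + size]


def compress_blocks(blocks, zdict: bytes):
    """Compress each block independently, priming each with the same dictionary."""
    compressed = []
    for block in blocks:
        comp = zlib.compressobj(zdict=zdict)
        compressed.append(comp.compress(block) + comp.flush())
    return compressed


def decompress_block(data: bytes, zdict: bytes) -> bytes:
    """Decompress one block; needs the dictionary used when it was written."""
    decomp = zlib.decompressobj(zdict=zdict)
    return decomp.decompress(data) + decomp.flush()
```

Because every block is compressed on its own, any block can still be read without touching the others, as long as the segment's dictionary is stored somewhere alongside the blocks.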

jan-niestadt, Sep 05 '22 11:09