BlackLab
Improve content store compression using preset dictionaries
zlib supports preset dictionaries, a way to improve compression when you know something about the structure of your data ahead of time. See https://www.ietf.org/rfc/rfc1950.txt
In our case, using part of the first document stored as the preset dictionary for each block in the content store would probably improve the compression ratio.
(Comment in doc/index-formats/integrated.md:) A reasonable approach could be to take a chunk from the middle of the first file added (middle to increase the chance we're inside actual text, not metadata) and use that as the dictionary for the entire segment. This should ensure common strings (e.g. XML tags, attributes, common words, etc.) are stored more efficiently in each block.
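To make the idea concrete, here is a minimal sketch in Python (chosen for brevity; in Java the corresponding calls would be `Deflater.setDictionary` and `Inflater.setDictionary`). The helper names `pick_dictionary`, `compress_block` and `decompress_block` are hypothetical, as is the 32 KiB default: it is an assumption based on deflate's 32 KiB window, not a tuned value, and none of this reflects how the content store is actually implemented.

```python
import zlib


def pick_dictionary(first_doc: bytes, size: int = 32 * 1024) -> bytes:
    """Return up to `size` bytes taken from the middle of the first document.

    Taking the middle is a heuristic to land in running text rather than
    in a metadata header at the start of the file.
    """
    size = min(size, len(first_doc))
    start = (len(first_doc) - size) // 2
    return first_doc[start:start + size]


def compress_block(block: bytes, zdict: bytes) -> bytes:
    """Compress one block, priming the compressor with the preset dictionary."""
    compressor = zlib.compressobj(level=9, zdict=zdict)
    return compressor.compress(block) + compressor.flush()


def decompress_block(data: bytes, zdict: bytes) -> bytes:
    """Decompress one block; the same dictionary must be supplied again."""
    decompressor = zlib.decompressobj(zdict=zdict)
    return decompressor.decompress(data) + decompressor.flush()
```

Note that the dictionary itself is not part of the compressed output, so it would have to be stored once per segment and be available whenever a block is read. Whether the gain is worth that extra bookkeeping would need to be measured on real data.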