[WIP] clp-s: Implement table packing

Open gibber9809 opened this issue 7 months ago • 0 comments

Description

This PR implements table packing: small tables are combined into a single compression stream until their combined size reaches a threshold, which avoids producing many tiny compression streams. This helps avoid outliers in compression ratio, particularly when we enable features like array structurization, which can create many small tables.

To make this PR easier to review, the decompression half of the changes has been reverted until the compression side is reviewed.

On the compression side, the key differences are that (1) SchemaWriter now keeps track of the total in-memory size of the table it owns instead of determining it after writing to a compression stream; (2) before compression, tables are sorted by that in-memory size, and smaller tables are packed together in sequence until their combined size reaches a certain threshold; and (3) table metadata has been changed to accommodate table packing.
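As a rough illustration of point (2), the packing step can be sketched in Python as below. This is not the clp-s implementation (which is C++); the `Table` type, the `pack_tables` name, and the choice to close a stream as soon as its combined size meets or exceeds the threshold are assumptions made for the sketch.

```python
from dataclasses import dataclass


@dataclass
class Table:
    # Hypothetical stand-in for a schema table; not a clp-s type.
    schema_id: int
    size: int  # in-memory size in bytes, as tracked before compression


def pack_tables(tables, threshold):
    """Group tables into compression streams, smallest tables first.

    Assumption: a stream is closed once its combined in-memory size
    reaches the threshold, so tables at or above the threshold end up
    alone in their own stream.
    """
    streams = []
    current = []
    current_size = 0
    for table in sorted(tables, key=lambda t: t.size):
        current.append(table)
        current_size += table.size
        if current_size >= threshold:
            streams.append(current)
            current, current_size = [], 0
    if current:
        # Leftover small tables that never reached the threshold.
        streams.append(current)
    return streams
```

With a threshold of 10 and tables of sizes 5, 100, 3, 7, and 50, the three smallest tables share one stream and the two large tables each get their own.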

Note: this PR deliberately leaves the uncompressed size of individual schema tables out of the table metadata. The uncompressed size can be derived from other metadata we do store, and storing it in addition to metadata offsets would actually increase the amount of work needed to check that an archive isn't corrupt while decompressing it.
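One way such a derivation could work is sketched below. The metadata layout here is an assumption for illustration only: each table is taken to record the stream it lives in and its offset within that stream's uncompressed data, and each stream is taken to record its total uncompressed size. The actual fields in the PR may differ.

```python
def derive_table_sizes(table_entries, stream_sizes):
    """Derive each table's uncompressed size from offsets alone.

    table_entries: iterable of (schema_id, stream_id, offset) tuples
                   (hypothetical layout, not the clp-s format).
    stream_sizes:  dict mapping stream_id -> total uncompressed size.
    """
    by_stream = {}
    for schema_id, stream_id, offset in table_entries:
        by_stream.setdefault(stream_id, []).append((offset, schema_id))

    sizes = {}
    for stream_id, entries in by_stream.items():
        entries.sort()
        # A table ends where the next table starts; the last table
        # ends at the end of the stream.
        ends = [offset for offset, _ in entries[1:]]
        ends.append(stream_sizes[stream_id])
        for (offset, schema_id), end in zip(entries, ends):
            if end < offset:
                raise ValueError("corrupt metadata: offset past end of stream")
            sizes[schema_id] = end - offset
    return sizes
```

Under this layout there is only one source of truth for each size, so there is no separately stored size field that would have to be cross-checked against the offsets.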

Validation performed

  • Validated that this PR fixes bad compression ratio outliers during array structurization
  • Validated that performance seems to be within variance compared to before this change

gibber9809 · Jul 02 '24 15:07