Add support for dictionaries shared across multiple columns
One interesting feature from the F3 paper is dictionaries shared across a combination of columns. The extreme version of this would be a single dictionary referenced by all the columns. There is some related work in the C3 repo as well: https://github.com/cwida/C3
I am assuming this is something that the vortex layout could accommodate. Any pointers on how to approach this with the current extensibility layers?
Hi @aditanase !
Thanks for creating this issue! Issue #2657 tracks (a portion of) our string wishlist.
In the single-column case, what you've described above is implemented as the DictLayout (see this folder in vortex-layout). The dictionary layout has two child layouts: values and codes. The values child is the dictionary, and the codes child holds indices into it. The codes can be (and, indeed, in the default btrblocks-style compressor are) stored as a ChunkedLayout, which permits either streamed or partitioned reading of the codes separately from the values.
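To make the values/codes split concrete, here is a minimal sketch in plain Python. It is not the Vortex API: `dict_encode`, `chunk_codes`, and `decode_chunk` are hypothetical helpers, and the chunk size is arbitrary. It only illustrates why the codes can be read chunk by chunk while the dictionary is read once.

```python
from typing import Dict, List, Tuple


def dict_encode(column: List[str]) -> Tuple[List[str], List[int]]:
    """Split a column into a dictionary (values) and indices into it (codes)."""
    values: List[str] = []
    index: Dict[str, int] = {}
    codes: List[int] = []
    for item in column:
        if item not in index:
            # First time we see this value: append it to the dictionary.
            index[item] = len(values)
            values.append(item)
        codes.append(index[item])
    return values, codes


def chunk_codes(codes: List[int], chunk_size: int) -> List[List[int]]:
    """Stand-in for a chunked codes child: fixed-size slices of the codes."""
    return [codes[i:i + chunk_size] for i in range(0, len(codes), chunk_size)]


def decode_chunk(values: List[str], code_chunk: List[int]) -> List[str]:
    """Decode one chunk of codes without touching the other chunks."""
    return [values[code] for code in code_chunk]


column = ["red", "green", "red", "blue", "green", "red"]
values, codes = dict_encode(column)
code_chunks = chunk_codes(codes, chunk_size=4)

# Only the second chunk of codes is decoded here; the dictionary is shared.
second_chunk = decode_chunk(values, code_chunks[1])
```

The dictionary is built once for the whole column, so any single chunk of codes can be decoded on its own as long as the values are available.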
> The extreme version of this would be a single dictionary referenced by all the columns.
Yeah, this would be very cool! We are not currently working on that, though we're aware of the F3 paper [1]. The Vortex community is eager to welcome new open source contributors! I think the best way to get started is to propose a design. There's also now a Slack community you can join here.
The DictLayout is probably the best place to start. A MultiColumnDictLayout should look similar. Maybe it's exactly a DictLayout where the codes are required to be a StructLayout? That might require some kind of MergeLayout to stitch together the non-multi-column-dict columns with the multi-column-dict columns.
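As a rough illustration of that shape (again plain Python, not Vortex code), the sketch below keeps one shared values list and a struct-like mapping from column name to codes, then stitches the decoded columns back together with a column that is not dictionary-encoded. The names `shared_dict_encode` and `merge_columns` and the sample data are made up for this example; how a real MultiColumnDictLayout or MergeLayout would work is an open design question.

```python
from typing import Dict, List, Tuple


def shared_dict_encode(
    columns: Dict[str, List[str]],
) -> Tuple[List[str], Dict[str, List[int]]]:
    """Encode several columns against one shared dictionary.

    Returns the shared values and a struct-like mapping of per-column codes.
    """
    values: List[str] = []
    index: Dict[str, int] = {}
    codes_by_column: Dict[str, List[int]] = {}
    for name, column in columns.items():
        codes: List[int] = []
        for item in column:
            if item not in index:
                # New value across all columns seen so far.
                index[item] = len(values)
                values.append(item)
            codes.append(index[item])
        codes_by_column[name] = codes
    return values, codes_by_column


def merge_columns(
    values: List[str],
    codes_by_column: Dict[str, List[int]],
    plain_columns: Dict[str, list],
) -> Dict[str, list]:
    """Stitch decoded shared-dictionary columns together with plain columns."""
    table: Dict[str, list] = {
        name: [values[code] for code in codes]
        for name, codes in codes_by_column.items()
    }
    table.update(plain_columns)
    return table


# Hypothetical sample data: two string columns that share most of their values.
string_columns = {
    "origin": ["NYC", "SFO", "NYC"],
    "destination": ["SFO", "NYC", "LAX"],
}
plain_columns = {"delay_min": [5, 0, 12]}

shared_values, codes_by_column = shared_dict_encode(string_columns)
table = merge_columns(shared_values, codes_by_column, plain_columns)
```

In this toy case the two string columns share a three-entry dictionary instead of carrying two separate ones, which is the saving the shared-dictionary idea is after.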
[1] For anyone else stumbling on this issue, the paper is: Zeng, et al., "F3: The Open-Source Data File Format for the Future" https://db.cs.cmu.edu/papers/2025/zeng-sigmod2025.pdf .