vortex
Epic: strings
- [ ] #2653
- [ ] VarBinView builder to have option to de-duplicate (replaces StringDictBuilder)
- [ ] DictLayout to share a dictionary across chunks (probably requires an is_in expression)
- [ ] Look into whether FSSTView array makes sense (vs storing FSST data in a VarBin)
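
To make the de-duplication item concrete, here is a rough sketch of what an opt-in de-duplicating view builder could look like. `DedupViewBuilder`, `View`, and every method below are hypothetical names for illustration, not Vortex's actual API, and the real VarBinView layout (inlined short strings, multiple buffers) is deliberately ignored. The only idea shown is that repeated values reuse an existing view instead of appending their bytes again.

```rust
use std::collections::HashMap;

/// Hypothetical view: a byte range inside the builder's single buffer.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct View {
    pub offset: u32,
    pub len: u32,
}

/// Hypothetical builder (not Vortex's API) with an opt-in de-duplication flag.
pub struct DedupViewBuilder {
    dedup: bool,
    /// Strings already written, mapped to the view that covers them.
    seen: HashMap<String, View>,
    /// Backing bytes shared by all views.
    bytes: Vec<u8>,
    /// One view per appended value.
    views: Vec<View>,
}

impl DedupViewBuilder {
    pub fn new(dedup: bool) -> Self {
        Self { dedup, seen: HashMap::new(), bytes: Vec::new(), views: Vec::new() }
    }

    pub fn append(&mut self, s: &str) {
        if self.dedup {
            // Reuse the existing view if this exact string was written before.
            if let Some(&view) = self.seen.get(s) {
                self.views.push(view);
                return;
            }
        }
        let view = View { offset: self.bytes.len() as u32, len: s.len() as u32 };
        self.bytes.extend_from_slice(s.as_bytes());
        if self.dedup {
            self.seen.insert(s.to_owned(), view);
        }
        self.views.push(view);
    }

    /// Read back the value at position `i`.
    pub fn get(&self, i: usize) -> &str {
        let v = self.views[i];
        let start = v.offset as usize;
        std::str::from_utf8(&self.bytes[start..start + v.len as usize]).expect("valid UTF-8")
    }

    /// Number of bytes actually stored.
    pub fn byte_len(&self) -> usize {
        self.bytes.len()
    }
}
```

With `dedup` off the builder behaves like a plain append-only buffer, so the flag only costs a hash lookup per value when it is enabled.
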
DuckDB's dictionary-encoded strings are a bit funky because they rely on selection vectors. Essentially, they have a varbin string (offsets and bytes) plus codes (via the selection vector). Large strings get special handling: they are kept out of the dictionary and are indicated by a negative offset value. For FSST arrays they do a small optimization: if the strings are small (ALL <= 12 bytes), they allocate 24 bytes per value and decompress the FSST values directly into the symbol table.
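
As a toy model of the layout described above (offsets and bytes for the dictionary, codes for the rows, a negative offset marking a string kept outside the dictionary), something like the following might help when discussing it. This is not DuckDB's real on-disk or in-memory format: the struct, its field names, the separate `lens` vector, and the `-(k + 1)` encoding of the overflow index are all assumptions made for illustration.

```rust
/// Toy model only; all names and the exact encoding are illustrative assumptions.
pub struct DictStrings {
    /// Concatenated bytes of the dictionary entries stored inline.
    pub bytes: Vec<u8>,
    /// Start of each dictionary entry in `bytes`; a negative value `-(k + 1)`
    /// means the entry lives in `overflow[k]` instead.
    pub offsets: Vec<i32>,
    /// Length of each inline entry (ignored for overflow entries).
    pub lens: Vec<u32>,
    /// Large strings kept outside the dictionary bytes.
    pub overflow: Vec<String>,
    /// One code per row, playing the role of the selection vector.
    pub codes: Vec<u32>,
}

impl DictStrings {
    /// Resolve the string for a given row.
    pub fn value(&self, row: usize) -> &str {
        let entry = self.codes[row] as usize;
        let offset = self.offsets[entry];
        if offset < 0 {
            // -1 maps to overflow[0], -2 to overflow[1], and so on.
            &self.overflow[(-(offset + 1)) as usize]
        } else {
            let start = offset as usize;
            let end = start + self.lens[entry] as usize;
            std::str::from_utf8(&self.bytes[start..end]).expect("valid UTF-8")
        }
    }
}
```

The point of the sketch is the branch in `value`: any equivalent design on our side needs some out-of-band marker for values that should not bloat the shared dictionary.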