Epic: strings

Open • gatesn opened this issue 9 months ago • 1 comment

  • [ ] #2653
  • [ ] VarBinView builder to have option to de-duplicate (replaces StringDictBuilder)
  • [ ] DictLayout to share a dictionary across chunks (probably requires an is_in expression)
  • [ ] Look into whether FSSTView array makes sense (vs storing FSST data in a VarBin)
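To make the de-duplication item concrete, here is a minimal sketch of a view-style builder with an opt-in `dedup` flag. Everything in it (`DedupViewBuilder`, the `(offset, length)` view tuple, the single data buffer) is a hypothetical illustration and not the Vortex API; a real VarBinView also inlines short strings and spans multiple buffers, which this sketch ignores.

```python
from typing import Dict, List, Tuple

View = Tuple[int, int]  # (offset, length) into the shared data buffer


class DedupViewBuilder:
    """Hypothetical view builder; NOT the Vortex VarBinView builder."""

    def __init__(self, dedup: bool = True) -> None:
        self.dedup = dedup
        self.data = bytearray()            # all string bytes, back to back
        self.views: List[View] = []        # one view per appended row
        self._seen: Dict[bytes, View] = {}  # value -> view of its first copy

    def append(self, value: bytes) -> None:
        # With dedup on, reuse the view of an identical earlier value.
        view = self._seen.get(value) if self.dedup else None
        if view is None:
            view = (len(self.data), len(value))
            self.data.extend(value)
            if self.dedup:
                self._seen[value] = view
        self.views.append(view)

    def get(self, row: int) -> bytes:
        offset, length = self.views[row]
        return bytes(self.data[offset:offset + length])


rows = [b"apple", b"pear", b"apple", b"fig", b"pear"]

deduped = DedupViewBuilder(dedup=True)
plain = DedupViewBuilder(dedup=False)
for s in rows:
    deduped.append(s)
    plain.append(s)
```

With `dedup=True` repeated values share bytes, so the views do the job a separate dictionary-plus-codes pair would otherwise do, which is presumably why such an option could replace a dedicated dictionary builder.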

gatesn avatar Mar 11 '25 13:03 gatesn

DuckDB's dictionary-encoded strings are a bit funky since they rely on selection vectors. Essentially they have a varbin string array (offsets and bytes) plus codes (via the selection vector). There's special handling for large strings: they are not stored in the dictionary and are indicated by a negative offset value. For FSST arrays they do a small optimization where, if the strings are small (ALL <= 12 bytes), they allocate 24 bytes per value and decompress the FSST values directly into the symbol table.
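As a rough illustration of the offsets + bytes + codes shape described above, the sketch below decodes rows by looking each code up in a varbin-style dictionary. The function name and layout are assumptions made for illustration; this is not DuckDB's actual format, and it does not model the negative-offset handling for large strings or the FSST path.

```python
from typing import List, Sequence


def decode_dict_strings(
    offsets: Sequence[int], data: bytes, codes: Sequence[int]
) -> List[bytes]:
    """Hypothetical decoder: entry i of the dictionary is data[offsets[i]:offsets[i + 1]]."""
    return [bytes(data[offsets[c]:offsets[c + 1]]) for c in codes]


# Dictionary with three entries: "apple", "pear", "fig".
data = b"applepearfig"
offsets = [0, 5, 9, 12]

# Codes play the role of the selection vector: one dictionary index per row.
codes = [2, 0, 0, 1]

decoded = decode_dict_strings(offsets, data, codes)
```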

robert3005 avatar Mar 18 '25 16:03 robert3005