feat: Support dictionary encoding of values
In many cases, a "large" value (for example, a string) may be used in multiple operations, across multiple entities. Currently, Kaskada must keep a copy of the value for each operation and entity. A more efficient solution would be to leverage Arrow's dictionary-encoded memory layout, which stores each distinct value once and represents each occurrence as a small integer key into the shared values array. The shared values could then be referenced across operations and entities, substantially reducing the size of both the computation state and the final output.
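For context, a minimal sketch of this layout at the Arrow level using arrow-rs (the Int32 key width and the values are illustrative): a `DictionaryArray` stores each distinct string once and keeps only a small integer key per row.

```rust
use arrow::array::DictionaryArray;
use arrow::datatypes::Int32Type;

fn main() {
    // Four rows, but only two distinct strings: the values buffer holds
    // "large-value" and "other" once, and each row stores a 4-byte key.
    let dict: DictionaryArray<Int32Type> =
        vec!["large-value", "large-value", "other", "large-value"]
            .into_iter()
            .collect();

    assert_eq!(dict.keys().len(), 4);   // one key per row
    assert_eq!(dict.values().len(), 2); // distinct values stored once
}
```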
Two syntax variants have been discussed, one that presents like a function:
```
let dict_of_foo = dict(Foo)
```
...and another that presents like a type cast:
```
let dict_of_foo = Foo as dict<string>
```
The downside of the latter is that it potentially requires more knowledge of the type system than the former. The downside of the former is that it presents as a function but behaves like a type cast, potentially causing confusion about when to use `as` versus other ways of coercing types.
I wouldn't read the first as a cast -- I would read it as "create the dictionary-encoded version of Foo" (so treat it as a constructor). Then it is no different from the functions we've talked about for creating a list (`list(5, 6, 7)`, for instance).
Regarding the ticket -- I think this should be split into two items. Specifically:
- Using Arrow dictionary encoding. Note that this only affects the representation of the batches -- that is, the I/O at shuffle boundaries and the final output (and not all output formats will support it -- this is likely limited to Parquet; see the sketch after this list).
- Using something like dictionary encoding for operator state. That currently uses a completely separate serialization to write to RocksDB, so it should be an independent work item.
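For the first item, here is a minimal sketch of preserving the encoding through to Parquet output with arrow-rs and the parquet crate (the `foo` column name, Int32 key type, and values are hypothetical, not Kaskada's actual schema):

```rust
use std::sync::Arc;
use arrow::array::{ArrayRef, DictionaryArray};
use arrow::datatypes::{DataType, Field, Int32Type, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;

fn write_dict_batch(file: std::fs::File) -> Result<(), Box<dyn std::error::Error>> {
    // A dictionary-encoded string column; Parquet's dictionary pages can
    // round-trip this layout, whereas text formats like CSV would have to
    // materialize every value.
    let col: DictionaryArray<Int32Type> = vec!["a", "a", "b"].into_iter().collect();
    let schema = Arc::new(Schema::new(vec![Field::new(
        "foo",
        DataType::Dictionary(Box::new(DataType::Int32), Box::new(DataType::Utf8)),
        false,
    )]));
    let batch = RecordBatch::try_new(schema.clone(), vec![Arc::new(col) as ArrayRef])?;

    let mut writer = ArrowWriter::try_new(file, schema, None)?;
    writer.write(&batch)?;
    writer.close()?;
    Ok(())
}
```

Keeping the two items separate matches the split above: this sketch only touches batch representation and output, not the RocksDB serialization used for operator state.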