How can we get symbols to be written as dictionary-encoded strings?
It seems the mechanism for doing this is already in the library. For instance, given a symbol vector:
```q
sym:`a`b`c`a`a`c
dvalues:distinct sym
indices:dvalues?sym
/ ideally we would use the smallest type that can support the number of distinct symbols:
mt:(.arrowkdb.dt[`int8`int16`int32`int64])!im:floor 2 xexp 0 7 15 31
mkt:4 5 6 7h!im
indextype:mt bin c:count dvalues
indexktype:mkt bin c
datatype_symbol:.arrowkdb.dt.dictionary[.arrowkdb.dt.utf8[];indextype[]]
/ we can even pretty print the type we want:
.arrowkdb.ar.prettyPrintArray[datatype_symbol;(dvalues;indexktype$indices);::]
```
But what's not clear is how to enhance the current inferSchema to do this calculation. As a result, tables that contain symbols are currently not the same after a round trip: all the symbols come back cast to strings.
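For reference, the index-type selection can be wrapped in a small helper. This is only a sketch in plain q: `dictEncode` is a hypothetical name, not part of arrowkdb, and the kdb+ type numbers are the ones paired with int8 to int64 in the snippet above. Unlike that snippet, it keys the limits on the largest index rather than on the count, so exactly 128 distinct values still fit the narrowest type.

```q
/ dictEncode is a hypothetical helper, not part of arrowkdb
/ returns (distinct values; indices cast to the narrowest integer ktype)
dictEncode:{[s]
  d:distinct s;
  i:d?s;
  / largest index that must be representable (0 for an empty input)
  m:0|-1+count d;
  / kdb+ type numbers paired above with int8, int16, int32, int64
  kts:4 5 6 7h;
  / smallest maximum index at which each type becomes necessary
  lims:0 128 32768 2147483648;
  (d;kts[lims bin m]$i)}

dictEncode`a`b`c`a`a`c   / (`a`b`c;0x000102000002)
```

The returned pair has the same (values; indices) shape that is passed to prettyPrintArray above.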
Yeah, that's a tricky one. One simple solution would be to have an option to decode utf8 as symbols, but it would apply to all columns in the table, which I suspect is not what you want. Another possibility would be to have a utf8_as_symbol option which only applies to dictionary keys. I think that's possible, but again it would apply to all dictionary columns in the table.
I suspect that in many use cases, always treating dictionary-encoded strings as symbols is exactly the behavior you want. If you have other dictionary-encoded types (i.e. if you have Uint64 values but only use a few of them, so they are dictionary encoded), those would probably end up turning back into Uint64, but that seems fine.
I agree that decoding dictionary utf8 values as symbols is a reasonable use case. However, I'm not sure it could be the default: users generally don't like decoding to symbols by default because of symbol bloat. But I think a DICTIONARY_UTF8_VALUES_AS_SYMBOL option could work for you.
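Whether or not such an option is added, the same conversion can be sketched on the q side after reading. The sketch below assumes a dictionary array arrives as a two-item list of (values; indices), with the values as strings; `dictToSym` is a hypothetical name, not an arrowkdb function.

```q
/ dictToSym is a hypothetical helper, not part of arrowkdb
/ x is assumed to be (values;indices): values a list of strings, indices integers
dictToSym:{[x]
  / cast the dictionary values, not the full column, to symbols
  vals:`$x 0;
  / indices may arrive as bytes or shorts, so widen them before indexing
  vals[`long$x 1]}

dictToSym(string`a`b`c;0x000102000002)   / `a`b`c`a`a`c
```

Only the distinct values are interned as symbols, which keeps the conversion cheap for long, repetitive columns.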
A fun extra you can add to your dictionary encoding is to exclude the null from the dictionary and use a null bitmask to override the find result. These will convert on the other side to a dictionary with a nullity mask for the index. That way you can have separate nullity checks for symbols and character vectors that are isomorphic to the kdb+ types, and your types will all round trip as you expect.
This is also nice because it means the plain utf8 type can skip a nullity vector.
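A rough illustration of that idea in plain q, leaving out any arrowkdb calls: the variable names are made up for the example, and the placeholder index 0 written at null positions is an arbitrary choice, since the mask overrides it when decoding.

```q
sym:`a``b`a``c                         / symbol vector with two nulls
isnull:null sym                        / nullity mask, 1b where sym is null
dvalues:distinct sym where not isnull  / dictionary without a null entry
/ find returns count dvalues for the nulls, so the mask overrides it
indices:?[isnull;0;dvalues?sym]
/ decoding: positions flagged in the mask become null again
decoded:?[isnull;`;dvalues indices]
decoded~sym                            / 1b
```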