Feather.jl
Handling categorical values from Arrow
Given the way Arrow treats nominal variables, maybe it would be cleaner to read them in as `PooledArray` rather than `CategoricalArray`, because they are essentially a `PooledArray`, and we have recently been considering adding more support for this type in DataFrames.jl.
CC @nalimilan
Good question. Looking at the docs, it seems that levels in what Arrow calls a "dictionary encoded" column can appear in an arbitrary order, which we could consider as significant or not. The answer to that question should determine whether to return a `CategoricalArray` (order is meaningful) or a `PooledArray` (order is an implementation detail).
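To make the distinction concrete, here is a minimal sketch, assuming current versions of CategoricalArrays.jl and PooledArrays.jl. The sample values are made up for illustration. It shows that the level order of a `CategoricalArray` is part of its user-facing semantics, while a `PooledArray` only promises the values themselves.

```julia
using CategoricalArrays, PooledArrays

# Illustrative data; the labels are arbitrary.
vals = ["low", "high", "low"]

# An ordered CategoricalArray: the level order is set explicitly
# and drives comparisons between elements.
c = categorical(vals; ordered=true)
levels!(c, ["low", "high"])
@show levels(c)
@show c[1] < c[2]   # compares by level order, not alphabetically

# A PooledArray stores the same values compactly; the order of its
# internal pool is an implementation detail and carries no meaning.
p = PooledArray(vals)
@show p == vals
```

If the dictionary order in an Arrow column is not meant to be significant, the second representation loses nothing.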
I guess a good way to assess this is to see whether saving a factor from R and loading it again preserves the custom order of levels. I think this also applies to Pandas.
You can check in Julia that saving a `CategoricalArray` using Feather.jl and loading it back retains all levels (even those not present in the vector; it is enough that they are present in `levels`) but does not keep their order.
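A rough way to run that check, assuming Feather.jl, DataFrames.jl and CategoricalArrays.jl are installed. The column name `x`, the sample values, and the temporary file path are made up for illustration, and the result may depend on the package versions in use.

```julia
using CategoricalArrays, DataFrames, Feather

# A categorical vector with a custom level order and an unused level "c".
x = categorical(["b", "a", "b"])
levels!(x, ["c", "b", "a"])

df = DataFrame(x = x)

# Round-trip through a temporary Feather file.
path = tempname() * ".feather"
Feather.write(path, df)
df2 = Feather.read(path)

# Compare the levels before and after the round trip.
@show levels(df.x)
@show levels(df2.x)
```

Comparing the two printed vectors shows whether the unused level survives and whether the custom order is kept.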