arrow-julia
(de)serialization behavior of `missing`/`nothing`
In Julia, there is (generally) a useful/meaningful semantic distinction between `nothing` and `missing`. IIUC, Arrow doesn't really have equivalent values that capture this distinction, but instead has `null`, which might be used for either. This results in a bit of an impedance mismatch for us to resolve when (de)serializing `nothing`/`missing` data.
The current behavior feels like it "resolves" the impedance mismatch just by tossing this information altogether and normalizing to a single value, but the value it chooses to normalize to feels weird to me:
julia> Arrow.Table(Arrow.tobuffer((x = [missing, missing],))).x
2-element Arrow.NullVector{Missing}:
missing
missing
julia> Arrow.Table(Arrow.tobuffer((x = [nothing, nothing],))).x
2-element Arrow.NullVector{Nothing}:
nothing
nothing
julia> Arrow.Table(Arrow.tobuffer((x = [nothing, missing],))).x
2-element Arrow.NullVector{Nothing}:
nothing
nothing
julia> Arrow.Table(Arrow.tobuffer((x = Any[nothing, missing],))).x
2-element Arrow.NullVector{Missing}:
missing
missing
It seems to me like Arrow.jl should either:
- find some way to consistently preserve this distinction in all cases when (de)serializing Julia data (e.g. so that `[nothing, missing]` would roundtrip as `[nothing, missing]`)
- lean all-in on dropping the distinction, and force callers to pick how they want to interpret incoming Arrow `null`s (e.g. as `nothing` or as `missing`) at read time.
FWIW the second option is the approach JSON.jl takes with the `null` keyword argument to `parse`/`parsefile`. It has a default value of `nothing`, but you can pass `null=missing`. This seems like a reasonable approach for Arrow to take as well.
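To make the read-time choice concrete, here is a rough sketch of what it could look like from the caller's side. Note that `normalize_nulls` and its `null` keyword are hypothetical names I made up for illustration; they are not part of Arrow.jl, and the sketch operates on a plain Julia vector rather than on real Arrow data. It only shows the idea of collapsing every null-like value to a single sentinel that the caller picks.

```julia
# Hypothetical helper (NOT an Arrow.jl API): collapse every null-like value
# (`nothing` or `missing`) in a column to the one sentinel the caller chooses.
# The default of `missing` here is an assumption made for this sketch.
function normalize_nulls(col::AbstractVector; null=missing)
    return [(x === nothing || x === missing) ? null : x for x in col]
end

col = Any[nothing, missing, 1]

# The caller decides how nulls are interpreted at read time.
as_missing = normalize_nulls(col)                # nulls become `missing`
as_nothing = normalize_nulls(col; null=nothing)  # nulls become `nothing`
```

With something like this, the result no longer depends on the element type of the vector that happened to be written; the reader states the interpretation explicitly, the same way `null=missing` works for JSON.jl.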