(de)serialization behavior of `missing`/`nothing`

Open jrevels opened this issue 4 years ago • 1 comments

In Julia, there is (generally) a useful/meaningful semantic distinction between `nothing` and `missing`. IIUC, Arrow doesn't really have equivalent values that capture this distinction, but instead has a single `null` which might be used for either. This leaves us with a bit of an impedance mismatch to resolve when (de)serializing `nothing`/`missing` data.
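
For readers less familiar with the distinction, here is a minimal sketch using only Base Julia (no Arrow.jl calls). The variable names are purely illustrative:

```julia
# `missing` propagates through most operations.
propagated = missing + 1

# `nothing` does not propagate: arithmetic on it raises a MethodError.
nothing_threw = try
    nothing + 1
    false
catch err
    err isa MethodError
end

# Each value has its own "fill in a default" helper, and neither helper
# treats the other value as absent.
filled_missing = coalesce(missing, 0)    # skips `missing`
filled_nothing = something(nothing, 0)   # skips `nothing`
kept_nothing   = coalesce(nothing, 0)    # `nothing` is not `missing`, so it is returned
```

Collapsing both values into one Arrow `null` therefore changes how downstream code behaves, not just how the column prints.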

The current behavior feels like it "resolves" the impedance mismatch just by tossing this information altogether and normalizing to a single value, but the value it chooses to normalize to feels weird to me:

julia> Arrow.Table(Arrow.tobuffer((x = [missing, missing],))).x
2-element Arrow.NullVector{Missing}:
 missing
 missing

julia> Arrow.Table(Arrow.tobuffer((x = [nothing, nothing],))).x
2-element Arrow.NullVector{Nothing}:
 nothing
 nothing

julia> Arrow.Table(Arrow.tobuffer((x = [nothing, missing],))).x
2-element Arrow.NullVector{Nothing}:
 nothing
 nothing
 
 julia> Arrow.Table(Arrow.tobuffer((x = Any[nothing, missing],))).x
2-element Arrow.NullVector{Missing}:
 missing
 missing

It seems to me like Arrow.jl should either:

  1. find some way to consistently preserve this distinction in all cases when (de)serializing Julia data (e.g. so that `[nothing, missing]` would roundtrip as `[nothing, missing]`)
  2. lean all-in on dropping the distinction, and force callers to pick, at read time, which Julia value (e.g. `nothing` or `missing`) incoming Arrow nulls should be interpreted as.
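
As a rough sketch of what option 2 could look like from the caller's side, the helper below normalizes an already-loaded column to a caller-chosen value. `normalize_nulls` and its `null` keyword are hypothetical names invented for illustration; they are not part of Arrow.jl, and a real implementation would presumably make this choice inside the reader rather than as a post-processing step:

```julia
# Hypothetical helper (not an Arrow.jl API): map every null-like value in a
# column to the single value the caller asked for.
function normalize_nulls(col::AbstractVector; null=missing)
    return [(x === missing || x === nothing) ? null : x for x in col]
end

col = Any[nothing, missing, 1]

as_missing = normalize_nulls(col)                # nulls become `missing`
as_nothing = normalize_nulls(col; null=nothing)  # nulls become `nothing`
```

With something like this, the result no longer depends on the element type of the vector that was originally written, which is the inconsistency shown in the REPL output above.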

jrevels avatar Nov 09 '21 21:11 jrevels

FWIW option 2 is the approach JSON.jl takes with the `null` keyword argument to `parse`/`parsefile`. It has a default value of `nothing` but you can pass `null=missing`. This seems like a reasonable approach to me for Arrow to take.

ararslan avatar Nov 09 '21 21:11 ararslan