
Additional `Missing` gets injected into Schema

Open Moelf opened this issue 3 years ago • 3 comments

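In the following example, the column's element type doesn't include Missing, but after a round trip through Arrow it gains Missing in the Union (see :variant_int32_string and :vector_variant_int64_string in the schema below):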

julia> rnt.vector_variant_int64_string
5-element RNTupleField{Vector{Union{Int64, String}}}:
 Union{Int64, String}["one"]
 Union{Int64, String}["one", 2]
 Union{Int64, String}["one", 2, 3]
 Union{Int64, String}["one", 2, 3, 4]
 Union{Int64, String}["one", 2, 3, 4, 5]

julia> DataFrame(rnt).vector_variant_int64_string
5-element Vector{Vector{Union{Int64, String}}}:
 ["one"]
 ["one", 2]
 ["one", 2, 3]
 ["one", 2, 3, 4]
 ["one", 2, 3, 4, 5]

julia> Arrow.write(a, DataFrame(rnt))
"/tmp/jl_Lk5W1G92XO"

julia> Arrow.Table(a)
Arrow.Table with 5 rows, 13 columns, and schema:
 :string                       String
 :vector_int32                 Vector{Int32} (alias for Array{Int32, 1})
 :array_float                  Vector{Float32} (alias for Array{Float32, 1})
 :vector_vector_int32          Vector{Vector{Int32}} (alias for Array{Array{Int32, 1}, 1})
 :vector_string                Vector{String} (alias for Array{String, 1})
 :vector_vector_string         Vector{Vector{String}} (alias for Array{Array{String, 1}, 1})
 :variant_int32_string         Union{Missing, Int32, String}
 :vector_variant_int64_string  Vector{Union{Missing, Int64, String}} (alias for Array{Union{Missing, Int64, String}, 1})
 :tuple_int32_string           NamedTuple{(:_0, :_1), Tuple{Int32, String}}
 :pair_int32_string            NamedTuple{(:_0, :_1), Tuple{Int32, String}}
 :vector_tuple_int32_string    Vector{NamedTuple{(:_0, :_1), Tuple{Int32, String}}} (alias for Array{NamedTuple{(:_0, :_1), Tuple{Int32, String}}, 1})
 :lorentz_vector               NamedTuple{(:pt, :eta, :phi, :mass), NTuple{4, Float32}}
 :array_lv                     Vector{NamedTuple{(:pt, :eta, :phi, :mass), NTuple{4, Float32}}} (alias for Array{NamedTuple{(:pt, :eta, :phi, :mass), NTuple{4, Float32}}, 1})

Moelf avatar Jan 06 '23 20:01 Moelf

Hmmmm, yes, I think I remember that for the Union types, the arrow spec makes it hard because it always allows nulls, so we default to including Missing in the Union to account for this. We can/should figure out how to do this cleaner though.
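Until that is handled inside Arrow.jl, one possible user-side workaround is to narrow the element type again after reading. The following is only a minimal sketch using Base Julia: `narrow_nested` is a hypothetical helper (not part of Arrow.jl), and `roundtripped` is a hand-built stand-in for the column that comes back from `Arrow.Table`, assuming none of its elements is actually missing.

```julia
# Hypothetical helper, not an Arrow.jl API: strip `Missing` from the inner
# element type of a vector-of-vectors column. Conversion throws an error if
# any element really is `missing`.
function narrow_nested(col::AbstractVector{<:AbstractVector})
    T = nonmissingtype(eltype(eltype(col)))
    return Vector{T}[convert(Vector{T}, collect(v)) for v in col]
end

# Stand-in for the column as it looks after the round trip (assumption).
roundtripped = [
    Union{Missing, Int64, String}["one"],
    Union{Missing, Int64, String}["one", 2],
]

restored = narrow_nested(roundtripped)
```

This copies the data, so it trades the lazy, memory-mapped access of an Arrow column for the original element type.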

quinnj avatar Jan 07 '23 22:01 quinnj

I see, but for a column of Vector{Union{T, T2}} you don't need Missing, right? The empty element would just be

Union{T,T2}[]

as in :vector_variant_int64_string, for example.
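To make the point concrete, here is a small Base-only sketch (no Arrow involved, and the names `T` and `col` are chosen just for illustration): a row with no entries is an empty vector whose element type is still the plain Union, with no Missing required.

```julia
# Element type of the inner vectors, as in :vector_variant_int64_string.
T = Union{Int64, String}

# A column whose second row has no entries at all.
col = Vector{T}[T["one"], T[], T["one", 2]]

# The empty row is represented by an empty vector, not by `missing`.
empty_row = col[2]
```

Whether Arrow's union layout can be written without a null slot is a separate question; this only shows that the Julia-side type does not need Missing to express an empty row.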

Moelf avatar Jan 08 '23 00:01 Moelf

Correctness bug, bump?

Moelf avatar Feb 01 '24 21:02 Moelf