Nullable fields don't always need Union{Missing, T}

Open evetion opened this issue 2 years ago • 2 comments

I'm trying to implement the GeoArrow spec, which gives back coordinates in a deeply nested list of a FixedList (a point). Because these lists are theoretically nullable, in Julia we get a deeply nested list with Unions of Missing, even though these vectors contain no missings. An example for a column of LineStrings (there are geometry types that require two more levels of nesting):

2-element Arrow.List{Vector{Union{Missing, Vector{Union{Missing, Tuple{Float64, Float64}}}}}

It's pretty hard to convert these elements to a concrete Vector{Vector{NTuple{2, Float64}}} without allocating. Is there a way to edit the view to be non-missing? An alternative would be to pass all(validitybitmap) in build to juliaeltype, so we only set Missing when there are actual missing values.
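
To make the "non-missing view" idea concrete, here is a minimal sketch of a lazy wrapper that presents a Union{Missing, T} vector as an AbstractVector{T} without copying. NonMissingView is a hypothetical type for illustration only, not part of Arrow.jl, and the sketch covers a single level of nesting on a plain Vector rather than a real Arrow.List.

```julia
# Hypothetical wrapper (not an Arrow.jl type): exposes an array whose eltype
# is Union{Missing, T} as an AbstractVector{T} without copying the data.
struct NonMissingView{T, A<:AbstractVector} <: AbstractVector{T}
    data::A
end

# Derive T by stripping Missing from the wrapped array's element type.
function NonMissingView(data::AbstractVector)
    T = nonmissingtype(eltype(data))
    return NonMissingView{T, typeof(data)}(data)
end

Base.size(v::NonMissingView) = size(v.data)

# The type assertion throws a TypeError if a missing is actually stored,
# so the wrapper never silently returns missing.
Base.@propagate_inbounds function Base.getindex(v::NonMissingView{T}, i::Int) where {T}
    return v.data[i]::T
end

# Stand-in for one LineString: nullable in its type, but with no missings.
points = Union{Missing, Tuple{Float64, Float64}}[(0.0, 0.0), (1.0, 2.0)]
line = NonMissingView(points)
```

Here eltype(line) is the concrete tuple type while line.data is still the original array. A nested column would need the wrapper applied at every level, which is part of why doing this inside Arrow.jl may be more attractive.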

I'm happy to make a PR if there's consensus on what to do.

Might be related to #373.

evetion, Jan 27 '23 07:01

We recently updated the Arrow.List type to return a SubArray into the underlying data array; does that help your overall issue here w/ the allocations?

Yeah, we could potentially check the validitybitmap to see if there are any missings before building the eltype, but it does make me a tad nervous about unrelated side effects it might introduce.
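
For reference, the check being discussed could look roughly like the sketch below. choose_eltype is a hypothetical helper and the validity bitmap is modelled as a plain Vector{Bool}; Arrow.jl's actual juliaeltype and validity bitmap types have different signatures, so this only illustrates the decision itself.

```julia
# Hypothetical helper for illustration; not Arrow.jl's juliaeltype.
# `validity` stands in for a validity bitmap: true means the slot holds a value.
function choose_eltype(::Type{T}, validity::AbstractVector{Bool}) where {T}
    # Only widen to Union{Missing, T} when at least one slot is null.
    return all(validity) ? T : Union{Missing, T}
end

Point = Tuple{Float64, Float64}

no_nulls = choose_eltype(Point, Bool[1, 1, 1])   # every slot valid
has_null = choose_eltype(Point, Bool[1, 0, 1])   # second slot is null
```

One consequence is that the element type of a column would then depend on the data in a given file rather than on the schema alone, which may be one of the side effects to watch for in a PR.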

I'd say let's go for a PR and then we can take a look at how much work this would actually be.

quinnj, Jun 13 '23 03:06

I don't think it's fixed:

julia> using Arrow, DataFrames

julia> col1 = Vector{Union{Int64, String}}[
        ["one", 2],
        ["one", 2, 3],
        ["one", 2, 3, 4],
        ["one", 2, 3, 4, 5]];

julia> df = DataFrame(;col1)
4×1 DataFrame
 Row │ col1
     │ Array…
─────┼───────────────────────────────────
   1 │ Union{Int64, String}["one", 2]
   2 │ Union{Int64, String}["one", 2, 3]
   3 │ Union{Int64, String}["one", 2, 3…
   4 │ Union{Int64, String}["one", 2, 3…

julia> a = tempname()
"/tmp/jl_IngNyJwngp"

julia> Arrow.write(a, df)
"/tmp/jl_IngNyJwngp"

julia> Arrow.Table(a)
Arrow.Table with 4 rows, 1 columns, and schema:
 :col1  …  SubArray{Union{Missing, Int64, String}, 1, Arrow.DenseUnion{Union{Missing, Int64, String}, Arrow.UnionT{Arrow.Flatbuf.UnionMode.Dense, nothing, Tuple{Union{Missing, Int64}, String}}, Tuple{Arrow.Primitive{Union{Missing, Int64}, Vector{Int64}}, Arrow.List{String, Int32, Vector{UInt8}}}}, Tuple{UnitRange{Int64}}, true}

julia> Arrow.Table(a).col1[1]
2-element view(::Arrow.DenseUnion{Union{Missing, Int64, String}, Arrow.UnionT{Arrow.Flatbuf.UnionMode.Dense, nothing, Tuple{Union{Missing, Int64}, String}}, Tuple{Arrow.Primitive{Union{Missing, Int64}, Vector{Int64}}, Arrow.List{String, Int32, Vector{UInt8}}}}, 1:2) with eltype Union{Missing, Int64, String}:
  "one"
 2

Moelf, Jun 13 '23 04:06