arrow-julia icon indicating copy to clipboard operation
arrow-julia copied to clipboard

Loss of parametric type information for custom types

Open omus opened this issue 4 years ago • 3 comments

When using a table where a column contains a variety of types with different parameters this information can be lost:

julia> using Arrow, Intervals

julia> table = (col = [
           Interval{Closed,Closed}(1,2),
           Interval{Open,Closed}(1,2),
           Interval{Closed,Unbounded}(1,nothing),
       ],);

julia> table.col
3-element Array{Interval{Int64,L,R} where R<:Bound where L<:Bound,1}:
 Interval{Int64,Closed,Closed}(1, 2)
 Interval{Int64,Open,Closed}(1, 2)
 Interval{Int64,Closed,Unbounded}(1, nothing)

julia> Arrow.write("ex.arrow", table);

julia> Arrow.Table("ex.arrow").col
3-element Arrow.Struct{Interval{Int64,L,R} where R<:Bound where L<:Bound,Tuple{Arrow.Primitive{Int64,Array{Int64,1}},Arrow.Primitive{Int64,Array{Int64,1}}}}:
 Interval{Int64,Closed,Closed}(1, 2)
 Interval{Int64,Closed,Closed}(1, 2)
 Interval{Int64,Closed,Closed}(1, 1)

For the particular Interval type the problem is worse as the undefined type parameters are inferred from the arguments:

julia> (Interval{Int64,L,R} where R<:Bound where L<:Bound)(1,2)
Interval{Int64,Closed,Closed}(1, 2)

omus avatar Feb 23 '21 15:02 omus

Ok, this will now correctly error on #156 PR.

quinnj avatar Mar 25 '21 17:03 quinnj

Using ArrowTypes.arrowmetadata as shown in the Intervals example in the documentation you can only serialize a column where all of the type parameters are the same. Having mixture of type parameters does not work:

table = (col = [
    Interval{Closed,Unbounded}(1,nothing),
    Interval{Unbounded,Closed}(nothing,2),
],)

omus avatar Jul 07 '21 15:07 omus

I attempted to work around this on Arrow 1.6 (not quite yet released) by storing the parametric information as part of the value as using ArrowTypes.arrowmetadata can't handle element variation. An implementation of this looks like:

using Arrow, ArrowTypes, Intervals

table = (;
    col=[
        Interval{Closed,Unbounded}(1,nothing),
        Interval{Unbounded,Closed}(nothing,2),
    ]
)

for T in (Closed, Open, Unbounded)
    name = QuoteNode(Symbol("JuliaLang.Intervals.$(string(T))"))

    @eval begin
        ArrowTypes.arrowname(::Type{$T}) = $name
        ArrowTypes.JuliaType(::Val{$name}) = $T
    end
end

let name = Symbol("JuliaLang.Intervals.Interval")
    ArrowTypes.arrowname(::Type{<:Interval{T}}) where T = name
    ArrowTypes.ArrowType(::Type{<:Interval{T}}) where T = NamedTuple{(:left, :right), Tuple{Tuple{String, T}, Tuple{String, T}}}
    function ArrowTypes.toarrow(x::Interval{T,L,R}) where {T,L,R}
        return (; left=(string(arrowname(L)), x.first), right=(string(arrowname(R)), x.last))
    end
    ArrowTypes.JuliaType(::Val{name}) = Interval
    function ArrowTypes.fromarrow(::Type{Interval}, left, right)
        T = typeof(left[2])
        L = ArrowTypes.JuliaType(Val(Symbol(left[1])))
        R = ArrowTypes.JuliaType(Val(Symbol(right[1])))
        return Interval{T,L,R}(
            L === Unbounded ? nothing : left[2],
            R === Unbounded ? nothing : right[2],
        )
    end
end

# ArrowTypes.fromarrow(Interval, ArrowTypes.toarrow(table.col[1]))

table.col
t = Arrow.Table(Arrow.tobuffer(table))
t.col
julia> table.col
2-element Vector{Interval{Int64, L, R} where {L<:Bound, R<:Bound}}:
 Interval{Int64, Closed, Unbounded}(1, nothing)
 Interval{Int64, Unbounded, Closed}(nothing, 2)

julia> t = Arrow.Table(Arrow.tobuffer(table))
Arrow.Table with 2 rows, 1 columns, and schema:
 :col  Interval

julia> t.col
2-element Arrow.Struct{Interval, Tuple{Arrow.Struct{Tuple{String, Int64}, Tuple{Arrow.List{String, Int32, Vector{UInt8}}, Arrow.Primitive{Int64, Vector{Int64}}}}, Arrow.Struct{Tuple{String, Int64}, Tuple{Arrow.List{String, Int32, Vector{UInt8}}, Arrow.Primitive{Int64, Vector{Int64}}}}}}:
 Interval{Int64, Closed, Unbounded}(1, nothing)
 Interval{Int64, Unbounded, Closed}(nothing, 2)

The main issue I had with implementing this is that the serialized instance as defined by toarrow is not what is passed into fromarrow. In my particular case I just needed to ensure that the NamedTuple I created in toarrow had the same fieldcount as Interval.

omus avatar Jul 07 '21 15:07 omus

Closing as I think we have all the tools in place to support this kind of use-case, even if it's not the most convenient. i.e. the arrow format is really built for pretty homogenous data within the bounds of individual columns, but beyond that, it doesn't fare very well with mixed-type kinds of columns.

quinnj avatar Jun 14 '23 21:06 quinnj