arrow-julia
Loss of parametric type information for custom types
When a table column contains values of a parametric type with differing type parameters, that parameter information can be lost when the table is written and read back:
julia> using Arrow, Intervals
julia> table = (col = [
           Interval{Closed,Closed}(1,2),
           Interval{Open,Closed}(1,2),
           Interval{Closed,Unbounded}(1,nothing),
       ],);
julia> table.col
3-element Array{Interval{Int64,L,R} where R<:Bound where L<:Bound,1}:
 Interval{Int64,Closed,Closed}(1, 2)
 Interval{Int64,Open,Closed}(1, 2)
 Interval{Int64,Closed,Unbounded}(1, nothing)
julia> Arrow.write("ex.arrow", table);
julia> Arrow.Table("ex.arrow").col
3-element Arrow.Struct{Interval{Int64,L,R} where R<:Bound where L<:Bound,Tuple{Arrow.Primitive{Int64,Array{Int64,1}},Arrow.Primitive{Int64,Array{Int64,1}}}}:
 Interval{Int64,Closed,Closed}(1, 2)
 Interval{Int64,Closed,Closed}(1, 2)
 Interval{Int64,Closed,Closed}(1, 1)
For the Interval type in particular the problem is worse, because the undefined type parameters are inferred from the arguments:
julia> (Interval{Int64,L,R} where R<:Bound where L<:Bound)(1,2)
Interval{Int64,Closed,Closed}(1, 2)
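To make the mechanism concrete without depending on Intervals.jl, here is a minimal sketch of how calling a partially parameterized type can silently pick default parameters. The `Bound`, `Closed`, `Open` and `Span` definitions below are hypothetical stand-ins written for this illustration, and the defaulting constructor is an assumption that mimics the behaviour shown above, not the package's actual source.

```julia
# Hypothetical stand-ins for illustration only; not the Intervals.jl definitions.
abstract type Bound end
struct Closed <: Bound end
struct Open <: Bound end

struct Span{T,L<:Bound,R<:Bound}
    first::T
    last::T
end

# Assumed constructor: when only T is given, default both bounds to Closed.
Span{T}(a, b) where {T} = Span{T,Closed,Closed}(a, b)

# The element type of a mixed column is a UnionAll with the bounds left open.
U = Span{Int,L,R} where {L<:Bound,R<:Bound}

x = Span{Int,Open,Closed}(1, 2)

# Rebuilding from the field values through the UnionAll dispatches to the
# defaulting constructor, so the original Open bound is not preserved.
y = U(x.first, x.last)

println(typeof(x))
println(typeof(y))
```

Because the field values alone carry no trace of `L` and `R`, any reconstruction that goes through the column's element type has nothing to recover them from.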
OK, this will now correctly throw an error with PR #156.
Using ArrowTypes.arrowmetadata, as shown in the Intervals example in the documentation, you can only serialize a column in which all elements have the same type parameters. A mixture of type parameters does not work:
table = (col = [
    Interval{Closed,Unbounded}(1,nothing),
    Interval{Unbounded,Closed}(nothing,2),
],)
I attempted to work around this on Arrow 1.6 (not quite released yet) by storing the parametric information as part of each value, since ArrowTypes.arrowmetadata can't handle variation between elements. An implementation of this looks like:
using Arrow, ArrowTypes, Intervals
table = (;
    col=[
        Interval{Closed,Unbounded}(1,nothing),
        Interval{Unbounded,Closed}(nothing,2),
    ]
)
for T in (Closed, Open, Unbounded)
    name = QuoteNode(Symbol("JuliaLang.Intervals.$(string(T))"))
    @eval begin
        ArrowTypes.arrowname(::Type{$T}) = $name
        ArrowTypes.JuliaType(::Val{$name}) = $T
    end
end
let name = Symbol("JuliaLang.Intervals.Interval")
    ArrowTypes.arrowname(::Type{<:Interval{T}}) where T = name
    ArrowTypes.ArrowType(::Type{<:Interval{T}}) where T = NamedTuple{(:left, :right), Tuple{Tuple{String, T}, Tuple{String, T}}}
    function ArrowTypes.toarrow(x::Interval{T,L,R}) where {T,L,R}
        return (; left=(string(arrowname(L)), x.first), right=(string(arrowname(R)), x.last))
    end
    ArrowTypes.JuliaType(::Val{name}) = Interval
    function ArrowTypes.fromarrow(::Type{Interval}, left, right)
        T = typeof(left[2])
        L = ArrowTypes.JuliaType(Val(Symbol(left[1])))
        R = ArrowTypes.JuliaType(Val(Symbol(right[1])))
        return Interval{T,L,R}(
            L === Unbounded ? nothing : left[2],
            R === Unbounded ? nothing : right[2],
        )
    end
end
# ArrowTypes.fromarrow(Interval, ArrowTypes.toarrow(table.col[1]))
table.col
t = Arrow.Table(Arrow.tobuffer(table))
t.col
julia> table.col
2-element Vector{Interval{Int64, L, R} where {L<:Bound, R<:Bound}}:
 Interval{Int64, Closed, Unbounded}(1, nothing)
 Interval{Int64, Unbounded, Closed}(nothing, 2)
julia> t = Arrow.Table(Arrow.tobuffer(table))
Arrow.Table with 2 rows, 1 columns, and schema:
 :col  Interval
julia> t.col
2-element Arrow.Struct{Interval, Tuple{Arrow.Struct{Tuple{String, Int64}, Tuple{Arrow.List{String, Int32, Vector{UInt8}}, Arrow.Primitive{Int64, Vector{Int64}}}}, Arrow.Struct{Tuple{String, Int64}, Tuple{Arrow.List{String, Int32, Vector{UInt8}}, Arrow.Primitive{Int64, Vector{Int64}}}}}}:
 Interval{Int64, Closed, Unbounded}(1, nothing)
 Interval{Int64, Unbounded, Closed}(nothing, 2)
The main issue I had while implementing this is that the serialized instance, as defined by toarrow, is not what is passed into fromarrow. In my particular case I just needed to ensure that the NamedTuple I created in toarrow had the same fieldcount as Interval.
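The idea of carrying the type parameters inside each value can be sketched in plain Julia, independent of Arrow. Everything below (`Span`, `bound_tag`, `bound_type`, `encode`, `decode`) is a hypothetical name invented for this illustration. The sketch assumes, as the `fromarrow(::Type{Interval}, left, right)` signature above suggests, that the decoder receives the stored fields as separate positional arguments rather than the NamedTuple itself.

```julia
# Hypothetical stand-ins for illustration only; not Arrow.jl or Intervals.jl API.
abstract type Bound end
struct Closed <: Bound end
struct Open <: Bound end

struct Span{T,L<:Bound,R<:Bound}
    first::T
    last::T
end

# Map each bound type to a string tag and back.
bound_tag(::Type{Closed}) = "closed"
bound_tag(::Type{Open}) = "open"
function bound_type(tag::AbstractString)
    tag == "closed" && return Closed
    tag == "open" && return Open
    error("unknown bound tag: $tag")
end

# Serialize one (tag, value) pair per endpoint. The result has two fields,
# the same field count as Span.
function encode(x::Span{T,L,R}) where {T,L,R}
    return (; left=(bound_tag(L), x.first), right=(bound_tag(R), x.last))
end

# Deserialize from the individual fields, rebuilding the full parametric type.
function decode(left, right)
    T = typeof(left[2])
    return Span{T,bound_type(left[1]),bound_type(right[1])}(left[2], right[2])
end

col = [Span{Int,Closed,Open}(1, 2), Span{Int,Open,Closed}(3, 4)]

# Splatting the NamedTuple passes its values as positional arguments.
roundtrip = [decode(encode(x)...) for x in col]
```

The trade-off is that every row repeats its bound tags as strings, which costs space but lets a single column hold elements with different parameters.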
Closing, as I think we have all the tools in place to support this kind of use case, even if it's not the most convenient. The Arrow format is really built for fairly homogeneous data within individual columns; beyond that, it doesn't fare well with mixed-type columns.