arrow-julia
DataAPI.jl integration questions
I have several questions regarding the DataAPI.jl integration of Arrow.jl. They mostly stem from the fact that I do not know the details of the Arrow.jl implementation, so I might be asking about something obvious:
- Why does `DataAPI.refpool(x::DictEncoded) = copy(x.encoding.data)` perform a copy? I ask because the copy will negatively affect the performance of `groupby` and `join*` in DataFrames.jl.
- Similarly, we now have `DataAPI.refarray(x::DictEncoded{T, S}) where {T, S} = x.indices .+ one(S)`, which allocates. In other packages (CategoricalArrays.jl, PooledArrays.jl) we have an implementation that does not allocate (again, allocation will negatively affect performance).
- Why do we not try sorting the levels in `DataAPI.levels(x::DictEncoded)`? Also, it seems that instead of `deleteat!(rp, ismissing.(rp))` we could just use `collect` over a `skipmissing` wrapper (and this, combined with a non-copying `rp = DataAPI.refpool(x)` as suggested above, should yield a faster implementation).
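To make the suggestions above concrete, here is a minimal sketch of what I have in mind. It uses only Base Julia and a hypothetical stand-in type, `ToyDictEncoded`, since I do not know the internals of `DictEncoded`; the names `OffsetCodes`, `toy_refpool`, `toy_refarray`, and `toy_levels` are made up for illustration and are not part of Arrow.jl or DataAPI.jl. The lazy wrapper is just one possible way to avoid the allocation, not necessarily what other packages do.

```julia
# Hypothetical stand-in for a dictionary-encoded column (NOT Arrow.jl's DictEncoded).
struct ToyDictEncoded{T,S<:Integer}
    pool::Vector{T}      # unique values, possibly including `missing`
    indices::Vector{S}   # 0-based codes into `pool`
end

# Lazy view that shifts 0-based codes to 1-based on access, without materializing them.
struct OffsetCodes{S<:Integer} <: AbstractVector{S}
    codes::Vector{S}
end
Base.size(r::OffsetCodes) = size(r.codes)
Base.IndexStyle(::Type{<:OffsetCodes}) = IndexLinear()
Base.getindex(r::OffsetCodes{S}, i::Int) where {S} = r.codes[i] + one(S)

# Non-copying pool accessor: returns the stored vector itself.
toy_refpool(x::ToyDictEncoded) = x.pool

# Lazy 1-based reference array: wraps the codes instead of computing `indices .+ 1`.
toy_refarray(x::ToyDictEncoded) = OffsetCodes(x.indices)

# Levels: skip `missing` lazily and collect once; sorting is optional.
function toy_levels(x::ToyDictEncoded; sorted::Bool=false)
    lv = collect(skipmissing(toy_refpool(x)))
    return sorted ? sort!(lv) : lv
end

x = ToyDictEncoded(Union{String,Missing}["b", missing, "a"], Int32[0, 2, 1, 0])
levels_unsorted = toy_levels(x)
levels_sorted = toy_levels(x; sorted=true)
refs = toy_refarray(x)
```

A non-copying pool is of course only safe if callers never mutate the returned vector, which may well be the reason for the current `copy`.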
Thank you!