DataAPI.jl integration questions

Open bkamins opened this issue 4 years ago • 0 comments

I have several questions about the DataAPI.jl integration in Arrow.jl. Most of them stem from the fact that I do not know the details of the Arrow.jl implementation, so I might be asking about something obvious:

  1. Why does `DataAPI.refpool(x::DictEncoded) = copy(x.encoding.data)` perform a copy? I ask because the copy will hurt the performance of `groupby` and `join*` in DataFrames.jl.
  2. Similarly, we now have `DataAPI.refarray(x::DictEncoded{T, S}) where {T, S} = x.indices .+ one(S)`, which allocates. Other packages (CategoricalArrays.jl, PooledArrays.jl) have implementations that do not allocate (again, the allocation will hurt performance).
  3. Why does `DataAPI.levels(x::DictEncoded)` not try to sort the levels? Also, it seems that instead of `deleteat!(rp, ismissing.(rp))` we could just `collect` a `skipmissing` wrapper (combined with a non-copying `rp = DataAPI.refpool(x)` as suggested above, this should yield a faster implementation).
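
To make the three suggestions concrete, here is a minimal sketch using only Base Julia. `ToyDictEncoded`, `OneBasedRefs`, and the `refpool_nocopy` / `refarray_eager` / `refarray_lazy` / `levels_sorted` functions are hypothetical names made up for illustration; they are not Arrow.jl's actual types or its DataAPI.jl methods, and the real `DictEncoded` layout may differ.

```julia
# Hypothetical stand-in for a dictionary-encoded column (NOT Arrow.jl's type).
struct ToyDictEncoded{T,S<:Integer}
    indices::Vector{S}   # zero-based codes into the pool
    pool::Vector{T}      # dictionary values; may contain `missing`
end

# Lazy one-based view over the zero-based codes: wraps the data, copies nothing.
struct OneBasedRefs{S<:Integer} <: AbstractVector{S}
    indices::Vector{S}
end
Base.size(r::OneBasedRefs) = size(r.indices)
Base.IndexStyle(::Type{<:OneBasedRefs}) = IndexLinear()
Base.@propagate_inbounds Base.getindex(r::OneBasedRefs{S}, i::Int) where {S} =
    r.indices[i] + one(S)

# (1) Return the pool itself; callers must treat it as read-only.
refpool_nocopy(x::ToyDictEncoded) = x.pool

# (2) Eager version (allocates a new vector) vs. lazy version (only wraps).
refarray_eager(x::ToyDictEncoded{T,S}) where {T,S} = x.indices .+ one(S)
refarray_lazy(x::ToyDictEncoded) = OneBasedRefs(x.indices)

# (3) Drop missings via `skipmissing`, then sort; assumes the values are orderable.
levels_sorted(x::ToyDictEncoded) = sort!(collect(skipmissing(refpool_nocopy(x))))

x = ToyDictEncoded(Int32[0, 2, 1, 0], ["b", missing, "a"])
```

With this toy column, `refarray_lazy(x)` compares equal to `refarray_eager(x)` element by element, and `levels_sorted(x)` returns a fresh `Vector{String}`, so the shared pool is never mutated. Whether sharing the pool is safe for Arrow.jl's underlying buffers is exactly the open part of question 1.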

Thank you!

bkamins, May 05 '21 06:05