
Missing values are not handled when converting `DictEncoded` to `PooledArray` via `copy`

Open junyuan-chen opened this issue 4 years ago • 0 comments

Copying a column of type `DictEncoded` does not handle its missing values: the resulting `PooledArray` has no `missing` in its pool. As a result, accessing an element that should be `missing` raises an `UndefRefError`.

Here is an illustration with an example data file: cat_with_missing.feather.zip

using Arrow
tb = Arrow.Table("cat_with_missing.feather")
julia> tb.A
3-element Arrow.DictEncoded{Union{Missing, String}, Int8, Arrow.List{Union{Missing, String}, Int32, Vector{UInt8}}}:
 missing
 "a"
 "b"

julia> tb.A[1]
missing

julia> A = copy(tb.A)
3-element PooledArrays.PooledVector{Union{Missing, String}, Int8, Vector{Int8}}:
 #undef
    "a"
    "b"

julia> A[1]
ERROR: UndefRefError: access to undefined reference
Stacktrace:
 [1] getindex(A::PooledArrays.PooledVector{Union{Missing, String}, Int8, Vector{Int8}}, I::Int64)
   @ PooledArrays ~/.julia/packages/PooledArrays/CV8kA/src/PooledArrays.jl:451
   ...

The current implementation of `copy` uses the encoding data directly as the pool, but that data does not contain `missing`:

julia> tb.A.encoding.data
2-element Arrow.List{Union{Missing, String}, Int32, Vector{UInt8}}:
 "a"
 "b"
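
For illustration, here is a minimal sketch in plain Julia of what a missing-aware conversion could look like. It does not use Arrow.jl or PooledArrays.jl internals: `dictionary`, `codes`, `valid`, and `pooled_with_missing` are hypothetical stand-ins, and it assumes the column stores 0-based codes into the dictionary plus a separate per-element validity mask, which is consistent with `tb.A[1]` returning `missing` while the encoding data holds only `"a"` and `"b"`.

```julia
# Hypothetical stand-ins for a dictionary-encoded column (not Arrow.jl's real fields):
# a dictionary of values, 0-based codes into it, and a per-element validity mask.
dictionary = ["a", "b"]
codes = Int8[0, 0, 1]          # the code stored in a null slot is arbitrary
valid = [false, true, true]    # false marks a missing element

# Build a pool that contains `missing` when needed, and 1-based refs into that pool.
function pooled_with_missing(dictionary, codes, valid)
    pool = Union{Missing, eltype(dictionary)}[dictionary...]
    refs = Vector{Int}(undef, length(codes))
    missing_ref = 0            # 0 means no slot for `missing` has been added yet
    for i in eachindex(codes)
        if valid[i]
            refs[i] = Int(codes[i]) + 1    # shift 0-based code to 1-based ref
        else
            if missing_ref == 0
                push!(pool, missing)       # add `missing` to the pool once
                missing_ref = length(pool)
            end
            refs[i] = missing_ref
        end
    end
    return pool, refs
end

pool, refs = pooled_with_missing(dictionary, codes, valid)
decoded = pool[refs]           # every ref now points at a defined pool entry
```

The point of the sketch is only that null slots need a ref that points at a real `missing` entry in the pool; how this should map onto the actual `copy` method depends on how `PooledArray` is meant to represent missing values.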

This seems to be related to ongoing work on changing how missing values are represented in a `PooledArray`.

junyuan-chen, Jun 30 '21 18:06