arrow-julia
Missing values are not handled when converting `DictEncoded` to `PooledArray` via `copy`
When a column of type `DictEncoded` is copied, its missing values are not handled: the resulting `PooledArray` has no `missing` in its pool. As a result, accessing an element that should be missing raises an `UndefRefError`.
Here is an illustration with an example data file: cat_with_missing.feather.zip
```julia
using Arrow
tb = Arrow.Table("cat_with_missing.feather")
julia> tb.A
3-element Arrow.DictEncoded{Union{Missing, String}, Int8, Arrow.List{Union{Missing, String}, Int32, Vector{UInt8}}}:
missing
"a"
"b"
julia> tb.A[1]
missing
julia> A = copy(tb.A)
3-element PooledArrays.PooledVector{Union{Missing, String}, Int8, Vector{Int8}}:
#undef
"a"
"b"
julia> A[1]
ERROR: UndefRefError: access to undefined reference
Stacktrace:
[1] getindex(A::PooledArrays.PooledVector{Union{Missing, String}, Int8, Vector{Int8}}, I::Int64)
@ PooledArrays ~/.julia/packages/PooledArrays/CV8kA/src/PooledArrays.jl:451
...
```
The current implementation of `copy` uses the encoding data directly as the pool, but that data does not contain `missing`:
```julia
julia> tb.A.encoding.data
2-element Arrow.List{Union{Missing, String}, Int32, Vector{UInt8}}:
"a"
"b"
```
This seems to be related to the ongoing work on how missing values should be represented in a `PooledArray`.
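To make the mismatch concrete, here is a minimal sketch in plain Julia of what decoding needs to do when the pool itself has no `missing` entry. The names `pool`, `indices`, `validity`, and `decode` are hypothetical stand-ins, not Arrow.jl internals, and the layout (0-based indices plus a per-element validity flag) is an assumption based on how dictionary-encoded data is generally described.

```julia
# Hypothetical stand-ins for the pieces of a dictionary-encoded column.
pool     = ["a", "b"]            # dictionary values; note there is no `missing` entry
indices  = Int8[0, 0, 1]         # assumed 0-based references into `pool`
validity = [false, true, true]   # assumed flag per element; false marks a missing value

# Decode element by element, checking the validity flag before touching the pool.
function decode(pool, indices, validity)
    out = Vector{Union{Missing, eltype(pool)}}(undef, length(indices))
    for i in eachindex(indices)
        out[i] = validity[i] ? pool[indices[i] + 1] : missing
    end
    return out
end

decoded = decode(pool, indices, validity)
```

With a real table, `collect(tb.A)` may serve as a workaround in the meantime, assuming it goes through the same `getindex` path that returns `missing` for `tb.A[1]` above; I have not verified this beyond the session shown.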