Creating DictEncoded in the presence of missing values

Open dmbates opened this issue 3 years ago • 2 comments

When a column that contains missing values, e.g. a PooledArray, is converted to DictEncoded, the dictionary is built from the result of DataAPI.refpool, which includes missing. As a result, both the dictionary and the index vector contain missing values, and conversion to pandas fails (see the traceback below). The missing value in the dictionary can be dropped because it is never referenced in the index vector.

julia> using Arrow, DataAPI, PooledArrays

julia> tbl = (; a = PooledArray([missing, "a", "b", "a"]))
(a = Union{Missing, String}[missing, "a", "b", "a"],)

julia> DataAPI.refarray(tbl.a)
4-element Vector{UInt32}:
 0x00000001
 0x00000002
 0x00000003
 0x00000002

julia> DataAPI.refpool(tbl.a)
3-element Vector{Union{Missing, String}}:
 missing
 "a"
 "b"

julia> Arrow.write("tbl.arrow", tbl)
"tbl.arrow"

In the read_table result we see that there is a null in the dictionary at Python index 0 that is never referenced in the indices vector.

$ python
Python 3.10.6 | packaged by conda-forge | (main, Aug 22 2022, 20:35:26) [GCC 10.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow.feather as fea
>>> fea.read_table("tbl.arrow")
pyarrow.Table
a: dictionary<values=string, indices=int8, ordered=0>
----
a: [  -- dictionary:
[null,"a","b"]  -- indices:
[null,1,2,1]]
>>> fea.read_feather('nyc_mv_collisions_202201.arrow')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/bates/.julia/conda/3/lib/python3.10/site-packages/pyarrow/feather.py", line 231, in read_feather
    return (read_table(
  File "pyarrow/array.pxi", line 823, in pyarrow.lib._PandasConvertible.to_pandas
  File "pyarrow/table.pxi", line 3913, in pyarrow.lib.Table._to_pandas
  File "/home/bates/.julia/conda/3/lib/python3.10/site-packages/pyarrow/pandas_compat.py", line 818, in table_to_blockmanager
    blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes)
  File "/home/bates/.julia/conda/3/lib/python3.10/site-packages/pyarrow/pandas_compat.py", line 1170, in _table_to_blocks
    return [_reconstruct_block(item, columns, extension_columns)
  File "/home/bates/.julia/conda/3/lib/python3.10/site-packages/pyarrow/pandas_compat.py", line 1170, in <listcomp>
    return [_reconstruct_block(item, columns, extension_columns)
  File "/home/bates/.julia/conda/3/lib/python3.10/site-packages/pyarrow/pandas_compat.py", line 757, in _reconstruct_block
    cat = _pandas_api.categorical_type.from_codes(
  File "/home/bates/.julia/conda/3/lib/python3.10/site-packages/pandas/core/arrays/categorical.py", line 687, in from_codes
    dtype = CategoricalDtype._from_values_or_dtype(
  File "/home/bates/.julia/conda/3/lib/python3.10/site-packages/pandas/core/dtypes/dtypes.py", line 299, in _from_values_or_dtype
    dtype = CategoricalDtype(categories, ordered)
  File "/home/bates/.julia/conda/3/lib/python3.10/site-packages/pandas/core/dtypes/dtypes.py", line 186, in __init__
    self._finalize(categories, ordered, fastpath=False)
  File "/home/bates/.julia/conda/3/lib/python3.10/site-packages/pandas/core/dtypes/dtypes.py", line 340, in _finalize
    categories = self.validate_categories(categories, fastpath=fastpath)
  File "/home/bates/.julia/conda/3/lib/python3.10/site-packages/pandas/core/dtypes/dtypes.py", line 534, in validate_categories
    raise ValueError("Categorical categories cannot be null")
ValueError: Categorical categories cannot be null

One possible approach is to check the refpool for missing, find its index, delete that entry from the refpool, and rewrite the refarray so that this index becomes missing and the indices of the entries after it shift down by one.
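The remapping described above can be sketched on plain lists. This is only an illustration of the idea, not Arrow.jl code: the function name `drop_null_from_dictionary` is hypothetical, `None` stands in for `missing`/null, and indices are 0-based as in the pyarrow output rather than 1-based as in the Julia refarray.

```python
def drop_null_from_dictionary(dictionary, indices):
    """Return (new_dictionary, new_indices) with null entries removed.

    Hypothetical helper for illustration only. Any index that pointed at a
    null dictionary entry becomes None; all other indices are shifted so
    that they still point at the same value.
    """
    remap = []           # old index -> new index, or None for a null entry
    new_dictionary = []
    for value in dictionary:
        if value is None:
            remap.append(None)
        else:
            remap.append(len(new_dictionary))
            new_dictionary.append(value)
    # Nulls already present in the index vector are kept as they are.
    new_indices = [None if i is None else remap[i] for i in indices]
    return new_dictionary, new_indices


# The dictionary and indices that pyarrow reports for tbl.arrow above.
new_dictionary, new_indices = drop_null_from_dictionary(
    [None, "a", "b"], [None, 1, 2, 1]
)
print(new_dictionary)  # ['a', 'b']
print(new_indices)     # [None, 0, 1, 0]
```

Building a full old-to-new index map, rather than special-casing a single position, also covers a dictionary with several nulls or a null that is not in the first slot.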

dmbates Nov 09 '22 15:11

I see that the Arrow Columnar Format section of the Arrow docs explicitly says that duplicates and null values are allowed in the dictionary, but that the null count is always the number of nulls in the index array.

Because the index of any null in the dictionary is replaced by null in the index array, nulls in the dictionary are never referenced by an index. It seems that it would be more effective to adopt the Python convention and remove the null from the dictionary after propagating it to the index array.
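A small sketch of why the two layouts are interchangeable, again on plain lists with `None` standing in for null. The helpers `decode` and `null_count` are hypothetical names used only for this illustration; they are not pyarrow or Arrow.jl functions.

```python
def decode(dictionary, indices):
    """Expand a dictionary-encoded column into its logical values."""
    return [None if i is None else dictionary[i] for i in indices]


def null_count(indices):
    """Count nulls from the index vector only, as the format describes."""
    return sum(i is None for i in indices)


# Layout currently written: null kept in the dictionary but never referenced.
with_null = ([None, "a", "b"], [None, 1, 2, 1])
# Layout after removing the null from the dictionary and shifting the indices.
without_null = (["a", "b"], [None, 0, 1, 0])

print(decode(*with_null))      # [None, 'a', 'b', 'a']
print(decode(*without_null))   # [None, 'a', 'b', 'a']
print(null_count(with_null[1]), null_count(without_null[1]))  # 1 1
```

Both layouts decode to the same values and report the same null count, so dropping the unreferenced null changes only the dictionary that consumers such as pandas see.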

The current result is not "wrong" according to the Format description but it is awkward.

dmbates Nov 11 '22 17:11

Hmmmm... yeah, this is a tough one. Ideally, they would support this, since the format explicitly allows it. I'll see if I can play around with this a bit, but from an initial stab, it's not as trivial as I hoped.

quinnj Nov 29 '22 06:11