Creating DictEncoded in the presence of missing values
When a column that contains missing values, e.g. a PooledArray column, is converted to DictEncoded, the dictionary is built from the result of DataAPI.refpool, which includes missing. As a result both the dictionary and the index vector contain missing values, which confuses Pandas. The missing value in the dictionary could be skipped because it is never referenced from the index vector.
julia> using Arrow, DataAPI, PooledArrays
julia> tbl = (; a = PooledArray([missing, "a", "b", "a"]))
(a = Union{Missing, String}[missing, "a", "b", "a"],)
julia> DataAPI.refarray(tbl.a)
4-element Vector{UInt32}:
0x00000001
0x00000002
0x00000003
0x00000002
julia> DataAPI.refpool(tbl.a)
3-element Vector{Union{Missing, String}}:
missing
"a"
"b"
julia> Arrow.write("tbl.arrow", tbl)
"tbl.arrow"
In the read_table result we see that there is a null in the dictionary at Python index 0 that is never referenced in the indices vector. Converting to a pandas DataFrame with read_feather (shown below on a larger file that exhibits the same problem) then fails because pandas does not allow null categories.
$ python
Python 3.10.6 | packaged by conda-forge | (main, Aug 22 2022, 20:35:26) [GCC 10.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyarrow.feather as fea
>>> fea.read_table("tbl.arrow")
pyarrow.Table
a: dictionary<values=string, indices=int8, ordered=0>
----
a: [ -- dictionary:
[null,"a","b"] -- indices:
[null,1,2,1]]
>>> fea.read_feather('nyc_mv_collisions_202201.arrow')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/bates/.julia/conda/3/lib/python3.10/site-packages/pyarrow/feather.py", line 231, in read_feather
return (read_table(
File "pyarrow/array.pxi", line 823, in pyarrow.lib._PandasConvertible.to_pandas
File "pyarrow/table.pxi", line 3913, in pyarrow.lib.Table._to_pandas
File "/home/bates/.julia/conda/3/lib/python3.10/site-packages/pyarrow/pandas_compat.py", line 818, in table_to_blockmanager
blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes)
File "/home/bates/.julia/conda/3/lib/python3.10/site-packages/pyarrow/pandas_compat.py", line 1170, in _table_to_blocks
return [_reconstruct_block(item, columns, extension_columns)
File "/home/bates/.julia/conda/3/lib/python3.10/site-packages/pyarrow/pandas_compat.py", line 1170, in <listcomp>
return [_reconstruct_block(item, columns, extension_columns)
File "/home/bates/.julia/conda/3/lib/python3.10/site-packages/pyarrow/pandas_compat.py", line 757, in _reconstruct_block
cat = _pandas_api.categorical_type.from_codes(
File "/home/bates/.julia/conda/3/lib/python3.10/site-packages/pandas/core/arrays/categorical.py", line 687, in from_codes
dtype = CategoricalDtype._from_values_or_dtype(
File "/home/bates/.julia/conda/3/lib/python3.10/site-packages/pandas/core/dtypes/dtypes.py", line 299, in _from_values_or_dtype
dtype = CategoricalDtype(categories, ordered)
File "/home/bates/.julia/conda/3/lib/python3.10/site-packages/pandas/core/dtypes/dtypes.py", line 186, in __init__
self._finalize(categories, ordered, fastpath=False)
File "/home/bates/.julia/conda/3/lib/python3.10/site-packages/pandas/core/dtypes/dtypes.py", line 340, in _finalize
categories = self.validate_categories(categories, fastpath=fastpath)
File "/home/bates/.julia/conda/3/lib/python3.10/site-packages/pandas/core/dtypes/dtypes.py", line 534, in validate_categories
raise ValueError("Categorical categories cannot be null")
ValueError: Categorical categories cannot be null
One possible approach is to check for missing in the refpool, find its index, delete it from the refpool, and rewrite the refarray to replace references to that index with missing.
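A rough sketch of that remapping at the refpool/refarray level (this is not a patch to the DictEncoded writer; the name strip_missing_dictionary is made up for illustration, and the indices here are 1-based whereas Arrow stores them 0-based):

using DataAPI

# Hypothetical sketch: given a column whose refpool contains `missing`,
# return a dictionary without the missing entry together with an index
# vector in which the old reference to `missing` becomes `missing`
# (i.e. a null) and references past the removed slot shift down by one.
function strip_missing_dictionary(col)
    pool = DataAPI.refpool(col)
    refs = DataAPI.refarray(col)
    mi = findfirst(ismissing, pool)
    if mi === nothing
        return collect(pool), collect(refs)        # nothing to rewrite
    end
    dict = collect(skipmissing(pool))               # dictionary without the null
    inds = [r == mi ? missing : Int(r) - (r > mi) for r in refs]
    return dict, inds
end

For the tbl.a column above this gives (["a", "b"], Union{Missing, Int64}[missing, 1, 2, 1]), i.e. the null survives only in the index vector, which is what pandas expects.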
I see that the Arrow Columnar Format section of the Arrow docs explicitly says that duplicates and null values are allowed in the dictionary but the null count is always the number of nulls in the index array.
Because the index of any null in the dictionary is replaced by null in the index array, nulls in the dictionary are never referenced by an index. It seems that it would be more effective to adopt the pyarrow convention and remove the null from the dictionary after propagating it to the index array.
The current result is not "wrong" according to the Format description but it is awkward.
Hmmmm.....yeah, this is a tough one. Ideally, they would support this since the format explicitly allows it. I'll see if I can play around with this a bit, but from an initial stab, it's not as trivial as I hoped.