BUG: Arrow Binary View Types Don't Print When containing missing values
### Pandas version checks
- [X] I have checked that this issue has not already been reported.
- [X] I have confirmed this bug exists on the latest version of pandas.
- [X] I have confirmed this bug exists on the main branch of pandas.
### Reproducible Example
Note that this example produces output:
In [10]: import pandas as pd
...: import pyarrow as pa
...:
...: ser = pd.Series(["foo", "longer_than_binary_prefix", ''], dtype=pd.ArrowDtype(pa.string_view()))
In [11]: ser
Out[11]:
0 foo
1 longer_than_binary_prefix
2
dtype: string_view[pyarrow]
While this does not:
In [12]: import pandas as pd
...: import pyarrow as pa
...:
...: ser = pd.Series(["foo", "longer_than_binary_prefix", None], dtype=pd.ArrowDtype(pa.string_view()))
In [13]: ser
Out[13]:
This might actually be an upstream bug with pyarrow (@jorisvandenbossche typically knows best)
### Issue Description
When a string_view Series contains a missing value, its repr is empty: neither the values nor the dtype line are printed.
### Expected Behavior
The Series should print its values and dtype, as it does when no value is missing.
### Installed Versions
In [14]: pa.__version__
Out[14]: '17.0.0'
In [15]: pd.__version__
Out[15]: '2.2.3+44.g3dfa33cf2d'
Quick check: calling to_string() explicitly does raise an error:
In [12]: ser.to_string()
...
File ~/scipy/repos/pandas/pandas/core/arrays/arrow/array.py:1458, in ArrowExtensionArray.to_numpy(self, dtype, copy, na_value)
1456 mask = data.isna()
1457 result[mask] = na_value
-> 1458 result[~mask] = data[~mask]._pa_array.to_numpy()
1459 return result
File ~/scipy/repos/pandas/pandas/core/arrays/arrow/array.py:591, in ArrowExtensionArray.__getitem__(self, item)
589 return self.take(item)
590 elif item.dtype.kind == "b":
--> 591 return type(self)(self._pa_array.filter(item))
592 else:
593 raise IndexError(
594 "Only integers, slices and integer or "
595 "boolean arrays are valid indices."
596 )
File ~/conda/envs/dev/lib/python3.11/site-packages/pyarrow/table.pxi:959, in pyarrow.lib.ChunkedArray.filter()
File ~/conda/envs/dev/lib/python3.11/site-packages/pyarrow/compute.py:264, in _make_generic_wrapper.<locals>.wrapper(memory_pool, options, *args, **kwargs)
262 if args and isinstance(args[0], Expression):
263 return Expression._call(func_name, list(args), options)
--> 264 return func.call(args, options, memory_pool)
File ~/conda/envs/dev/lib/python3.11/site-packages/pyarrow/_compute.pyx:385, in pyarrow._compute.Function.call()
File ~/conda/envs/dev/lib/python3.11/site-packages/pyarrow/error.pxi:155, in pyarrow.lib.pyarrow_internal_check_status()
File ~/conda/envs/dev/lib/python3.11/site-packages/pyarrow/error.pxi:92, in pyarrow.lib.check_status()
ArrowNotImplementedError: Function 'array_filter' has no kernel matching input types (string_view, bool)
Essentially, the repr tries to convert the data to numpy, and that step fails because the array_filter kernel is not implemented for string_view.
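As a rough illustration of a conversion that does not go through `array_filter` at all, the sketch below materializes the values with `to_pylist()` and fills in the missing-value marker in plain Python. `to_object_ndarray` is a hypothetical helper, not part of pandas or pyarrow. It assumes `to_pylist()` works for the view type in the installed pyarrow version, and it will be slower than a kernel-based path.

```python
import numpy as np


def to_object_ndarray(arrow_values, na_value=None):
    """Hypothetical helper: build an object ndarray without any Arrow compute kernel.

    `arrow_values` is anything exposing `to_pylist()`, such as a
    pyarrow Array or ChunkedArray; nulls come back as None.
    """
    py_values = arrow_values.to_pylist()
    result = np.empty(len(py_values), dtype=object)
    for i, value in enumerate(py_values):
        # Replace Arrow nulls with the requested missing-value marker.
        result[i] = na_value if value is None else value
    return result
```

With the Series from the report this could be called as `to_object_ndarray(ser.array._pa_array, na_value=pd.NA)`, where `_pa_array` is the private attribute that shows up in the traceback.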
Some quick thoughts:
- Ideally `to_numpy()` would work regardless of whether filter is implemented (although hopefully a future pyarrow release will support this).
- The empty repr is quite confusing. Probably printing something like `pandas.Series <exception occurred while creating the repr>` would be more useful?
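To make the placeholder idea concrete, here is a minimal sketch of a repr wrapper that falls back to such a message when building the repr raises. `safe_repr` and `Broken` are hypothetical names used only for illustration; this is not how pandas or IPython currently handle the failure.

```python
def safe_repr(obj):
    """Hypothetical helper: return repr(obj), or a placeholder if repr raises."""
    try:
        return repr(obj)
    except Exception as exc:
        # Name the type and the exception class so the failure stays visible.
        type_name = f"{type(obj).__module__}.{type(obj).__qualname__}"
        return (
            f"{type_name} <exception occurred while creating the repr: "
            f"{type(exc).__name__}>"
        )


class Broken:
    """Stand-in for an object whose repr fails, like the Series in this report."""

    def __repr__(self):
        raise NotImplementedError("no kernel for string_view")


print(safe_repr([1, 2]))
print(safe_repr(Broken()))
```

Including the exception class in the placeholder is a design choice of this sketch; it points the reader at the underlying error without dumping a full traceback into the output.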
> Probably printing something like `pandas.Series <exception occurred while creating the repr>` would be more useful?
Makes sense for the series, but would this affect the repr when contained within a dataframe?
@WillAyd - should the title be Arrow String View? Want to make sure I'm understanding the issue.
I don't think so - binary view is the terminology used by the arrow specification, which generally covers what you may be thinking of as bytes and strings:
https://arrow.apache.org/docs/format/Columnar.html#variable-size-binary-view-layout
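To illustrate why one layout covers both strings and bytes, here is a small sketch of the 16-byte view as I read the linked section: a 4-byte length, followed either by up to 12 bytes of inline data or by a 4-byte prefix, a buffer index, and an offset. `encode_view` is a hypothetical helper for illustration only, not a pyarrow API, and the default `buffer_index` and `offset` values are placeholders.

```python
import struct

INLINE_LIMIT = 12  # payload bytes that fit inside the 16-byte view itself


def encode_view(value, buffer_index=0, offset=0):
    """Hypothetical helper: pack one value into a 16-byte binary view."""
    # Strings and bytes share the layout; strings are simply UTF-8 bytes.
    data = value.encode("utf-8") if isinstance(value, str) else bytes(value)
    length = len(data)
    if length <= INLINE_LIMIT:
        # Short value: length, then the data itself, zero-padded to 12 bytes.
        return struct.pack("<i12s", length, data)
    # Long value: length, 4-byte prefix, then where the full data lives.
    return struct.pack("<i4sii", length, data[:4], buffer_index, offset)


short_view = encode_view("foo")
long_view = encode_view("longer_than_binary_prefix")
print(len(short_view), len(long_view))
```

Under this reading, "foo" is stored entirely inline, while "longer_than_binary_prefix" keeps only its first four bytes in the view and refers to a separate data buffer for the rest.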
The same issue occurs with the binary_view pyarrow type as well
I found a flavor of this that I've documented in #60068; it seems to be more general than the missing-values case. The ArrowDtype.type property accessor does not have a case to handle pa.string_view / pa.binary_view.
If you just want a quick fix:
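import pandas as pd
import pyarrow as pa

# Keep a reference to the original property so other types still resolve.
old_ArrowDtype_type = pd.ArrowDtype.type

@property
def ArrowDtype_type(self):
    # Map string_view to str, which the original accessor does not handle.
    if pa.types.is_string_view(self.pyarrow_dtype):
        return str
    # The saved object is a property, so call its getter, not the property itself.
    return old_ArrowDtype_type.fget(self)

pd.ArrowDtype.type = ArrowDtype_type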
In [1]: tab = pa.table({"names": pa.array(["a", "b", "c"], type=pa.string_view())})
...:
...: tab.to_pandas(types_mapper=pd.ArrowDtype)
Out[1]:
names
0 a
1 b
2 c