BUG: Arrow Binary View Types Don't Print When containing missing values

Open WillAyd opened this issue 1 year ago • 6 comments

### Pandas version checks

  • [X] I have checked that this issue has not already been reported.

  • [X] I have confirmed this bug exists on the latest version of pandas.

  • [X] I have confirmed this bug exists on the main branch of pandas.

### Reproducible Example

Note that this example produces output:


In [10]: import pandas as pd
    ...: import pyarrow as pa
    ...: 
    ...: ser = pd.Series(["foo", "longer_than_binary_prefix", ''], dtype=pd.ArrowDtype(pa.string_view()))

In [11]: ser
Out[11]: 
0                          foo
1    longer_than_binary_prefix
2                             
dtype: string_view[pyarrow]

While this does not:

In [12]: import pandas as pd
    ...: import pyarrow as pa
    ...: 
    ...: ser = pd.Series(["foo", "longer_than_binary_prefix", None], dtype=pd.ArrowDtype(pa.string_view()))

In [13]: ser
Out[13]: 

This might actually be an upstream bug with pyarrow (@jorisvandenbossche typically knows best)



### Issue Description

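The repr of a Series backed by an Arrow binary view type such as string_view comes back empty when the Series contains a missing value: nothing is printed and no error is shown.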

### Expected Behavior

The Series should print its values, including the missing entry.

### Installed Versions

In [14]: pa.__version__
Out[14]: '17.0.0'

In [15]: pd.__version__
Out[15]: '2.2.3+44.g3dfa33cf2d'

WillAyd avatar Sep 24 '24 21:09 WillAyd

Quickly checking, if I call to_string() explicitly, it does error:

In [12]: ser.to_string()
...
File ~/scipy/repos/pandas/pandas/core/arrays/arrow/array.py:1458, in ArrowExtensionArray.to_numpy(self, dtype, copy, na_value)
   1456     mask = data.isna()
   1457     result[mask] = na_value
-> 1458     result[~mask] = data[~mask]._pa_array.to_numpy()
   1459 return result

File ~/scipy/repos/pandas/pandas/core/arrays/arrow/array.py:591, in ArrowExtensionArray.__getitem__(self, item)
    589     return self.take(item)
    590 elif item.dtype.kind == "b":
--> 591     return type(self)(self._pa_array.filter(item))
    592 else:
    593     raise IndexError(
    594         "Only integers, slices and integer or "
    595         "boolean arrays are valid indices."
    596     )

File ~/conda/envs/dev/lib/python3.11/site-packages/pyarrow/table.pxi:959, in pyarrow.lib.ChunkedArray.filter()

File ~/conda/envs/dev/lib/python3.11/site-packages/pyarrow/compute.py:264, in _make_generic_wrapper.<locals>.wrapper(memory_pool, options, *args, **kwargs)
    262 if args and isinstance(args[0], Expression):
    263     return Expression._call(func_name, list(args), options)
--> 264 return func.call(args, options, memory_pool)

File ~/conda/envs/dev/lib/python3.11/site-packages/pyarrow/_compute.pyx:385, in pyarrow._compute.Function.call()

File ~/conda/envs/dev/lib/python3.11/site-packages/pyarrow/error.pxi:155, in pyarrow.lib.pyarrow_internal_check_status()

File ~/conda/envs/dev/lib/python3.11/site-packages/pyarrow/error.pxi:92, in pyarrow.lib.check_status()

ArrowNotImplementedError: Function 'array_filter' has no kernel matching input types (string_view, bool)

Essentially, the repr tries to convert the array to numpy, and that conversion fails because pyarrow has no filter kernel implemented for string_view.

Some quick thoughts:

  • Ideally to_numpy() would work regardless of filter being implemented or not (although hopefully a next pyarrow release will support this)
  • The empty repr is quite confusing. Probably printing something like pandas.Series <exception occurred while creating the repr> would be more useful?

jorisvandenbossche avatar Sep 25 '24 14:09 jorisvandenbossche

  • Probably printing something like pandas.Series <exception occurred while creating the repr> would be more useful?

Makes sense for the series, but would this affect the repr when contained within a dataframe?

WillAyd avatar Sep 25 '24 14:09 WillAyd

@WillAyd - should the title be Arrow String View? Want to make sure I'm understanding the issue.

rhshadrach avatar Sep 30 '24 21:09 rhshadrach

I don't think so - binary view is the terminology used by the arrow specification, which generally covers what you may be thinking of as bytes and strings:

https://arrow.apache.org/docs/format/Columnar.html#variable-size-binary-view-layout

The same issue occurs with the binary_view pyarrow type as well

WillAyd avatar Sep 30 '24 22:09 WillAyd

I found a flavor of this that I've documented in #60068. It seems to be more general than the missing-values case: the ArrowDtype.type property accessor does not have a case to handle pa.string_view / pa.binary_view.

a10y avatar Oct 18 '24 14:10 a10y

If you just want a quick fix:

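import pandas as pd
import pyarrow as pa

# Accessing .type on the class returns the property object itself, which is
# not callable, so keep it and call its getter (fget) for the fallback.
old_ArrowDtype_type = pd.ArrowDtype.type

@property
def ArrowDtype_type(self):
    # Report str as the scalar type for string_view columns.
    if pa.types.is_string_view(self.pyarrow_dtype):
        return str
    return old_ArrowDtype_type.fget(self)

pd.ArrowDtype.type = ArrowDtype_type
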
In [1]: tab = pa.table({"names": pa.array(["a", "b", "c"], type=pa.string_view())})
   ...: 
   ...: tab.to_pandas(types_mapper=pd.ArrowDtype)
Out[1]:  
  names
0     a
1     b
2     c

danking avatar Oct 18 '24 14:10 danking