Interchange `Column.null_count` is a NumPy scalar, not a builtin `int`

Open honno opened this issue 2 years ago • 1 comment

A PandasProtocolColumn returns its null_count as a NumPy scalar (a 0-d integer), not as the builtin int that the interchange protocol specifies.

>>> from modin import pandas as mpd
>>> df = mpd.DataFrame({"foo": [42]})
>>> interchange_df = df.__dataframe__()
>>> interchange_col = interchange_df.get_column_by_name("foo")
>>> interchange_col.null_count
0
>>> type(interchange_col.null_count)
numpy.int64  # should be Python's int

This seems to be because the null_count implementation uses DataFrame.squeeze(), which returns a NumPy scalar rather than an int.

https://github.com/modin-project/modin/blob/9b33451648a3192e93c46ac6961627ed2858c7fd/modin/core/dataframe/pandas/exchange/dataframe_protocol/column.py#L222-L245
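A minimal sketch of the suspected cause and one possible fix, using plain pandas rather than Modin so it runs without a Modin backend. The 1x1 frame `counts` and the helper `null_count_as_int` are hypothetical stand-ins for illustration, not Modin's actual code; the only claim is that squeezing a single-cell frame yields a NumPy scalar, and that wrapping it in `int()` gives the builtin type the protocol asks for.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the reduced per-column null count:
# a 1x1 DataFrame holding the number of nulls in column "foo".
counts = pd.DataFrame({"foo": [0]})

# squeeze() collapses the single cell to a scalar, but it is a NumPy
# integer scalar, not a builtin int.
squeezed = counts.squeeze()


def null_count_as_int(reduced: pd.DataFrame) -> int:
    """Hypothetical helper: squeeze a 1x1 frame and return a builtin int."""
    return int(reduced.squeeze())


fixed = null_count_as_int(counts)

print(type(squeezed))  # a NumPy integer type
print(type(fixed))     # builtin int
```

Converting at the point where the value is cached would keep later reads of `null_count` consistent, though whether that is the right place in Modin's implementation is an assumption here.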

Related https://github.com/pandas-dev/pandas/issues/47789

honno, Jul 20 '22 10:07

@honno thank you for reporting the issue. I can reproduce it with your code at 5af9832d7fad3d17f05d63908bc377e61542d953.

mvashishtha, Jul 20 '22 14:07