[Enh]: enable `sum_horizontal` to work for boolean (or non-numeric) values in pandas, pyarrow etc
Describe the bug
sum_horizontal works for boolean values only in polars and not in pandas and pyarrow.
Steps or code to reproduce the bug
For pyarrow
import pyarrow as pa
import narwhals as nw
df_pa = pa.table({"a":[True, True, False], "b":[False, False, False]})
df_nw = nw.from_native(df_pa)
nw.to_native(df_nw.select(nw.sum_horizontal(nw.all())))
>>>ArrowInvalid: Could not convert 0 with type int: tried to convert to boolean
For pandas
import pandas as pd
import narwhals as nw
df_pd = pd.DataFrame({"a":[True, True, False], "b":[False, False, False]})
df_nw = nw.from_native(df_pd)
nw.to_native(df_nw.select(nw.sum_horizontal(nw.all())))
a
0 True
1 True
2 False
Expected results
Just as it works in polars:
import polars as pl
import narwhals as nw
df_pl = pl.DataFrame({"a":[True, True, False], "b":[False, False, False]})
df_nw = nw.from_native(df_pl)
nw.to_native(df_nw.select(nw.sum_horizontal(nw.all())))
sum
u32
1
1
0
Actual results
Either error in pyarrow: >>>ArrowInvalid: Could not convert 0 with type int: tried to convert to boolean or this in pandas
a
0 True
1 True
2 False
Please run narwhals.show_version() and enter the output below.
System:
python: 3.10.12 (main, Sep 11 2024, 15:47:36) [GCC 11.4.0]
executable: /usr/bin/python3
machine: Linux-6.8.0-45-generic-x86_64-with-glibc2.35
Python dependencies:
narwhals: 1.9.0
pandas: 2.2.1
polars: 0.20.31
cudf:
modin:
pyarrow: 16.1.0
numpy: 1.26.4
Relevant log output
No response
I would aim to avoid implicit casting for the user and from our side add a warning that sum_horizontal is for numeric datatypes (e.g. in polars that works for strings as well, but not in narwhals).
The user can always explicitly cast to a numeric datatype before summing:
@nw.narwhalify
def func(df):
return df.select(
- nw.sum_horizontal(nw.all())
+ nw.sum_horizontal(nw.all().cast(nw.UInt8())) # Or whatever datatype
)
I would not have a problem with having a subset of polars in this particular case
Yeah, let's unify the output and add the warning. Anyways, I think it's a handy feature to sum booleans. Could you explain in more detail why would you like to avoid it (asking because I'm curious)?
Yeah, let's unify the output and add the warning. Anyways, I think it's a handy feature to sum booleans. Could you explain in more detail why would you like to avoid it (asking because I'm curious)?
Maybe it is simpler than expected and we could get away with it with:
def cast_if_boolean(s: PandasLikeSeries) -> PandasLikeSeries:
return (
s.cast(self._dtype.UInt8())
if s.dtype == self._dtype.Boolean()
else s
)
and then
def func(df: PandasLikeDataFrame) -> list[PandasLikeSeries]:
- series = (s.fill_null(0) for _expr in parsed_exprs for s in _expr._call(df))
+ series = (s for _expr in parsed_exprs for s in _expr._call(df))
+ series = (s.pipe(cast_if_boolean).fill_null(0) for s in series)
return [reduce(lambda x, y: x + y, series)]
Thoughts on this?
I think implicit casting is fine, so long as the rules are well-defined and documented and are applied consistently - for example, what happens when summing two Boolean series? (Can't check right now as I'm on my phone)
If you mean in polars, then nulls are propagated on series, but considered as zero in sum_horizontal - so I think the approach above is relevant:
import polars as pl
s1 = pl.Series([True, True, True, False, False, False, None, None, None])
s2 = pl.Series([True, False, None, True, False, None, True, False, None])
s1 + s2
shape: (9,)
Series: '' [u32]
[
2
1
null
1
0
null
null
null
null
]
df = pl.DataFrame({"s1": s1, "s2": s2})
df.select(pl.sum_horizontal("s1", "s2"))
shape: (9, 1)
┌─────┐
│ s1 │
│ --- │
│ u32 │
╞═════╡
│ 2 │
│ 1 │
│ 1 │
│ 1 │
│ 0 │
│ 0 │
│ 1 │
│ 0 │
│ 0 │
└─────┘
thanks! i think we should take our time with this one and have a broader discussion about upcasting in other operations, i'm not sure we're totally consistent at the moment
Pretty sure this would be the starting point for defining rules for pyarrow:
- https://github.com/narwhals-dev/narwhals/pull/2007#issuecomment-2660308933
I think implicit casting is fine, so long as the rules are well-defined and documented and are applied consistently
I'm +0 on the idea of implicit casting generally, but can see an argument for unifying on the polars type system.
If we're strictly talking about eager Series, then you could better reason with this if Series.dtype were generic:
https://github.com/narwhals-dev/narwhals/blob/825e027a72613396da974a456c977fc216a0132d/narwhals/series.py#L341-L343
As long as you can reveal the type like this: https://github.com/narwhals-dev/narwhals/blob/8d13198f3aca34e7e38f690da09e3466fbb29cc1/narwhals/_arrow/series.py#L275-L276
Then you can start defining methods that only operate over specfic types https://github.com/narwhals-dev/narwhals/blob/8d13198f3aca34e7e38f690da09e3466fbb29cc1/narwhals/_arrow/series.py#L270
or those that transform them: https://github.com/narwhals-dev/narwhals/blob/8d13198f3aca34e7e38f690da09e3466fbb29cc1/narwhals/_arrow/series.py#L199 https://github.com/narwhals-dev/narwhals/blob/8d13198f3aca34e7e38f690da09e3466fbb29cc1/narwhals/_arrow/series.py#L291-L293
I don't think this translates well to Expr though, which might complicate things