narwhals icon indicating copy to clipboard operation
narwhals copied to clipboard

[Enh]: enable `sum_horizontal` to work for boolean (or non-numeric) values in pandas, pyarrow etc

Open anopsy opened this issue 1 year ago • 3 comments

Describe the bug

sum_horizontal works for boolean values only in polars and not in pandas and pyarrow.

Steps or code to reproduce the bug

For pyarrow

import pyarrow as pa
import narwhals as nw
df_pa = pa.table({"a":[True, True, False], "b":[False, False, False]})
df_nw = nw.from_native(df_pa)
nw.to_native(df_nw.select(nw.sum_horizontal(nw.all())))

>>>ArrowInvalid: Could not convert 0 with type int: tried to convert to boolean

For pandas

import pandas as pd
import narwhals as nw
df_pd = pd.DataFrame({"a":[True, True, False], "b":[False, False, False]})
df_nw = nw.from_native(df_pd)
nw.to_native(df_nw.select(nw.sum_horizontal(nw.all())))
	a
0	True
1	True
2	False

Expected results

Just as it works in polars:

import polars as pl
import narwhals as nw
df_pl = pl.DataFrame({"a":[True, True, False], "b":[False, False, False]})
df_nw = nw.from_native(df_pl)
nw.to_native(df_nw.select(nw.sum_horizontal(nw.all())))
sum
u32
1
1
0

Actual results

Either error in pyarrow: >>>ArrowInvalid: Could not convert 0 with type int: tried to convert to boolean or this in pandas

	a
0	True
1	True
2	False

Please run narwhals.show_version() and enter the output below.

System:
    python: 3.10.12 (main, Sep 11 2024, 15:47:36) [GCC 11.4.0]
executable: /usr/bin/python3
   machine: Linux-6.8.0-45-generic-x86_64-with-glibc2.35

Python dependencies:
     narwhals: 1.9.0
       pandas: 2.2.1
       polars: 0.20.31
         cudf: 
        modin: 
      pyarrow: 16.1.0
        numpy: 1.26.4

Relevant log output

No response

anopsy avatar Oct 04 '24 15:10 anopsy

I would aim to avoid implicit casting for the user and from our side add a warning that sum_horizontal is for numeric datatypes (e.g. in polars that works for strings as well, but not in narwhals).

The user can always explicitly cast to a numeric datatype before summing:

@nw.narwhalify
def func(df):
    return df.select(
-         nw.sum_horizontal(nw.all())
+         nw.sum_horizontal(nw.all().cast(nw.UInt8()))  # Or whatever datatype
    )

I would not have a problem with having a subset of polars in this particular case

FBruzzesi avatar Oct 04 '24 19:10 FBruzzesi

Yeah, let's unify the output and add the warning. Anyways, I think it's a handy feature to sum booleans. Could you explain in more detail why would you like to avoid it (asking because I'm curious)?

anopsy avatar Oct 05 '24 10:10 anopsy

Yeah, let's unify the output and add the warning. Anyways, I think it's a handy feature to sum booleans. Could you explain in more detail why would you like to avoid it (asking because I'm curious)?

Maybe it is simpler than expected and we could get away with it with:

def cast_if_boolean(s: PandasLikeSeries) -> PandasLikeSeries:
    return (
        s.cast(self._dtype.UInt8())
        if s.dtype == self._dtype.Boolean()
        else s
    )

and then

def func(df: PandasLikeDataFrame) -> list[PandasLikeSeries]:
-    series = (s.fill_null(0) for _expr in parsed_exprs for s in _expr._call(df))
+    series = (s for _expr in parsed_exprs for s in _expr._call(df))
+    series = (s.pipe(cast_if_boolean).fill_null(0) for s in series)
    return [reduce(lambda x, y: x + y, series)]

Thoughts on this?

FBruzzesi avatar Oct 05 '24 10:10 FBruzzesi

I think implicit casting is fine, so long as the rules are well-defined and documented and are applied consistently - for example, what happens when summing two Boolean series? (Can't check right now as I'm on my phone)

MarcoGorelli avatar Oct 15 '24 12:10 MarcoGorelli

If you mean in polars, then nulls are propagated on series, but considered as zero in sum_horizontal - so I think the approach above is relevant:

import polars as pl

s1 = pl.Series([True, True, True, False, False, False,  None, None, None])
s2 = pl.Series([True, False, None, True, False, None, True, False, None])

s1 + s2

shape: (9,)
Series: '' [u32]
[
        2
        1
        null
        1
        0
        null
        null
        null
        null
]

df = pl.DataFrame({"s1": s1, "s2": s2})
df.select(pl.sum_horizontal("s1", "s2"))

shape: (9, 1)
┌─────┐
│ s1  │
│ --- │
│ u32 │
╞═════╡
│ 2   │
│ 1   │
│ 1   │
│ 1   │
│ 0   │
│ 0   │
│ 1   │
│ 0   │
│ 0   │
└─────┘

FBruzzesi avatar Oct 15 '24 13:10 FBruzzesi

thanks! i think we should take our time with this one and have a broader discussion about upcasting in other operations, i'm not sure we're totally consistent at the moment

MarcoGorelli avatar Oct 16 '24 11:10 MarcoGorelli

Pretty sure this would be the starting point for defining rules for pyarrow:

  • https://github.com/narwhals-dev/narwhals/pull/2007#issuecomment-2660308933

I think implicit casting is fine, so long as the rules are well-defined and documented and are applied consistently

I'm +0 on the idea of implicit casting generally, but can see an argument for unifying on the polars type system.

If we're strictly talking about eager Series, then you could better reason with this if Series.dtype were generic: https://github.com/narwhals-dev/narwhals/blob/825e027a72613396da974a456c977fc216a0132d/narwhals/series.py#L341-L343

As long as you can reveal the type like this: https://github.com/narwhals-dev/narwhals/blob/8d13198f3aca34e7e38f690da09e3466fbb29cc1/narwhals/_arrow/series.py#L275-L276

Then you can start defining methods that only operate over specfic types https://github.com/narwhals-dev/narwhals/blob/8d13198f3aca34e7e38f690da09e3466fbb29cc1/narwhals/_arrow/series.py#L270

or those that transform them: https://github.com/narwhals-dev/narwhals/blob/8d13198f3aca34e7e38f690da09e3466fbb29cc1/narwhals/_arrow/series.py#L199 https://github.com/narwhals-dev/narwhals/blob/8d13198f3aca34e7e38f690da09e3466fbb29cc1/narwhals/_arrow/series.py#L291-L293


I don't think this translates well to Expr though, which might complicate things

dangotbanned avatar Feb 17 '25 10:02 dangotbanned