narwhals icon indicating copy to clipboard operation
narwhals copied to clipboard

patch: `sum_horizontal` ignore nulls

Open FBruzzesi opened this issue 1 year ago • 2 comments

What type of PR is this? (check all applicable)

  • [ ] 💾 Refactor
  • [ ] ✨ Feature
  • [x] 🐛 Bug Fix
  • [ ] 🔧 Optimization
  • [ ] 📝 Documentation
  • [ ] ✅ Test
  • [ ] 🐳 Other

Related issues

  • Closes #822

Checklist

  • [x] Code follows style guide (ruff)
  • [x] Tests added
  • [ ] Documented the changes

If you have comments or can explain your changes, please do so below.

Not sure how to handle non numeric 🤔

In horizontal sum, polars concatenates strings and fill nulls with empty strings, but also concatenates strings with numeric:

df = pl.DataFrame({
    "a": [1, 8, 3],
   "c": ["x", "y", "z"],
})

df.select(pl.sum_horizontal("a", "c"))
shape: (3, 1)
┌─────┐
│ a   │
│ --- │
│ str │
╞═════╡
│ 1x  │
│ 8y  │
│ 3z  │
└─────┘

To extract the dtype in narwhals, we would need to evaluate the expr, or am I missing something trivial?

FBruzzesi avatar Aug 20 '24 15:08 FBruzzesi

Not sure how to handle non numeric 🤔

probably fine to ignore them for now? or even, ignore them completely. we could be a bit stricter than Polars here, and guide people towards https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.concat_str.html to concatenate strings

MarcoGorelli avatar Aug 22 '24 10:08 MarcoGorelli

probably fine to ignore them for now? or even, ignore them completely. we could be a bit stricter than Polars here, and guide people towards https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.concat_str.html to concatenate strings

Oh neat, then maybe we want to support concat_str first. As the request for null values came from me, I don't feel like we are in a rush to merge this.

FBruzzesi avatar Aug 22 '24 10:08 FBruzzesi

then maybe we want to support concat_str first

Well it turns out that I am facing one of the very painful pandas issue:

import pandas as pd

df = pd.DataFrame({"c1": [1,2,3], "c2": ["x", "y", None], "c3": list("abc")})

df["c2"] + df["c3"]
0     xa
1     yb
2    NaN
dtype: object

but to replicate polars behavior I was casting to string first, and we need nullable string type, otherwise:

df["c1"].astype(str) + df["c2"].astype(str)
0       1x
1       2y
2    3None
dtype: object

I would imagine same issue will present with dask

FBruzzesi avatar Aug 23 '24 12:08 FBruzzesi

This would official be a regression. I searched on github if anyone is using the feature on a public repo and ended up empty handed, which is a good sign.

I also added a warning in the docs as the following:

image

I am moving it to ready for review, but very happy to see if anyone has ideas on how to deal in case of strings

FBruzzesi avatar Aug 23 '24 12:08 FBruzzesi