patch: `sum_horizontal` ignore nulls
What type of PR is this? (check all applicable)
- [ ] 💾 Refactor
- [ ] ✨ Feature
- [x] 🐛 Bug Fix
- [ ] 🔧 Optimization
- [ ] 📝 Documentation
- [ ] ✅ Test
- [ ] 🐳 Other
Related issues
- Closes #822
Checklist
- [x] Code follows style guide (ruff)
- [x] Tests added
- [ ] Documented the changes
If you have comments or can explain your changes, please do so below.
Not sure how to handle non numeric 🤔
In horizontal sum, polars concatenates strings and fill nulls with empty strings, but also concatenates strings with numeric:
df = pl.DataFrame({
"a": [1, 8, 3],
"c": ["x", "y", "z"],
})
df.select(pl.sum_horizontal("a", "c"))
shape: (3, 1)
┌─────┐
│ a │
│ --- │
│ str │
╞═════╡
│ 1x │
│ 8y │
│ 3z │
└─────┘
To extract the dtype in narwhals, we would need to evaluate the expr, or am I missing something trivial?
Not sure how to handle non numeric 🤔
probably fine to ignore them for now? or even, ignore them completely. we could be a bit stricter than Polars here, and guide people towards https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.concat_str.html to concatenate strings
probably fine to ignore them for now? or even, ignore them completely. we could be a bit stricter than Polars here, and guide people towards https://docs.pola.rs/api/python/stable/reference/expressions/api/polars.concat_str.html to concatenate strings
Oh neat, then maybe we want to support concat_str first. As the request for null values came from me, I don't feel like we are in a rush to merge this.
then maybe we want to support
concat_strfirst
Well it turns out that I am facing one of the very painful pandas issue:
import pandas as pd
df = pd.DataFrame({"c1": [1,2,3], "c2": ["x", "y", None], "c3": list("abc")})
df["c2"] + df["c3"]
0 xa
1 yb
2 NaN
dtype: object
but to replicate polars behavior I was casting to string first, and we need nullable string type, otherwise:
df["c1"].astype(str) + df["c2"].astype(str)
0 1x
1 2y
2 3None
dtype: object
I would imagine same issue will present with dask
This would official be a regression. I searched on github if anyone is using the feature on a public repo and ended up empty handed, which is a good sign.
I also added a warning in the docs as the following:
I am moving it to ready for review, but very happy to see if anyone has ideas on how to deal in case of strings