feat: Add support for `Expr.str.zfill`
This adds zfill for both Series and Expr
What type of PR is this? (check all applicable)
- [ ] 💾 Refactor
- [x] ✨ Feature
- [ ] 🐛 Bug Fix
- [ ] 🔧 Optimization
- [ ] 📝 Documentation
- [ ] ✅ Test
- [ ] 🐳 Other
Related issues
- Related issue #<issue number>
- Closes #2596
Checklist
- [ ] Code follows style guide (ruff)
- [ ] Tests added
- [ ] Documented the changes
If you have comments or can explain your changes, please do so below
Working through this:
check-api-reference....................................................................Failed
- hook id: check-api-reference
- exit code: 1
Series.str: not documented
{'zfill'}
Expr.str: not documented
{'zfill'}
ah yeah you need to add it to the expr_str.md / series_str.md pages
Thanks. On my way now. Will add some tests when I log on later
Is this a bug in pandas?
import pandas as pd
import pyarrow as pa
pd.Series(["A", "AB", "ABC"], dtype=pd.ArrowDtype(pa.string())).str.zfill(3)
Trackback:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/will/GitHub/various/narwhals/.venv/lib/python3.12/site-packages/pandas/core/strings/accessor.py", line 137, in wrapper
return func(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/will/GitHub/various/narwhals/.venv/lib/python3.12/site-packages/pandas/core/strings/accessor.py", line 1818, in zfill
result = self._data.array._str_map(f)
^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'ArrowExtensionArray' object has no attribute '_str_map'. Did you mean: '_str_pad'?
Works fine with pd.Series(["A", "AB", "ABC"], dtype="string[pyarrow]") though but that seems different dtype. Reference here
I've created a pandas issue here https://github.com/pandas-dev/pandas/issues/61485
How do y'all do the CASE WHEN. Are there any examples? Or maybe you have a different way of implementing in mind
How do y'all do the CASE WHEN. Are there any examples?
@williambdean sure there are a lot of cases and it differs from backend to backend - #2549 has examples for lazy backends
Or maybe you have a different way of implementing in mind
Are you referring to the trailing dash comment? If so, I guess it depends 😅 pandas already does that out of the box - I have no clue for the other backends though
In the cases with no implementation of zfill (which seems like a few), I was using lpad. I was thinking on aiming for
CASE WHEN starts_with(col, "+") THEN ... WHEN starts_with("-") THEN ... ELSE
what are your thoughts?
In the cases with no implementation of zfill (which seems like a few), I was using lpad. I was thinking on aiming for CASE WHEN starts_with(col, "+") THEN ... WHEN starts_with("-") THEN ... ELSE END
what are your thoughts?
Yes it seems very reasonable to do so 🙌🏼 - the log PR have a similar case of 2-3 conditions needed to be checked
Awesome. Thanks for the lead. Will look into that example
IMHO it's OK to xfail a pandas test here, it looks like the rest is mostly ready? we can always make a follow-up to work around the pandas bug if necessary
Sounds good. The case that I will finish with is the behavior with leading + or - for the implementations that are using lpad
I've added a test case for the leading +. zfill from python does "+1".zfill(3) = "+01". But it seems like the polars eager does something different
These are the current failures with the latest commits:
===================================================== short test summary info =====================================================
FAILED tests/expr_and_series/str/zfill_test.py::test_str_zfill_series[pyarrow] - AssertionError: Mismatch at index 0: 0-1 != -01
FAILED tests/expr_and_series/str/zfill_test.py::test_str_zfill[pyarrow] - AssertionError: Mismatch at index 0: 0-1 != -01
FAILED tests/expr_and_series/str/zfill_test.py::test_str_zfill[pandas[pyarrow]] - AttributeError: 'ArrowExtensionArray' object has no attribute '_str_map'. Did you mean: '_str_pad'?
FAILED tests/expr_and_series/str/zfill_test.py::test_str_zfill_series[polars[eager]] - AssertionError: Mismatch at index 1: 0+1 != +01
FAILED tests/expr_and_series/str/zfill_test.py::test_str_zfill[polars[eager]] - AssertionError: Mismatch at index 1: 0+1 != +01
FAILED tests/expr_and_series/str/zfill_test.py::test_str_zfill_series[pandas[pyarrow]] - AttributeError: 'ArrowExtensionArray' object has no attribute '_str_map'. Did you mean: '_str_pad'?
FAILED tests/expr_and_series/str/zfill_test.py::test_str_zfill[ibis] - AssertionError: Mismatch at index 0: 0-1 != -01
FAILED tests/expr_and_series/str/zfill_test.py::test_str_zfill[sqlframe] - AssertionError: Mismatch at index 0: 0-1 != -01
Which behavior is suppose to match?
Seeing some differences
[ins] In [4]: pd.Series(["+1", "++1"]).str.zfill(4)
Out[4]:
0 +001
1 +0+1
dtype: object
[ins] In [5]: import polars as pl
[ins] In [6]: pl.Series(["+1", "++1"]).str.zfill(4)
Out[6]:
shape: (2,)
Series: '' [str]
[
"00+1"
"0++1"
]
[ins] In [7]: "+1".zfill(4)
Out[7]: '+001'
[ins] In [8]: "++1".zfill(4)
Out[8]: '+0+1'
Probably good to match polars
well pandas and polars don't match 😆
I don't really understand the pandas output, but we can standardise on what polars does and note the pandas difference in the docstring? This feels like really edge case behavior anyway
I've adjusted the tests and the implementations to match polars behavior. I have a case for pandas.
I'm new to arrow and still working through that implmenetation. Open to pointers!
Thanks for the review. I will add those changes.
Do you know how to do the arrow implementation?
Arrow test failure
def zfill(self, width: int) -> ArrowSeries:
length = pc.utf8_length(self.native)
less_than_width = pc.less(length, width)
starts_with_minus = pc.equal(self.slice(0, 1).native, lit("-"))
result = pc.case_when(
[
(
> starts_with_minus & less_than_width,
pc.utf8_lpad(
self.slice(1, None).native, width - 2, padding="0"
).concat(lit("-")),
),
(less_than_width, pc.utf8_lpad(self.native, width=width, padding="0")),
],
self.native,
)
E TypeError: unsupported operand type(s) for &: 'pyarrow.lib.ChunkedArray' and 'pyarrow.lib.ChunkedArray'
narwhals/_arrow/series_str.py:72: TypeError
===================================================== short test summary info =====================================================
FAILED tests/expr_and_series/str/zfill_test.py::test_str_zfill_series[pyarrow] - TypeError: unsupported operand type(s) for &: 'pyarrow.lib.ChunkedArray' and 'pyarrow.lib.ChunkedArray'
FAILED tests/expr_and_series/str/zfill_test.py::test_str_zfill[pyarrow] - TypeError: unsupported operand type(s) for &: 'pyarrow.lib.ChunkedArray' and 'pyarrow.lib.ChunkedArray'
https://github.com/narwhals-dev/narwhals/pull/2598#issuecomment-2913161818
@williambdean hopefully (https://github.com/narwhals-dev/narwhals/pull/2598/commits/76373df554356ac3695eb1177dee18177d4fa6a3) will do the trick 🤞
The only example I found for pc.case_when was in these tests:
https://github.com/apache/arrow/blob/1cabcb32326ccb5e227e8a0574d6a81bcbde6f12/python/pyarrow/tests/test_compute.py#L3280-L3284
Thanks for the help @dangotbanned !
Thanks for the help @dangotbanned !
Happy to help, all yours 🙌
Ha, seems like the pandas behavior has even changed in the versions that are being tested? Confusing. https://github.com/narwhals-dev/narwhals/actions/runs/15283097790/job/42986670950?pr=2598
The edits you made worked for me locally, but seeing some pyarrow failures: https://github.com/narwhals-dev/narwhals/actions/runs/15283097790/job/42986670950?pr=2598
And would need to implement the dask behavior as well: https://github.com/narwhals-dev/narwhals/actions/runs/15283097790/job/42986670953?pr=2598
EDIT: push up dask implementation in 86d9a599
Ha, seems like the pandas behavior has even changed in the versions that are being tested?
Is that just for old versions? If so, fine to skip the test for old versions of pandas IMO. Else, if you feel like it, reimplement zfill like you did for other backends with pad (so long as you just use .str methods, and not apply / map)
It seems to just be the https://github.com/narwhals-dev/narwhals/actions/workflows/extremes.yml workflow now. Which gets caught in the random runs as well:
https://github.com/narwhals-dev/narwhals/actions/runs/15283218688/job/42988106594
i'm not too bothered about what happens for '++1', but for '+1' at least i think the Polars output doesn't look correct
ok this is marked as "accepted" in Polars https://github.com/pola-rs/polars/issues/22983
So I'd say:
- for
'+1'.zfill(3), the output should be'+01'. We may need to xfail for Polars for now - for
'++1'.zfill(4), i'd say that the behaviour is unspecified and we shouldn't worry about what happens
sorry for contradictory instructions from me @williambdean , my bad here
Fixed in Polars now https://github.com/pola-rs/polars/pull/22985#event-17849747949
Thanks for the eyes and work here. Let me look this over and make the edits
is there an example of xfail for a specific constructor or do I just go off the names like I have been doing?
yeah plenty of examples in the tests, if you search requests.applymarker(pytest.mark.xfail) you should find a few
ok this is marked as "accepted" in Polars pola-rs/polars#22983
So I'd say:
* for `'+1'.zfill(3)`, the output should be `'+01'`. We may need to xfail for Polars for now
Would it make sense to provide compatibility for earlier polars versions?
There are other methods we've got that for, and this PR has a lot of examples of what that could look like from the other backends currently