narwhals icon indicating copy to clipboard operation
narwhals copied to clipboard

feat: Add support for `Expr.str.zfill`

Open williambdean opened this issue 7 months ago • 31 comments

This adds zfill for both Series and Expr

What type of PR is this? (check all applicable)

  • [ ] 💾 Refactor
  • [x] ✨ Feature
  • [ ] 🐛 Bug Fix
  • [ ] 🔧 Optimization
  • [ ] 📝 Documentation
  • [ ] ✅ Test
  • [ ] 🐳 Other

Related issues

  • Related issue #<issue number>
  • Closes #2596

Checklist

  • [ ] Code follows style guide (ruff)
  • [ ] Tests added
  • [ ] Documented the changes

If you have comments or can explain your changes, please do so below

williambdean avatar May 23 '25 11:05 williambdean

Working through this:

check-api-reference....................................................................Failed
- hook id: check-api-reference
- exit code: 1

Series.str: not documented
{'zfill'}
Expr.str: not documented
{'zfill'}

williambdean avatar May 23 '25 11:05 williambdean

ah yeah you need to add it to the expr_str.md / series_str.md pages

MarcoGorelli avatar May 23 '25 11:05 MarcoGorelli

Thanks. On my way now. Will add some tests when I log on later

williambdean avatar May 23 '25 15:05 williambdean

Is this a bug in pandas?

import pandas as pd
import pyarrow as pa

pd.Series(["A", "AB", "ABC"], dtype=pd.ArrowDtype(pa.string())).str.zfill(3)

Trackback:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/will/GitHub/various/narwhals/.venv/lib/python3.12/site-packages/pandas/core/strings/accessor.py", line 137, in wrapper
    return func(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/will/GitHub/various/narwhals/.venv/lib/python3.12/site-packages/pandas/core/strings/accessor.py", line 1818, in zfill
    result = self._data.array._str_map(f)
             ^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'ArrowExtensionArray' object has no attribute '_str_map'. Did you mean: '_str_pad'?

Works fine with pd.Series(["A", "AB", "ABC"], dtype="string[pyarrow]") though but that seems different dtype. Reference here

williambdean avatar May 23 '25 17:05 williambdean

I've created a pandas issue here https://github.com/pandas-dev/pandas/issues/61485

williambdean avatar May 23 '25 20:05 williambdean

How do y'all do the CASE WHEN. Are there any examples? Or maybe you have a different way of implementing in mind

williambdean avatar May 23 '25 20:05 williambdean

How do y'all do the CASE WHEN. Are there any examples?

@williambdean sure there are a lot of cases and it differs from backend to backend - #2549 has examples for lazy backends

Or maybe you have a different way of implementing in mind

Are you referring to the trailing dash comment? If so, I guess it depends 😅 pandas already does that out of the box - I have no clue for the other backends though

FBruzzesi avatar May 23 '25 20:05 FBruzzesi

In the cases with no implementation of zfill (which seems like a few), I was using lpad. I was thinking on aiming for CASE WHEN starts_with(col, "+") THEN ... WHEN starts_with("-") THEN ... ELSE END

what are your thoughts?

williambdean avatar May 23 '25 20:05 williambdean

In the cases with no implementation of zfill (which seems like a few), I was using lpad. I was thinking on aiming for CASE WHEN starts_with(col, "+") THEN ... WHEN starts_with("-") THEN ... ELSE END

what are your thoughts?

Yes it seems very reasonable to do so 🙌🏼 - the log PR have a similar case of 2-3 conditions needed to be checked

FBruzzesi avatar May 23 '25 20:05 FBruzzesi

Awesome. Thanks for the lead. Will look into that example

williambdean avatar May 23 '25 21:05 williambdean

IMHO it's OK to xfail a pandas test here, it looks like the rest is mostly ready? we can always make a follow-up to work around the pandas bug if necessary

MarcoGorelli avatar May 24 '25 12:05 MarcoGorelli

Sounds good. The case that I will finish with is the behavior with leading + or - for the implementations that are using lpad

williambdean avatar May 24 '25 12:05 williambdean

I've added a test case for the leading +. zfill from python does "+1".zfill(3) = "+01". But it seems like the polars eager does something different

These are the current failures with the latest commits:


===================================================== short test summary info =====================================================
FAILED tests/expr_and_series/str/zfill_test.py::test_str_zfill_series[pyarrow] - AssertionError: Mismatch at index 0: 0-1 != -01
FAILED tests/expr_and_series/str/zfill_test.py::test_str_zfill[pyarrow] - AssertionError: Mismatch at index 0: 0-1 != -01
FAILED tests/expr_and_series/str/zfill_test.py::test_str_zfill[pandas[pyarrow]] - AttributeError: 'ArrowExtensionArray' object has no attribute '_str_map'. Did you mean: '_str_pad'?
FAILED tests/expr_and_series/str/zfill_test.py::test_str_zfill_series[polars[eager]] - AssertionError: Mismatch at index 1: 0+1 != +01
FAILED tests/expr_and_series/str/zfill_test.py::test_str_zfill[polars[eager]] - AssertionError: Mismatch at index 1: 0+1 != +01
FAILED tests/expr_and_series/str/zfill_test.py::test_str_zfill_series[pandas[pyarrow]] - AttributeError: 'ArrowExtensionArray' object has no attribute '_str_map'. Did you mean: '_str_pad'?
FAILED tests/expr_and_series/str/zfill_test.py::test_str_zfill[ibis] - AssertionError: Mismatch at index 0: 0-1 != -01
FAILED tests/expr_and_series/str/zfill_test.py::test_str_zfill[sqlframe] - AssertionError: Mismatch at index 0: 0-1 != -01

Which behavior is suppose to match?

Seeing some differences

[ins] In [4]: pd.Series(["+1", "++1"]).str.zfill(4)
Out[4]:
0    +001
1    +0+1
dtype: object

[ins] In [5]: import polars as pl

[ins] In [6]: pl.Series(["+1", "++1"]).str.zfill(4)
Out[6]:
shape: (2,)
Series: '' [str]
[
        "00+1"
        "0++1"
]

[ins] In [7]: "+1".zfill(4)
Out[7]: '+001'

[ins] In [8]: "++1".zfill(4)
Out[8]: '+0+1'

williambdean avatar May 24 '25 15:05 williambdean

Probably good to match polars

MarcoGorelli avatar May 24 '25 16:05 MarcoGorelli

well pandas and polars don't match 😆

williambdean avatar May 24 '25 16:05 williambdean

I don't really understand the pandas output, but we can standardise on what polars does and note the pandas difference in the docstring? This feels like really edge case behavior anyway

MarcoGorelli avatar May 24 '25 16:05 MarcoGorelli

I've adjusted the tests and the implementations to match polars behavior. I have a case for pandas.

I'm new to arrow and still working through that implmenetation. Open to pointers!

williambdean avatar May 26 '25 15:05 williambdean

Thanks for the review. I will add those changes.

Do you know how to do the arrow implementation?

Arrow test failure
    def zfill(self, width: int) -> ArrowSeries:
        length = pc.utf8_length(self.native)
        less_than_width = pc.less(length, width)
        starts_with_minus = pc.equal(self.slice(0, 1).native, lit("-"))

        result = pc.case_when(
            [
                (
>                   starts_with_minus & less_than_width,
                    pc.utf8_lpad(
                        self.slice(1, None).native, width - 2, padding="0"
                    ).concat(lit("-")),
                ),
                (less_than_width, pc.utf8_lpad(self.native, width=width, padding="0")),
            ],
            self.native,
        )
E       TypeError: unsupported operand type(s) for &: 'pyarrow.lib.ChunkedArray' and 'pyarrow.lib.ChunkedArray'

narwhals/_arrow/series_str.py:72: TypeError
===================================================== short test summary info =====================================================
FAILED tests/expr_and_series/str/zfill_test.py::test_str_zfill_series[pyarrow] - TypeError: unsupported operand type(s) for &: 'pyarrow.lib.ChunkedArray' and 'pyarrow.lib.ChunkedArray'
FAILED tests/expr_and_series/str/zfill_test.py::test_str_zfill[pyarrow] - TypeError: unsupported operand type(s) for &: 'pyarrow.lib.ChunkedArray' and 'pyarrow.lib.ChunkedArray'

williambdean avatar May 27 '25 16:05 williambdean

https://github.com/narwhals-dev/narwhals/pull/2598#issuecomment-2913161818

@williambdean hopefully (https://github.com/narwhals-dev/narwhals/pull/2598/commits/76373df554356ac3695eb1177dee18177d4fa6a3) will do the trick 🤞

The only example I found for pc.case_when was in these tests: https://github.com/apache/arrow/blob/1cabcb32326ccb5e227e8a0574d6a81bcbde6f12/python/pyarrow/tests/test_compute.py#L3280-L3284

dangotbanned avatar May 27 '25 18:05 dangotbanned

Thanks for the help @dangotbanned !

williambdean avatar May 27 '25 18:05 williambdean

Thanks for the help @dangotbanned !

Happy to help, all yours 🙌

dangotbanned avatar May 27 '25 18:05 dangotbanned

Ha, seems like the pandas behavior has even changed in the versions that are being tested? Confusing. https://github.com/narwhals-dev/narwhals/actions/runs/15283097790/job/42986670950?pr=2598

The edits you made worked for me locally, but seeing some pyarrow failures: https://github.com/narwhals-dev/narwhals/actions/runs/15283097790/job/42986670950?pr=2598

And would need to implement the dask behavior as well: https://github.com/narwhals-dev/narwhals/actions/runs/15283097790/job/42986670953?pr=2598

EDIT: push up dask implementation in 86d9a599

williambdean avatar May 27 '25 18:05 williambdean

Ha, seems like the pandas behavior has even changed in the versions that are being tested?

Is that just for old versions? If so, fine to skip the test for old versions of pandas IMO. Else, if you feel like it, reimplement zfill like you did for other backends with pad (so long as you just use .str methods, and not apply / map)

MarcoGorelli avatar May 27 '25 18:05 MarcoGorelli

It seems to just be the https://github.com/narwhals-dev/narwhals/actions/workflows/extremes.yml workflow now. Which gets caught in the random runs as well:

https://github.com/narwhals-dev/narwhals/actions/runs/15283218688/job/42988106594

williambdean avatar May 27 '25 23:05 williambdean

i'm not too bothered about what happens for '++1', but for '+1' at least i think the Polars output doesn't look correct

MarcoGorelli avatar May 28 '25 09:05 MarcoGorelli

ok this is marked as "accepted" in Polars https://github.com/pola-rs/polars/issues/22983

So I'd say:

  • for '+1'.zfill(3), the output should be '+01'. We may need to xfail for Polars for now
  • for '++1'.zfill(4), i'd say that the behaviour is unspecified and we shouldn't worry about what happens

sorry for contradictory instructions from me @williambdean , my bad here

MarcoGorelli avatar May 28 '25 10:05 MarcoGorelli

Fixed in Polars now https://github.com/pola-rs/polars/pull/22985#event-17849747949

MarcoGorelli avatar May 28 '25 12:05 MarcoGorelli

Thanks for the eyes and work here. Let me look this over and make the edits

is there an example of xfail for a specific constructor or do I just go off the names like I have been doing?

williambdean avatar May 28 '25 14:05 williambdean

yeah plenty of examples in the tests, if you search requests.applymarker(pytest.mark.xfail) you should find a few

MarcoGorelli avatar May 28 '25 14:05 MarcoGorelli

ok this is marked as "accepted" in Polars pola-rs/polars#22983

So I'd say:

* for `'+1'.zfill(3)`, the output should be `'+01'`. We may need to xfail for Polars for now

Would it make sense to provide compatibility for earlier polars versions? There are other methods we've got that for, and this PR has a lot of examples of what that could look like from the other backends currently

dangotbanned avatar May 28 '25 18:05 dangotbanned