cudf Support `cudf-polars` `str.zfill`

Closes https://github.com/rapidsai/cudf/issues/19035

I believe this needs https://github.com/pola-rs/polars/pull/22985 to pass the one remaining failing test and the column overload described here https://github.com/rapidsai/cudf/issues/19035#issuecomment-2937332033

Jun 04 '25 01:06 brandon-b-miller

I assume that this PR will updated to leverage https://github.com/rapidsai/cudf/pull/19090 once that is merged?

Jun 12 '25 01:06 vyasr

Sorry @mroeschke , just getting back to this now. I've fixed up the PR which should now be passing tests, and responded to the open question above.

Jul 11 '25 18:07 brandon-b-miller

Looks like py-polars/tests/unit/operations/namespaces/string/test_pad.py::test_str_zfill_unicode_not_respected may need to be added to the expected failures list in cudf/python/cudf_polars/cudf_polars/testing/plugin.py. Apparently Polars doesn't add zeros for unicode characters.

Jul 11 '25 23:07 mroeschke

Polars doesn't add zeros for unicode characters.

Should this be a polars issue? Who is right?

Jul 15 '25 13:07 brandon-b-miller

Should this be a polars issue? Who is right?

Given that Polars has a dedicated test checking that zeros aren't padded on unicode, I suppose they are "right" and ideally libcudf wouldn't pad unicode characters

Jul 15 '25 20:07 mroeschke

Should this be a polars issue? Who is right?

Given that Polars has a dedicated test checking that zeros aren't padded on unicode, I suppose they are "right" and ideally libcudf wouldn't pad unicode characters

It may be this piece of the docs:

This method is intended for padding numeric strings. If your data contains non-ASCII characters, use [pad_start()](https://docs.pola.rs/api/python/stable/reference/series/api/polars.Series.str.pad_start.html#polars.Series.str.pad_start) instead.

Ah. I wonder if we need to do something beyond xfailing here? It seems like in the wild users would be mostly zfilling numeric data but it's still dangerous to have a behavior divergence with the GPU backend. There aren't a lot of great options though. Off the top of my head we could

extend the libcudf zfill api to ignore unicode optionally
scan the data unnecessarily in 99% of cases to make us safe from the one edge case
Document the issue somewhere

None of the above seem particularly amazing. I'll think on what to do in lieu of any obvious ideas I missed.

Jul 15 '25 22:07 brandon-b-miller

I wonder if we need to do something beyond xfailing here?

Ah OK so I suppose the "proper" solution could use pylibcudf.strings.convert.convert_fixed_point.is_fixed_point to use with copy_if_else to used the zfilled values if the values are numeric else the old values.

Jul 15 '25 22:07 mroeschke

Ah OK so I suppose the "proper" solution could use pylibcudf.strings.convert.convert_fixed_point.is_fixed_point to use with copy_if_else to used the zfilled values if the values are numeric else the old values.

Thanks- this sounds like a winner 😄

Jul 15 '25 22:07 brandon-b-miller

It looks like polars wants to pad strings like abc as well as numeric, so we can't use is_fixed_point. But I think I might see a way forward with all_characters_of_type, will try it out.

Jul 18 '25 01:07 brandon-b-miller

after playing around with this a bit I can't think of a way out of this without throwing for the cases where the APIs diverge. AFAICT they'll only definitely agree for a certain subset of characters in the source data.

@davidwendt whats the fastest way of checking if a column of strings contains only alphanumeric characters, spaces, or empty strings?

Jul 21 '25 15:07 brandon-b-miller

hats the fastest way of checking if a column of strings contains only alphanumeric characters, spaces, or empty strings?

Maybe cudf::strings::contains_re and then cudf::reduce

Jul 22 '25 12:07 Matt711

... whats the fastest way of checking if a column of strings contains only alphanumeric characters, spaces, or empty strings?

I believe this API would cover everything except empty strings https://docs.rapids.ai/api/cudf/stable/pylibcudf/api_docs/strings/char_types/#pylibcudf.strings.char_types.all_characters_of_type

Jul 23 '25 17:07 davidwendt

Added some validation and conditional xfailing based on the polars version here. We now guard for non-ascii characters in the simplest way I can think of. It's not ideal but I think we should move forward and reassess if perf concerns ever surface.

Aug 05 '25 16:08 brandon-b-miller

/merge

Aug 05 '25 21:08 mroeschke