cudf icon indicating copy to clipboard operation
cudf copied to clipboard

Support `cudf-polars` `str.zfill`

Open brandon-b-miller opened this issue 6 months ago • 1 comments

Closes https://github.com/rapidsai/cudf/issues/19035

I believe this needs https://github.com/pola-rs/polars/pull/22985 to pass the one remaining failing test and the column overload described here https://github.com/rapidsai/cudf/issues/19035#issuecomment-2937332033

brandon-b-miller avatar Jun 04 '25 01:06 brandon-b-miller

I assume that this PR will updated to leverage https://github.com/rapidsai/cudf/pull/19090 once that is merged?

vyasr avatar Jun 12 '25 01:06 vyasr

Sorry @mroeschke , just getting back to this now. I've fixed up the PR which should now be passing tests, and responded to the open question above.

brandon-b-miller avatar Jul 11 '25 18:07 brandon-b-miller

Looks like py-polars/tests/unit/operations/namespaces/string/test_pad.py::test_str_zfill_unicode_not_respected may need to be added to the expected failures list in cudf/python/cudf_polars/cudf_polars/testing/plugin.py. Apparently Polars doesn't add zeros for unicode characters.

mroeschke avatar Jul 11 '25 23:07 mroeschke

Polars doesn't add zeros for unicode characters.

Should this be a polars issue? Who is right?

brandon-b-miller avatar Jul 15 '25 13:07 brandon-b-miller

Should this be a polars issue? Who is right?

Given that Polars has a dedicated test checking that zeros aren't padded on unicode, I suppose they are "right" and ideally libcudf wouldn't pad unicode characters

mroeschke avatar Jul 15 '25 20:07 mroeschke

Should this be a polars issue? Who is right?

Given that Polars has a dedicated test checking that zeros aren't padded on unicode, I suppose they are "right" and ideally libcudf wouldn't pad unicode characters

It may be this piece of the docs:

This method is intended for padding numeric strings. If your data contains non-ASCII characters, use [pad_start()](https://docs.pola.rs/api/python/stable/reference/series/api/polars.Series.str.pad_start.html#polars.Series.str.pad_start) instead.

Ah. I wonder if we need to do something beyond xfailing here? It seems like in the wild users would be mostly zfilling numeric data but it's still dangerous to have a behavior divergence with the GPU backend. There aren't a lot of great options though. Off the top of my head we could

  • extend the libcudf zfill api to ignore unicode optionally
  • scan the data unnecessarily in 99% of cases to make us safe from the one edge case
  • Document the issue somewhere

None of the above seem particularly amazing. I'll think on what to do in lieu of any obvious ideas I missed.

brandon-b-miller avatar Jul 15 '25 22:07 brandon-b-miller

I wonder if we need to do something beyond xfailing here?

Ah OK so I suppose the "proper" solution could use pylibcudf.strings.convert.convert_fixed_point.is_fixed_point to use with copy_if_else to used the zfilled values if the values are numeric else the old values.

mroeschke avatar Jul 15 '25 22:07 mroeschke

Ah OK so I suppose the "proper" solution could use pylibcudf.strings.convert.convert_fixed_point.is_fixed_point to use with copy_if_else to used the zfilled values if the values are numeric else the old values.

Thanks- this sounds like a winner 😄

brandon-b-miller avatar Jul 15 '25 22:07 brandon-b-miller

It looks like polars wants to pad strings like abc as well as numeric, so we can't use is_fixed_point. But I think I might see a way forward with all_characters_of_type, will try it out.

brandon-b-miller avatar Jul 18 '25 01:07 brandon-b-miller

after playing around with this a bit I can't think of a way out of this without throwing for the cases where the APIs diverge. AFAICT they'll only definitely agree for a certain subset of characters in the source data.

@davidwendt whats the fastest way of checking if a column of strings contains only alphanumeric characters, spaces, or empty strings?

brandon-b-miller avatar Jul 21 '25 15:07 brandon-b-miller

hats the fastest way of checking if a column of strings contains only alphanumeric characters, spaces, or empty strings?

Maybe cudf::strings::contains_re and then cudf::reduce

Matt711 avatar Jul 22 '25 12:07 Matt711

... whats the fastest way of checking if a column of strings contains only alphanumeric characters, spaces, or empty strings?

I believe this API would cover everything except empty strings https://docs.rapids.ai/api/cudf/stable/pylibcudf/api_docs/strings/char_types/#pylibcudf.strings.char_types.all_characters_of_type

davidwendt avatar Jul 23 '25 17:07 davidwendt

Added some validation and conditional xfailing based on the polars version here. We now guard for non-ascii characters in the simplest way I can think of. It's not ideal but I think we should move forward and reassess if perf concerns ever surface.

brandon-b-miller avatar Aug 05 '25 16:08 brandon-b-miller

/merge

mroeschke avatar Aug 05 '25 21:08 mroeschke