Support `cudf-polars` `str.zfill`
Closes https://github.com/rapidsai/cudf/issues/19035
I believe this needs https://github.com/pola-rs/polars/pull/22985 to pass the one remaining failing test and the column overload described here https://github.com/rapidsai/cudf/issues/19035#issuecomment-2937332033
I assume that this PR will updated to leverage https://github.com/rapidsai/cudf/pull/19090 once that is merged?
Sorry @mroeschke , just getting back to this now. I've fixed up the PR which should now be passing tests, and responded to the open question above.
Looks like py-polars/tests/unit/operations/namespaces/string/test_pad.py::test_str_zfill_unicode_not_respected may need to be added to the expected failures list in cudf/python/cudf_polars/cudf_polars/testing/plugin.py. Apparently Polars doesn't add zeros for unicode characters.
Polars doesn't add zeros for unicode characters.
Should this be a polars issue? Who is right?
Should this be a polars issue? Who is right?
Given that Polars has a dedicated test checking that zeros aren't padded on unicode, I suppose they are "right" and ideally libcudf wouldn't pad unicode characters
Should this be a polars issue? Who is right?
Given that Polars has a dedicated test checking that zeros aren't padded on unicode, I suppose they are "right" and ideally libcudf wouldn't pad unicode characters
It may be this piece of the docs:
This method is intended for padding numeric strings. If your data contains non-ASCII characters, use [pad_start()](https://docs.pola.rs/api/python/stable/reference/series/api/polars.Series.str.pad_start.html#polars.Series.str.pad_start) instead.
Ah. I wonder if we need to do something beyond xfailing here? It seems like in the wild users would be mostly zfilling numeric data but it's still dangerous to have a behavior divergence with the GPU backend. There aren't a lot of great options though. Off the top of my head we could
- extend the libcudf
zfillapi to ignore unicode optionally - scan the data unnecessarily in 99% of cases to make us safe from the one edge case
- Document the issue somewhere
None of the above seem particularly amazing. I'll think on what to do in lieu of any obvious ideas I missed.
I wonder if we need to do something beyond xfailing here?
Ah OK so I suppose the "proper" solution could use pylibcudf.strings.convert.convert_fixed_point.is_fixed_point to use with copy_if_else to used the zfilled values if the values are numeric else the old values.
Ah OK so I suppose the "proper" solution could use
pylibcudf.strings.convert.convert_fixed_point.is_fixed_pointto use withcopy_if_elseto used thezfilled values if the values are numeric else the old values.
Thanks- this sounds like a winner 😄
It looks like polars wants to pad strings like abc as well as numeric, so we can't use is_fixed_point. But I think I might see a way forward with all_characters_of_type, will try it out.
after playing around with this a bit I can't think of a way out of this without throwing for the cases where the APIs diverge. AFAICT they'll only definitely agree for a certain subset of characters in the source data.
@davidwendt whats the fastest way of checking if a column of strings contains only alphanumeric characters, spaces, or empty strings?
hats the fastest way of checking if a column of strings contains only alphanumeric characters, spaces, or empty strings?
Maybe cudf::strings::contains_re and then cudf::reduce
... whats the fastest way of checking if a column of strings contains only alphanumeric characters, spaces, or empty strings?
I believe this API would cover everything except empty strings https://docs.rapids.ai/api/cudf/stable/pylibcudf/api_docs/strings/char_types/#pylibcudf.strings.char_types.all_characters_of_type
Added some validation and conditional xfailing based on the polars version here. We now guard for non-ascii characters in the simplest way I can think of. It's not ideal but I think we should move forward and reassess if perf concerns ever surface.
/merge