GH-46683: Add utf8_zfill kernel for sign-aware zero padding
What does this PR do?
Adds a new compute kernel, utf8_zero_fill, to the Arrow Compute module. This kernel zero-pads UTF-8 strings on the left while keeping a leading sign (+ or -) at the front.
Why is it needed?
Python's str.zfill() behavior is useful and expected in many data cleaning scenarios. Arrow lacked a direct equivalent.
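For reference, the built-in behavior being mirrored looks like this. This is plain Python, not the Arrow kernel itself; the sample values are made up for illustration.

```python
# Python's built-in str.zfill pads on the left with "0" up to the given width,
# keeping a leading "+" or "-" in front of the inserted zeros.
samples = ["42", "-42", "+42", "12345", ""]
padded = [s.zfill(5) for s in samples]
print(padded)  # ['00042', '-0042', '+0042', '12345', '00000']

# For comparison, plain right-justification puts the fill before the sign.
naive = "-42".rjust(5, "0")
print(naive)  # 00-42
```

The last line shows why a generic left-pad is not a substitute: the sign ends up in the middle of the result.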
Implementation details
- Introduces Utf8ZeroFillTransform with dedicated ZeroFillOptions
- Registers utf8_zero_fill as a compute kernel
- Adds Python bindings and a clean alias: utf8_zfill = utf8_zero_fill
- Includes sign-aware padding logic
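The sign-aware padding logic can be sketched in a few lines of pure Python. This is only a rough model of the idea, not the C++ Utf8ZeroFillTransform; the function name zero_fill and its parameters are hypothetical, and measuring width in code points is an assumption.

```python
def zero_fill(text: str, width: int, padding: str = "0") -> str:
    """Left-pad text to width, inserting the fill after a leading sign.

    Hypothetical helper for illustration; width is counted in code points.
    """
    if len(padding) != 1:
        raise ValueError("padding must be exactly one code point")
    missing = width - len(text)
    if missing <= 0:
        # Already wide enough: return the input unchanged.
        return text
    # Peel off a single leading sign so the fill goes between sign and digits.
    sign = text[0] if text[:1] in ("+", "-") else ""
    return sign + padding * missing + text[len(sign):]


print(zero_fill("-42", 5))   # -0042
print(zero_fill("+7", 4))    # +007
print(zero_fill("abc", 2))   # abc
```

With padding="0" this sketch agrees with str.zfill on the inputs above. With another fill character, such as a space, the sketch would place the fill between the sign and the digits; whether the kernel does the same is not something this sketch establishes.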
Are these changes tested?
- C++ unit tests: TestStringKernels, Utf8ZeroFill
- Python tests: test_utf8_zfill()

Ran both tests.
- GitHub Issue: #46683
@pitrou Just wanted to give you a heads-up that the implementation and tests are ready for review.
Please let me know if I’ve missed anything or if there’s a changelog entry or any documentation I should update as part of this PR.
Thanks for the review @AlenkaF. I took care of the CI errors and added an example to the PadOptions docs, as suggested.
@github-actions crossbow submit preview-docs
Revision: 7d33c202452ac58873aa085af0e9e5a216784a5a
Submitted crossbow builds: ursacomputing/crossbow @ actions-08f3ce64cb
| Task | Status |
|---|---|
| preview-docs | |
I think a suggestion of mine might have been lost during the review stage. What I meant to propose was setting padding="0" as the default, while still allowing users to override it with any other character. That would align better with the function's name (zero fill) and also provide the desired behaviour without requiring users to explicitly specify the padding each time.
> What I meant to propose was setting padding="0" as the default
Got it! I’ve updated it so that "0" is used as the default padding if none is provided, but users can still override it with a custom padding character.
Thanks! Just to check, what would currently pc.utf8_zfill(arr, options=pc.PadOptions(width=3, padding=' ')) produce? (padding with a whitespace)
Time for a naming discussion: should the function be called "utf8_zfill" (Python inspiration, but cryptic to non-Python programmers) or "utf8_zero_fill" (longer, more explicit)? cc @zanmato1984
Is it possible to use the name utf8_zero_fill in C++ while mapping it to utf8_zfill in pyarrow?
We could, but for now we're mirroring the function names exactly, so that could end up confusing. Also, the documentation would be less shareable between C++ and Python.
We could add a Python alias though, such as utf8_zfill = utf8_zero_fill.
Aliasing sounds ideal I guess?
Just pushed the final changes. I introduced ZeroFillOptions as a standalone class instead of overloading PadOptions, renamed the kernel to utf8_zero_fill for clarity, and added a clean alias in Python (utf8_zfill = utf8_zero_fill).
Tests and docs are updated accordingly. Let me know if there's anything else you'd like adjusted.
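To make the shape of that design concrete, here is a self-contained Python mock-up of a dedicated options class with "0" as the default padding, plus a plain name alias. Nothing here is the pyarrow API: the classes and functions below are stand-ins that only reuse the names from this thread, and passing nulls through unchanged is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ZeroFillOptions:
    # Stand-in options class: width is required, padding defaults to "0".
    width: int
    padding: str = "0"


def utf8_zero_fill(values, options):
    """Stand-in kernel over a list of strings; None entries pass through."""
    out = []
    for v in values:
        if v is None:
            out.append(None)
            continue
        sign = v[0] if v[:1] in ("+", "-") else ""
        fill = options.padding * max(options.width - len(v), 0)
        out.append(sign + fill + v[len(sign):])
    return out


# The alias is just a second name bound to the same function object.
utf8_zfill = utf8_zero_fill

print(utf8_zfill(["7", "-7", None], ZeroFillOptions(width=3)))
# ['007', '-07', None]
```

Because the alias is a simple assignment, both names share one implementation and one docstring, which matches the concern about keeping the documentation shareable.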
Thanks for the feedback @pitrou, I’ve addressed all the suggestions!
After merging your PR, Conbench analyzed the 3 benchmarking runs that have been run so far on merge-commit 2bdcbda3c5f4cf237581856f41352826442a05d3.
There were no benchmark performance regressions. 🎉
The full Conbench report has more details.
@github-actions crossbow submit test-ubuntu-22.04-cpp-20
Revision: 54a5020183e09c938000b9ae4d1c2cf50d84d819
Submitted crossbow builds: ursacomputing/crossbow @ actions-3163db8baa
| Task | Status |
|---|---|
| test-ubuntu-22.04-cpp-20 | |