
GH-46683: Add utf8_zfill kernel for sign-aware zero padding

Open iabhi4 opened this issue 6 months ago • 13 comments

What does this PR do?

Adds a new compute kernel, utf8_zero_fill, to the Arrow Compute module. This kernel zero-pads UTF-8 strings while preserving leading signs.

Why is it needed?

Python's str.zfill() behavior is useful and expected in many data cleaning scenarios. Arrow lacked a direct equivalent.
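For reference, a short illustration of the built-in behavior being mirrored. It uses only Python's own str.zfill() and str.rjust(), with no Arrow calls, and the sample values are made up for the example:

```python
# Built-in str.zfill keeps a leading '+' or '-' in front of the inserted zeros.
values = ["42", "-42", "+7", "abc", ""]
zfilled = [v.zfill(5) for v in values]

# Plain right-justification with "0" knows nothing about signs.
rjusted = [v.rjust(5, "0") for v in values]

for v, z, r in zip(values, zfilled, rjusted):
    print(f"{v!r:>6}  zfill -> {z!r:<9} rjust -> {r!r}")
```

The two results differ only for the signed inputs: zfill gives "-0042" where rjust gives "00-42".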

Implementation details

  • Introduces Utf8ZeroFillTransform with dedicated ZeroFillOptions
  • Registers utf8_zero_fill as a compute kernel
  • Adds Python bindings and a clean alias: utf8_zfill = utf8_zero_fill
  • Includes sign-aware padding logic
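
The sign-aware padding logic can be modeled roughly as below. This is a pure-Python sketch for readers, not the C++ code in Utf8ZeroFillTransform; the function name zero_fill and the single-code-point check on padding are assumptions made for the illustration.

```python
def zero_fill(value: str, width: int, padding: str = "0") -> str:
    """Hypothetical pure-Python model of sign-aware padding (not Arrow's C++ code)."""
    if len(padding) != 1:
        raise ValueError("padding must be exactly one code point")
    # len() counts Unicode code points, so width is measured in code points here.
    missing = width - len(value)
    if missing <= 0:
        return value  # already wide enough: returned unchanged
    if value[:1] in ("+", "-"):
        # Keep the sign first, then the fill, then the rest of the string.
        return value[0] + padding * missing + value[1:]
    return padding * missing + value


for sample in ["42", "-42", "+7", "abc", "é"]:
    print(repr(sample), "->", repr(zero_fill(sample, 5)))
```

With the default padding, this model agrees with str.zfill() on the samples above.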

Are these changes tested?

  • C++ unit tests: TestStringKernels, Utf8ZeroFill
  • Python tests: test_utf8_zfill() (both test suites were run)
  • GitHub Issue: #46683

iabhi4 avatar Jun 15 '25 05:06 iabhi4

@pitrou Just wanted to give you a heads-up that the implementation and tests are ready for review. Please let me know if I’ve missed anything or if there’s a changelog entry or any documentation I should update as part of this PR.

iabhi4 avatar Jun 15 '25 05:06 iabhi4

Thanks for the review @AlenkaF, took care of the CI errors and added an Example in the PadOptions docs as suggested

iabhi4 avatar Jun 17 '25 22:06 iabhi4

@github-actions crossbow submit preview-docs

AlenkaF avatar Jun 18 '25 06:06 AlenkaF

Revision: 7d33c202452ac58873aa085af0e9e5a216784a5a

Submitted crossbow builds: ursacomputing/crossbow @ actions-08f3ce64cb

Task Status
preview-docs GitHub Actions

github-actions[bot] avatar Jun 18 '25 06:06 github-actions[bot]

I think a suggestion of mine might have been lost during the review stage. What I meant to propose was setting padding="0" as the default, while still allowing users to override it with any other character. That would align better with the function's name (zero fill) and also provide the desired behaviour without requiring users to explicitly specify the padding each time.

AlenkaF avatar Jun 19 '25 04:06 AlenkaF

What I meant to propose was setting padding="0" as the default

Got it! I’ve updated it so that "0" is used as the default padding if none is provided, but users can still override it with their custom padding

iabhi4 avatar Jun 23 '25 03:06 iabhi4

Thanks! Just to check, what would currently pc.utf8_zfill(arr, options=pc.PadOptions(width=3, padding=' ')) produce? (padding with a whitespace)

AlenkaF avatar Jun 23 '25 06:06 AlenkaF
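
To make the question above concrete: under the assumption that the fill character is always placed between a leading sign and the rest of the string, a whitespace padding would split the sign from the digits. The helper sign_aware_pad below is hypothetical and only models that assumption; it does not claim to show what the kernel in this PR actually returns.

```python
def sign_aware_pad(value: str, width: int, padding: str) -> str:
    # Hypothetical model: the fill goes between a leading sign and the rest.
    fill = padding * max(width - len(value), 0)
    if value[:1] in ("+", "-"):
        return value[0] + fill + value[1:]
    return fill + value


samples = ["1", "-1", "+1", "1234"]
padded = [sign_aware_pad(s, 3, " ") for s in samples]
print(padded)
```

In this model "-1" becomes "- 1", which is why a non-zero padding character deserves an explicit decision.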

Time for a naming discussion: should the function be called "utf8_zfill" (Python inspiration, but cryptic to non-Python programmers) or "utf8_zero_fill" (longer, more explicit)? cc @zanmato1984

pitrou avatar Jun 23 '25 07:06 pitrou

Is it possible to use the name utf8_zero_fill in C++ while mapping it to utf8_zfill in pyarrow?

zanmato1984 avatar Jun 23 '25 08:06 zanmato1984

We could, but for now we're mirroring the function names exactly, so that could end up confusing. Also, the documentation would be less shareable between C++ and Python.

pitrou avatar Jun 23 '25 08:06 pitrou

We could add a Python alias though, such as utf8_zfill = utf8_zero_fill.

pitrou avatar Jun 23 '25 08:06 pitrou
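
A minimal sketch of what such an alias means at the Python level. The function body here is a stand-in that delegates to str.zfill(); it is not the pyarrow binding, and its two-argument signature is an assumption for the example.

```python
def utf8_zero_fill(value: str, width: int) -> str:
    """Stand-in for the real kernel binding; delegates to str.zfill for illustration."""
    return value.zfill(width)


# The alias is simply a second name bound to the same function object.
utf8_zfill = utf8_zero_fill

print(utf8_zfill("-5", 3))
print(utf8_zfill is utf8_zero_fill)
print(utf8_zfill.__name__)
```

Because both names refer to one object, introspection such as __name__ and help() still reports utf8_zero_fill, so the documentation stays tied to the explicit name.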

Aliasing sounds ideal I guess?

zanmato1984 avatar Jun 23 '25 08:06 zanmato1984

Just pushed the final changes. I introduced ZeroFillOptions as a standalone class instead of overloading PadOptions, renamed the kernel to utf8_zero_fill for clarity, and added a clean alias in Python (utf8_zfill = utf8_zero_fill)

Tests and docs are updated accordingly. Let me know if there's anything else you'd like adjusted

iabhi4 avatar Jun 24 '25 00:06 iabhi4

Thanks for the feedback @pitrou, I’ve addressed all the suggestions!

iabhi4 avatar Jun 26 '25 01:06 iabhi4

After merging your PR, Conbench analyzed the 3 benchmarking runs that have been run so far on merge-commit 2bdcbda3c5f4cf237581856f41352826442a05d3.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

@github-actions crossbow submit test-ubuntu-22.04-cpp-20

kou avatar Jul 03 '25 23:07 kou

Revision: 54a5020183e09c938000b9ae4d1c2cf50d84d819

Submitted crossbow builds: ursacomputing/crossbow @ actions-3163db8baa

Task Status
test-ubuntu-22.04-cpp-20 GitHub Actions

github-actions[bot] avatar Jul 03 '25 23:07 github-actions[bot]