polars
polars copied to clipboard
fix(rust): Treat splitting by empty string as iterating over chars
Changes
This PR fixes the behavior of str.split, str.split_exact and str.splitn by treating splitting by an empty string as iterating over chars.
Closes #14604
Can we see if we can support this without allocating a boxed iterator on every row?
@ritchie46 Sure. I restored the signatures of split_to_struct and split_helper in b380968ed74a30fb1aaf4d85d6116cf393ebd697. Now there are no boxed trait objects.
I think it's easier/faster if we modify split_to_struct instead of every single caller of it. That way we also don't have to do a filter over every single character in the string just to modify the edge case behavior.
@orlp I refactored the code 9e91234281895936c3df32f759f887d8782867d0. Now there is no call to filter, but for splitn, it still needs to use chars().count() to determine the right number of substrings when the pattern is empty. Is this the kind of thing you had in mind?
@haocheng6 What you should do is when by == "" you should use s.splitn(n + 1, "").skip(1) instead of op which should automatically handle all cases.
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 80.95%. Comparing base (
dee176c) to head (5bf814f). Report is 31 commits behind head on main.
:exclamation: Current head 5bf814f differs from pull request most recent head b6b24c7. Consider uploading reports for the commit b6b24c7 to get more accurate results
Additional details and impacted files
@@ Coverage Diff @@
## main #15922 +/- ##
==========================================
- Coverage 81.24% 80.95% -0.29%
==========================================
Files 1382 1384 +2
Lines 176628 178145 +1517
Branches 3032 3043 +11
==========================================
+ Hits 143494 144216 +722
- Misses 32649 33446 +797
+ Partials 485 483 -2
| Flag | Coverage Δ | |
|---|---|---|
| python | 74.43% <100.00%> (-0.25%) |
:arrow_down: |
| rust | 78.09% <100.00%> (-0.28%) |
:arrow_down: |
Flags with carried forward coverage won't be shown. Click here to find out more.
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
CodSpeed Performance Report
Merging #15922 will improve performances by 26.19%
Comparing haocheng6:14604-fix-str-split-by-empty-string (b6b24c7) with main (be09246)
Summary
⚡ 2 improvements
✅ 33 untouched benchmarks
Benchmarks breakdown
| Benchmark | main |
haocheng6:14604-fix-str-split-by-empty-string |
Change | |
|---|---|---|---|---|
| ⚡ | test_filter2 |
2.8 ms | 2.2 ms | +24.41% |
| ⚡ | test_tpch_q15 |
7.8 ms | 6.2 ms | +26.19% |
@haocheng6 Thanks for the PR!