polars icon indicating copy to clipboard operation
polars copied to clipboard

fix(rust): Treat splitting by empty string as iterating over chars

Open haocheng6 opened this issue 1 year ago • 6 comments

Changes

This PR fixes the behavior of str.split, str.split_exact and str.splitn by treating splitting by an empty string as iterating over chars.

Closes #14604

haocheng6 avatar Apr 26 '24 22:04 haocheng6

Can we see if we can support this without allocating a boxed iterator on every row?

ritchie46 avatar Apr 27 '24 09:04 ritchie46

@ritchie46 Sure. I restored the signatures of split_to_struct and split_helper in b380968ed74a30fb1aaf4d85d6116cf393ebd697. Now there are no boxed trait objects.

haocheng6 avatar Apr 27 '24 23:04 haocheng6

I think it's easier/faster if we modify split_to_struct instead of every single caller of it. That way we also don't have to do a filter over every single character in the string just to modify the edge case behavior.

orlp avatar Apr 28 '24 09:04 orlp

@orlp I refactored the code 9e91234281895936c3df32f759f887d8782867d0. Now there is no call to filter, but for splitn, it still needs to use chars().count() to determine the right number of substrings when the pattern is empty. Is this the kind of thing you had in mind?

haocheng6 avatar Apr 28 '24 22:04 haocheng6

@haocheng6 What you should do is when by == "" you should use s.splitn(n + 1, "").skip(1) instead of op which should automatically handle all cases.

orlp avatar Apr 29 '24 11:04 orlp

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 80.95%. Comparing base (dee176c) to head (5bf814f). Report is 31 commits behind head on main.

:exclamation: Current head 5bf814f differs from pull request most recent head b6b24c7. Consider uploading reports for the commit b6b24c7 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #15922      +/-   ##
==========================================
- Coverage   81.24%   80.95%   -0.29%     
==========================================
  Files        1382     1384       +2     
  Lines      176628   178145    +1517     
  Branches     3032     3043      +11     
==========================================
+ Hits       143494   144216     +722     
- Misses      32649    33446     +797     
+ Partials      485      483       -2     
Flag Coverage Δ
python 74.43% <100.00%> (-0.25%) :arrow_down:
rust 78.09% <100.00%> (-0.28%) :arrow_down:

Flags with carried forward coverage won't be shown. Click here to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov[bot] avatar Apr 29 '24 23:04 codecov[bot]

CodSpeed Performance Report

Merging #15922 will improve performances by 26.19%

Comparing haocheng6:14604-fix-str-split-by-empty-string (b6b24c7) with main (be09246)

Summary

⚡ 2 improvements ✅ 33 untouched benchmarks

Benchmarks breakdown

Benchmark main haocheng6:14604-fix-str-split-by-empty-string Change
test_filter2 2.8 ms 2.2 ms +24.41%
test_tpch_q15 7.8 ms 6.2 ms +26.19%

codspeed-hq[bot] avatar May 01 '24 10:05 codspeed-hq[bot]

@haocheng6 Thanks for the PR!

orlp avatar May 01 '24 11:05 orlp