polars
polars copied to clipboard
fix: incorrect negative offset in multi-byte string slicing
Fixes https://github.com/pola-rs/polars/issues/15136.
Issue identified by @mcrumiller here: https://github.com/pola-rs/polars/pull/14425#discussion_r1529068025.
Ahh I have a fix for this too, but I'm assuming yours is better. Also the primary issue is now #15136.
@mcrumiller It was rather tricky, we also were doing things wrong with 'very negative' indices. For example pl.Series(["abcd"]).str.slice(-100, 2) should return "", not "ab".
pl.Series(["abcd"]).str.slice(-100, 2)should return"", not"ab".
I wasn't sure what the correct usage here was--is it a sliding window that gets "stuck" when it hits the end (in which case it would be "ab") or do we let the window go anywhere and see what's in it? I'm guessing the latter per your comment.
@mcrumiller This is the current behavior of pl.Series(["abcd"]).str.slice(i, 2):
-7 ab
-6 ab
-5 ab
0 ab -4 ab
1 bc -3 bc
2 cd -2 cd
3 d -1 d
4
5
This is the bugfix behavior:
-7
-6
-5 a
0 ab -4 ab
1 bc -3 bc
2 cd -2 cd
3 d -1 d
4
5
To me the latter is much more logical. It's also consistent with Python:
>>> "abcd"[-5:-3]
"a"
This is "a", not "ab".
Ugh but this is inconsistent with .slice... Will put this on draft for now until we've discussed it.
Most other languages implement slice with start/stop/step instead of start/length. Is this worth considering? It would then allow us to use slice syntax i.e. pl.col("s").str[0:5] as pandas does, which is a nice syntax that I've always wished we had in polars. We could also then deprecate gather_every as it could be consumed by the slice syntax.
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 81.22%. Comparing base (
066c262) to head (00ea716). Report is 11 commits behind head on main.
Additional details and impacted files
@@ Coverage Diff @@
## main #15140 +/- ##
==========================================
+ Coverage 81.08% 81.22% +0.13%
==========================================
Files 1342 1346 +4
Lines 174150 175228 +1078
Branches 2459 2506 +47
==========================================
+ Hits 141210 142324 +1114
+ Misses 32473 32424 -49
- Partials 467 480 +13
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
Any news on this on?
@ritchie46 I think the question up for discussion is how str.slice should work with negative offsets that extend past the start:
s = "abcdefghij" # length 10
s2 = s.str.slice(-12, 5)
Which of the two options should this return?
# option 1: window sticks to left edge
[ s2 ]
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
----|---|---|---|---|---|---|---|---|---
| a | b | c | d | e | f | g | h | i | j
= "abcde"
# option 2: window positioned absolutely
[ s2 ]
-1 | -2 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
----|---|---|---|---|---|---|---|---|---
| a | b | c | d | e | f | g | h | i | j
= "abc"
My other comment is that in both Rust and python, slice is defined by start and stop args (which would remove this issue entirely) instead of a start and length. Why do we do it this way (with length), and should we change the slice behavior?
@ritchie46 This PR will be merged later, as part of this issue: https://github.com/pola-rs/polars/issues/15232. Unless you temporarily want to make str.slice inconsistent with .slice until we fix .slice as well.
@mcrumiller We've discussed this internally and we want to go forward with the solution described in that issue. We want to keep Polars' offset-length style, as it would literally break everyone's code out there to switch to Python's style if we even wanted to.