polars icon indicating copy to clipboard operation
polars copied to clipboard

fix: incorrect negative offset in multi-byte string slicing

Open orlp opened this issue 1 year ago • 7 comments

Fixes https://github.com/pola-rs/polars/issues/15136.

Issue identified by @mcrumiller here: https://github.com/pola-rs/polars/pull/14425#discussion_r1529068025.

orlp avatar Mar 18 '24 21:03 orlp

Ahh I have a fix for this too, but I'm assuming yours is better. Also the primary issue is now #15136.

mcrumiller avatar Mar 18 '24 21:03 mcrumiller

@mcrumiller It was rather tricky, we also were doing things wrong with 'very negative' indices. For example pl.Series(["abcd"]).str.slice(-100, 2) should return "", not "ab".

orlp avatar Mar 18 '24 21:03 orlp

pl.Series(["abcd"]).str.slice(-100, 2) should return "", not "ab".

I wasn't sure what the correct usage here was--is it a sliding window that gets "stuck" when it hits the end (in which case it would be "ab") or do we let the window go anywhere and see what's in it? I'm guessing the latter per your comment.

mcrumiller avatar Mar 18 '24 21:03 mcrumiller

@mcrumiller This is the current behavior of pl.Series(["abcd"]).str.slice(i, 2):

       -7 ab 
       -6 ab 
       -5 ab 
0 ab   -4 ab 
1 bc   -3 bc 
2 cd   -2 cd 
3 d    -1 d  
4
5

This is the bugfix behavior:

        -7    
        -6    
        -5 a  
0 ab    -4 ab 
1 bc    -3 bc 
2 cd    -2 cd 
3 d     -1 d  
4 
5 

To me the latter is much more logical. It's also consistent with Python:

>>> "abcd"[-5:-3]
"a"

This is "a", not "ab".

orlp avatar Mar 18 '24 21:03 orlp

Ugh but this is inconsistent with .slice... Will put this on draft for now until we've discussed it.

orlp avatar Mar 18 '24 21:03 orlp

Most other languages implement slice with start/stop/step instead of start/length. Is this worth considering? It would then allow us to use slice syntax i.e. pl.col("s").str[0:5] as pandas does, which is a nice syntax that I've always wished we had in polars. We could also then deprecate gather_every as it could be consumed by the slice syntax.

mcrumiller avatar Mar 18 '24 22:03 mcrumiller

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 81.22%. Comparing base (066c262) to head (00ea716). Report is 11 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #15140      +/-   ##
==========================================
+ Coverage   81.08%   81.22%   +0.13%     
==========================================
  Files        1342     1346       +4     
  Lines      174150   175228    +1078     
  Branches     2459     2506      +47     
==========================================
+ Hits       141210   142324    +1114     
+ Misses      32473    32424      -49     
- Partials      467      480      +13     

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov[bot] avatar Mar 18 '24 22:03 codecov[bot]

Any news on this on?

ritchie46 avatar Mar 22 '24 13:03 ritchie46

@ritchie46 I think the question up for discussion is how str.slice should work with negative offsets that extend past the start:

s = "abcdefghij"  # length 10
s2 = s.str.slice(-12, 5)

Which of the two options should this return?

# option 1: window sticks to left edge

        [        s2         ]
          0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
        ----|---|---|---|---|---|---|---|---|---
        | a | b | c | d | e | f | g | h | i | j

= "abcde"

# option 2: window positioned absolutely

[        s2         ]
-1 | -2 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
        ----|---|---|---|---|---|---|---|---|---
        | a | b | c | d | e | f | g | h | i | j

= "abc"

My other comment is that in both Rust and python, slice is defined by start and stop args (which would remove this issue entirely) instead of a start and length. Why do we do it this way (with length), and should we change the slice behavior?

mcrumiller avatar Mar 22 '24 13:03 mcrumiller

@ritchie46 This PR will be merged later, as part of this issue: https://github.com/pola-rs/polars/issues/15232. Unless you temporarily want to make str.slice inconsistent with .slice until we fix .slice as well.

@mcrumiller We've discussed this internally and we want to go forward with the solution described in that issue. We want to keep Polars' offset-length style, as it would literally break everyone's code out there to switch to Python's style if we even wanted to.

orlp avatar Mar 22 '24 14:03 orlp