polars
feat(python): add head/tail under string namespace
Resolves #10337 and #10349
~~There is some discussion on the naming of the functions.~~
Following the discussion in the linked issue:
I'd much prefer head and tail that always keep a given amount of Unicode codepoints (which corresponds to bytes for ASCII) from respectively the start/end of the string.
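For anyone following along, here is a rough pure-Python sketch of the semantics described above. It is not the PR's implementation; the helper names `head` and `tail` are purely illustrative, and it relies on the fact that Python `str` slicing already operates on code points.

```python
def head(s: str, n: int) -> str:
    """Return the first n code points of s (assumes n >= 0)."""
    return s[:n]


def tail(s: str, n: int) -> str:
    """Return the last n code points of s (assumes n >= 0)."""
    # max() guards against n > len(s), which would otherwise wrap around.
    return s[max(len(s) - n, 0):]


s = "caf\u00e9!"  # 5 code points, 6 bytes in UTF-8 ("é" takes 2 bytes)

first_four = head(s, 4)  # keeps "é" whole
last_two = tail(s, 2)    # "é" followed by "!"

# Slicing the encoded bytes instead can cut a code point in half:
truncated = s.encode("utf-8")[:4]  # ends with a dangling lead byte
```

For pure ASCII input the two approaches agree, since every code point is exactly one byte.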
Do you think the docstrings should specify what exactly a "character" is?
@avimallu perhaps, yes. We currently don't for slice either. From what I gather, .str.slice() returns code points (characters), which isn't referenced in the docstring, but probably should be.
Okay @mcrumiller. The average person (à la me) probably isn't familiar with code points, but it is an important distinction to be aware of. Maybe this would avoid confusion for the unfamiliar folks, while still providing enough info to the ones looking for it:
Returns the first/last n characters (strictly, UTF8 code points) of a UTF8 string.
Replace UTF8 code points with what is technically accurate?
This should still support negative arguments as we discussed.
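The exact behaviour for negative arguments isn't spelled out in this thread, so the following is only one plausible reading, sketched in plain Python: a negative `n` means "everything except `|n|` code points from the opposite end". The names `head_signed` and `tail_signed` are hypothetical and not part of the PR.

```python
def head_signed(s: str, n: int) -> str:
    """n >= 0: first n code points; n < 0: all but the last |n| (assumed)."""
    # Python slicing already gives this behaviour for both signs.
    return s[:n]


def tail_signed(s: str, n: int) -> str:
    """n >= 0: last n code points; n < 0: all but the first |n| (assumed)."""
    if n >= 0:
        return s[max(len(s) - n, 0):]
    return s[-n:]  # drop the first |n| code points


word = "polars"
drop_last_two = head_signed(word, -2)   # "pola"
drop_first_two = tail_signed(word, -2)  # "lars"
```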
@orlp I'm working on it. I'm still pretty new to Rust and haven't really internalized most of the concepts. pyarrow2's string slicing only operates on a fixed input length, so I have to unpack the pyarrow array and apply this to the elements inside, and I'm still trying to figure out how to do that. I may post a non-working commit and ask for some assistance.
@mcrumiller I understand, I commented more in case someone else wanted to review/merge in the current state.
GitHub has a "draft" feature you can use while your work is in progress. I went ahead and clicked the button for you 😸
That is a useful feature, thank you @stinodego !
I'm going to close and start from scratch rather than trying to revive this ancient PR.