Improving string splitting in Polars
Description
Polars currently has three string-splitting functions:
- `str.split()` -> list of `n` strings
- `str.splitn(n)` -> struct of `n` strings: makes `n - 1` splits
- `str.split_exact(n)` -> struct of `n + 1` strings: makes `n` splits and discards everything after the `n`th
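As a rough illustration, the two struct-returning functions can be sketched in plain Python (stand-in helpers, not the Polars implementation; padding missing fields with null is an assumption based on the fixed struct width):

```python
def splitn_like(s, by, n):
    # splitn(n): makes n - 1 splits -> n fields; the remainder stays
    # in the last field, and missing fields become None
    fields = s.split(by, n - 1)
    return fields + [None] * (n - len(fields))


def split_exact_like(s, by, n):
    # split_exact(n): makes n splits -> n + 1 fields; everything after
    # the nth separator is discarded, and missing fields become None
    fields = s.split(by, n + 1)[: n + 1]
    return fields + [None] * (n + 1 - len(fields))
```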
Issues:
- There's currently no way to select only one field, or only a subset of fields, without first constructing the entire struct, subsetting, and then renaming the field(s) of interest. This is computationally expensive (because you have to generate a bunch of string columns that are immediately discarded) and can take 3-4 steps for what could be a single operation.
- `splitn` and `split_exact` also do similar things but differ in the number of splits they make (`n - 1` vs `n`), the number of struct fields they result in (`n` vs `n + 1`), and how they treat the end of the string. We want to obey the Polars guideline that functions that return different dtypes should be different functions. So `splitn`/`split_exact` can't be merged with `split`, but can be merged with each other.
- There's no way to split from the right instead of the left, without first reversing the string.
Proposed changes:
- Rename `splitn` to `split_n` to comply with Polars' naming conventions.
- Add optional start/end indices (`start`/`end`) to `split_n`. In other words, after splitting into `n` fields, select only fields `start:end`.
To take a bioinformatics example, if a column called 'SNP' has the element 'rs12345:chr1:A:G', then:
```python
pl.col.SNP.str.split_n(':', n=4)                  # ['rs12345', 'chr1', 'A', 'G']
pl.col.SNP.str.split_n(':', n=4, start=1, end=3)  # ['chr1', 'A']
pl.col.SNP.str.split_n(':', n=3)                  # ['rs12345', 'chr1', 'A:G'] -- same as the current splitn(n=3)
pl.col.SNP.str.split_n(':', n=4, end=3)           # ['rs12345', 'chr1', 'A'] -- same as the current split_exact(n=3)
pl.col.SNP.str.split_n(':', n=3, end=4)           # error
```
- Add an optional field name argument (`names`) to `split_n`, which would name the fields of the returned struct. If specified, require `len(names)` to equal the number of fields.
- Add a `right` parameter to `split_n`, which is `False` by default. If `right=True`, split from the right instead of the left, like Python's `str.rsplit` method.
- Add `str.split_nth`, which returns a `pl.String` column of just the `n`th field. This would also take a `right` argument. For instance:
```python
pl.col.SNP.str.split_nth(':', n=0)              # 'rs12345'
pl.col.SNP.str.split_nth(':', n=1, right=True)  # 'A'
```
If `n > 0` and the `n`th `':'` isn't found, the result is null. If the `(n + 1)`th `':'` isn't found, return the remainder of the string.
- Deprecate `str.split_exact`, since `split_exact(n=n)` can be expressed as `split_n(n=n+1, end=n)`.
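The selection rules above can be made concrete with a plain-Python sketch. Both `split_n` and `split_nth` below are hypothetical helpers mirroring the proposal, not Polars APIs; the padding and error behaviour are assumptions drawn from the examples:

```python
def split_n(s, by, n, start=0, end=None, right=False):
    """Hypothetical sketch of the proposed str.split_n (not a Polars API)."""
    if end is None:
        end = n
    if end > n:
        # e.g. split_n(':', n=3, end=4) -> error
        raise ValueError(f"end ({end}) may not exceed n ({n})")
    # n fields require at most n - 1 splits
    parts = s.rsplit(by, n - 1) if right else s.split(by, n - 1)
    parts += [None] * (n - len(parts))  # pad missing fields with null
    return parts[start:end]


def split_nth(s, by, n, right=False):
    """Hypothetical sketch of the proposed str.split_nth (not a Polars API)."""
    # one extra split so the nth field holds the remainder when the
    # (n + 1)th separator is missing
    parts = s.rsplit(by, n + 1) if right else s.split(by, n + 1)
    if n >= len(parts):
        return None  # nth separator not found
    return parts[-(n + 1)] if right else parts[n]
```

With `s = 'rs12345:chr1:A:G'`, `split_n(s, ':', n=4, start=1, end=3)` selects `['chr1', 'A']` and `split_nth(s, ':', 1, right=True)` returns `'A'`, matching the examples above.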
Is there a specific use case for _n and _exact returning a Struct instead of a List?
I would have assumed they returned a List.
[Update]: https://github.com/pola-rs/polars/issues/4359#issuecomment-1211680992
> A list type has a more expensive memory format as it is designed to deal with varying length elements. A struct can be decomposed in all fields zero cost.
Shouldn't we be using an array then? Which is polars' fixed length list.
@mkleinbort like @cmdlineluser pointed out in their edited post, the advantage of struct over list/array is that you can get each struct field as a column at zero cost. That's typically what you want when you `split_n`/`split_exact`: for each part of the string to become a column.
I see. I assumed the arr data type was at least as efficient as a struct. Noted.
@ritchie46 curious to hear your thoughts on split_nth in particular - seems like a win for efficiency as well as usability.
Perhaps related to this, once I have split a string, I often want to create columns from the split. The docs say to do this:
```python
df.with_columns(
    [
        pl.col("x")
        .str.split_exact("_", 1)
        .struct.rename_fields(["first_part", "second_part"])
        .alias("fields"),
    ]
).unnest("fields")
```
But I really want to do this instead:
```python
df.with_columns(
    [
        pl.col("x")
        .str.split_exact("_", 1)
        .struct.rename_fields(["first_part", "second_part"])
        .unnest()
    ]
)
```
Would that make sense?
Or perhaps even a struct expression which combines renaming and unnesting?
```python
df.with_columns(
    [
        pl.col("x")
        .str.split_exact("_", 1)
        .struct.unnest_to(["first_part", "second_part"])
    ]
)
```
Expression-level unnest (`pl.Expr.unnest()`) has been requested a few times, but Polars is not able to support it at the moment. I want the feature too, but it violates some very core design decisions that underpin all expressions.
@daviewales You can also create the columns directly without unnest:
```python
fields = pl.col("x").str.split_exact("_", 1).struct

df.with_columns(
    first_part=fields[0],
    second_part=fields[1],
)
# shape: (4, 3)
# ┌──────┬────────────┬─────────────┐
# │ x    ┆ first_part ┆ second_part │
# │ ---  ┆ ---        ┆ ---         │
# │ str  ┆ str        ┆ str         │
# ╞══════╪════════════╪═════════════╡
# │ a_1  ┆ a          ┆ 1           │
# │ null ┆ null       ┆ null        │
# │ c    ┆ c          ┆ null        │
# │ d_4  ┆ d          ┆ 4           │
# └──────┴────────────┴─────────────┘
```
It's a bit awkward having to use a variable to avoid repeating the longer expression.