
Improving string splitting in Polars

Wainberg opened this issue 2 years ago • 4 comments

Description

Polars currently has three string-splitting functions:

  • str.split() -> list of strings (variable length, one element per part)
  • str.splitn(n) -> struct of n strings: makes n-1 splits
  • str.split_exact(n) -> struct of n + 1 strings: makes n splits and discards everything after the nth

Issues:

  • There's currently no way to select only one field, or only a subset of fields, without first constructing the entire struct, subsetting, and then renaming the field(s) of interest. This is computationally expensive (because you have to generate a bunch of string columns that are immediately discarded) and can take 3-4 steps for what could be a single operation.
  • splitn and split_exact also do similar things but differ in terms of the number of splits they make (n-1 vs n), the number of struct fields they result in (n vs n+1) and how they treat the end of the string. We want to obey the Polars guideline that functions that return different dtypes should be different functions. So splitn/split_exact can't be merged with split, but can be merged with each other.
  • There's no way to split from the right instead of the left, without first reversing the string.

Proposed changes:

  1. Rename splitn to split_n to comply with Polars' naming conventions.
  2. Add optional start/end indices (start/end) to split_n. In other words, after splitting into n fields, select only fields start:end.

To take a bioinformatics example, if a column called 'SNP' has the element 'rs12345:chr1:A:G', then:

pl.col.SNP.str.split_n(':', n=4) gives ['rs12345', 'chr1', 'A', 'G']
pl.col.SNP.str.split_n(':', n=4, start=1, end=3) gives ['chr1', 'A']
pl.col.SNP.str.split_n(':', n=3) gives ['rs12345', 'chr1', 'A:G'] --> the same as the current splitn(n=3)
pl.col.SNP.str.split_n(':', n=4, end=3) gives ['rs12345', 'chr1', 'A'] --> the same as the current split_exact(n=3)
pl.col.SNP.str.split_n(':', n=3, end=4) gives an error.
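The proposed semantics can be sketched in plain Python (this split_n is a hypothetical illustration of the behavior above, not an actual implementation):

```python
def split_n(s, by, n, start=0, end=None):
    """Sketch of the proposed split_n: split into at most n fields,
    then keep only fields start:end."""
    if end is None:
        end = n
    if end > n:
        raise ValueError("end may not exceed n")
    parts = s.split(by, n - 1)  # at most n fields; the last keeps the remainder
    return parts[start:end]
```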

  3. Add an optional field name argument (names) to split_n, which would name the fields of the returned struct. If specified, require len(names) to equal the number of fields.

  4. Add a right parameter to split_n, which is False by default. If right=True, split from the right instead of the left, like Python's str.rsplit method.

  5. Add str.split_nth, which returns a pl.String column of just the nth field. This would also take a right argument. For instance:

pl.col.SNP.str.split_nth(':', n=0) gives 'rs12345'
pl.col.SNP.str.split_nth(':', n=1, right=True) gives 'A'

If n > 0 and the nth ':' isn't found, the result is null. If the nth ':' is found but the (n+1)th isn't, the result is the remainder of the string.

  6. Deprecate str.split_exact, since split_exact(n=n) can be expressed as split_n(n=n+1, end=n).

Wainberg avatar Jan 11 '24 19:01 Wainberg

Is there a specific use case for _n and _exact returning a Struct instead of a List?

I would have assumed they returned a List.

[Update]: https://github.com/pola-rs/polars/issues/4359#issuecomment-1211680992

A list type has a more expensive memory format, as it is designed to deal with varying-length elements. A struct can be decomposed into all of its fields at zero cost.

cmdlineluser avatar Jan 11 '24 20:01 cmdlineluser

Shouldn't we be using an array then? Which is polars' fixed length list.

mkleinbort avatar Jan 12 '24 00:01 mkleinbort

@mkleinbort like @cmdlineluser pointed out in their edited post, the advantage of struct over list/array is that you can get each struct field as a column with zero cost. That's typically what you want when you split_n/split_exact: for each part of the string to become a column.

Wainberg avatar Jan 12 '24 01:01 Wainberg

I see. I assumed the arr data type was at least as efficient as a struct. Noted.

mkleinbort avatar Jan 12 '24 01:01 mkleinbort

@ritchie46 curious to hear your thoughts on split_nth in particular - seems like a win for efficiency as well as usability.

Wainberg avatar Jan 30 '24 00:01 Wainberg

Perhaps related to this, once I have split a string, I often want to create columns from the split. The docs say to do this:

df.with_columns(
    [
        pl.col("x")
        .str.split_exact("_", 1)
        .struct.rename_fields(["first_part", "second_part"])
        .alias("fields"),
    ]
).unnest("fields")

But I really want to do this instead:

df.with_columns(
    [
        pl.col("x")
        .str.split_exact("_", 1)
        .struct.rename_fields(["first_part", "second_part"])
        .unnest()
    ]
)

Would that make sense?

Or perhaps even a struct expression which combines renaming and unnesting?

df.with_columns(
    [
        pl.col("x")
        .str.split_exact("_", 1)
        .struct.unnest_to(["first_part", "second_part"])
    ]
)

daviewales avatar Feb 01 '24 04:02 daviewales

Expression level unnest pl.Expr.unnest() has been requested a few times, but polars is not able to support it at the moment. I want the feature too, but it violates some very core design decisions that underpin all expressions.

mkleinbort avatar Feb 01 '24 08:02 mkleinbort

@daviewales You can also create the columns directly without unnest:

fields = pl.col("x").str.split_exact("_", 1).struct

df.with_columns(
    first_part = fields[0],
    second_part = fields[1]
)

# shape: (4, 3)
# ┌──────┬────────────┬─────────────┐
# │ x    ┆ first_part ┆ second_part │
# │ ---  ┆ ---        ┆ ---         │
# │ str  ┆ str        ┆ str         │
# ╞══════╪════════════╪═════════════╡
# │ a_1  ┆ a          ┆ 1           │
# │ null ┆ null       ┆ null        │
# │ c    ┆ c          ┆ null        │
# │ d_4  ┆ d          ┆ 4           │
# └──────┴────────────┴─────────────┘

It's a bit awkward having to use a variable to avoid repeating the longer expression.

cmdlineluser avatar Feb 01 '24 11:02 cmdlineluser