
Improving string splitting in Polars

Wainberg opened this issue 2 years ago • 4 comments

Description

Polars currently has three string-splitting functions:

  • str.split() -> list of strings (variable length, one element per part)
  • str.splitn(n) -> struct of n strings: makes n-1 splits
  • str.split_exact(n) -> struct of n + 1 strings: makes n splits and discards everything after the nth

Issues:

  • There's currently no way to select only one field, or only a subset of fields, without first constructing the entire struct, subsetting, and then renaming the field(s) of interest. This is computationally expensive (because you have to generate a bunch of string columns that are immediately discarded) and can take 3-4 steps for what could be a single operation.
  • splitn and split_exact also do similar things but differ in terms of the number of splits they make (n-1 vs n), the number of struct fields they result in (n vs n+1) and how they treat the end of the string. We want to obey the Polars guideline that functions that return different dtypes should be different functions. So splitn/split_exact can't be merged with split, but can be merged with each other.
  • There's no way to split from the right instead of the left, without first reversing the string.

Proposed changes:

  1. Rename splitn to split_n to comply with Polars' naming conventions.
  2. Add optional start/end indices (start/end) to split_n. In other words, after splitting into n fields, select only fields start:end.

To take a bioinformatics example, if a column called 'SNP' has the element 'rs12345:chr1:A:G', then:

pl.col.SNP.str.split_n(':', n=4) gives ['rs12345', 'chr1', 'A', 'G']
pl.col.SNP.str.split_n(':', n=4, start=1, end=3) gives ['chr1', 'A']
pl.col.SNP.str.split_n(':', n=3) gives ['rs12345', 'chr1', 'A:G'] --> the same as the current splitn(n=3)
pl.col.SNP.str.split_n(':', n=4, end=3) gives ['rs12345', 'chr1', 'A'] --> the same as the current split_exact(n=3)
pl.col.SNP.str.split_n(':', n=3, end=4) gives an error.
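The proposed semantics can be sketched in plain Python (this split_n is a hypothetical illustration of the behavior above, not an actual implementation):

```python
def split_n(s, by, n, start=0, end=None):
    """Sketch of the proposed split_n: split into at most n fields,
    then keep only fields start:end."""
    if end is None:
        end = n
    if end > n:
        raise ValueError("end may not exceed n")
    parts = s.split(by, n - 1)  # at most n fields; the last keeps the remainder
    return parts[start:end]
```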

  3. Add an optional field name argument (names) to split_n, which would name the fields of the returned struct. If specified, require len(names) to equal the number of fields.

  4. Add a right parameter to split_n, which is False by default. If right=True, split from the right instead of the left, like Python's str.rsplit method.

  5. Add str.split_nth, which returns a pl.String column of just the nth field. This would also take a right argument. For instance:

pl.col.SNP.str.split_nth(':', n=0) gives 'rs12345'
pl.col.SNP.str.split_nth(':', n=1, right=True) gives 'A'

If n > 0 and the nth ':' isn't found, the result is null. If the nth ':' is found but the (n+1)th isn't, the result is the remainder of the string.

  6. Deprecate str.split_exact, since split_exact(n=n) can be expressed as split_n(n=n+1, end=n).

Wainberg avatar Jan 11 '24 19:01 Wainberg

Is there a specific use case for _n and _exact returning a Struct instead of a List?

I would have assumed they returned a List.

[Update]: https://github.com/pola-rs/polars/issues/4359#issuecomment-1211680992

A list type has a more expensive memory format, as it is designed to deal with varying-length elements. A struct can be decomposed into all of its fields at zero cost.

cmdlineluser avatar Jan 11 '24 20:01 cmdlineluser

Shouldn't we be using an array then? Which is polars' fixed length list.

mkleinbort avatar Jan 12 '24 00:01 mkleinbort

@mkleinbort like @cmdlineluser pointed out in their edited post, the advantage of struct over list/array is that you can get each struct field as a column with zero cost. That's typically what you want when you split_n/split_exact: for each part of the string to become a column.

Wainberg avatar Jan 12 '24 01:01 Wainberg

I see. I assumed the arr data type was at least as efficient as a struct. Noted.

mkleinbort avatar Jan 12 '24 01:01 mkleinbort

@ritchie46 curious to hear your thoughts on split_nth in particular - seems like a win for efficiency as well as usability.

Wainberg avatar Jan 30 '24 00:01 Wainberg

Perhaps related to this, once I have split a string, I often want to create columns from the split. The docs say to do this:

df.with_columns(
    [
        pl.col("x")
        .str.split_exact("_", 1)
        .struct.rename_fields(["first_part", "second_part"])
        .alias("fields"),
    ]
).unnest("fields")

But I really want to do this instead:

df.with_columns(
    [
        pl.col("x")
        .str.split_exact("_", 1)
        .struct.rename_fields(["first_part", "second_part"])
        .unnest()
    ]
)

Would that make sense?

Or perhaps even a struct expression which combines renaming and unnesting?

df.with_columns(
    [
        pl.col("x")
        .str.split_exact("_", 1)
        .struct.unnest_to(["first_part", "second_part"])
    ]
)

daviewales avatar Feb 01 '24 04:02 daviewales

Expression level unnest pl.Expr.unnest() has been requested a few times, but polars is not able to support it at the moment. I want the feature too, but it violates some very core design decisions that underpin all expressions.

mkleinbort avatar Feb 01 '24 08:02 mkleinbort

@daviewales You can also create the columns directly without unnest:

fields = pl.col("x").str.split_exact("_", 1).struct

df.with_columns(
    first_part = fields[0],
    second_part = fields[1]
)

# shape: (4, 3)
# ┌──────┬────────────┬─────────────┐
# │ x    ┆ first_part ┆ second_part │
# │ ---  ┆ ---        ┆ ---         │
# │ str  ┆ str        ┆ str         │
# ╞══════╪════════════╪═════════════╡
# │ a_1  ┆ a          ┆ 1           │
# │ null ┆ null       ┆ null        │
# │ c    ┆ c          ┆ null        │
# │ d_4  ┆ d          ┆ 4           │
# └──────┴────────────┴─────────────┘

It's a bit awkward having to use a variable to avoid repeating the longer expression.

cmdlineluser avatar Feb 01 '24 11:02 cmdlineluser