Improve string split API and DataTypes (`split`, `splitn`, `split_exact`)

Open Julian-J-S opened this issue 8 months ago • 2 comments

Description

I wanted to bring this back once more before "locking in the API" with V1.0 (see also #11640, #13649)

Example

import polars as pl

pl.DataFrame({"str": ["hello world !", "a b c d e"]}).with_columns(
    split=pl.col("str").str.split(" "),  # list of all parts
    split_exact=pl.col("str").str.split_exact(" ", n=2),  # n splits -> n+1 fields
    splitn=pl.col("str").str.splitn(" ", n=2),  # at most n fields
)

# shape: (2, 4)
# ┌───────────────┬───────────────────────────┬───────────────────────┬─────────────────────┐
# │ str           ┆ split                     ┆ split_exact           ┆ splitn              │
# │ ---           ┆ ---                       ┆ ---                   ┆ ---                 │
# │ str           ┆ list[str]                 ┆ struct[3]             ┆ struct[2]           │
# ╞═══════════════╪═══════════════════════════╪═══════════════════════╪═════════════════════╡
# │ hello world ! ┆ ["hello", "world", "!"]   ┆ {"hello","world","!"} ┆ {"hello","world !"} │
# │ a b c d e     ┆ ["a", "b", "c", "d", "e"] ┆ {"a","b","c"}         ┆ {"a","b c d e"}     │
# └───────────────┴───────────────────────────┴───────────────────────┴─────────────────────┘

Problems

  • splitn and split_exact have the same signature, but their n parameter behaves differently (confusing!): in the example above, n=2 yields three fields for split_exact and two fields for splitn (explained in detail in #11640)
  • splitn and split_exact should probably return an Array, which is more appropriate for a fixed-length collection of same-typed values whose field names carry no meaning
  • splitn should probably be called split_n, following polars naming conventions
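
To make the first point concrete, here is a rough pure-Python model of how the two `n` parameters differ. It is only a sketch based on the example output above, not the polars implementation; the helper names `split_exact_model` and `splitn_model` are made up for illustration, and missing fields are modelled as `None`.

```python
from typing import Optional, Tuple


def split_exact_model(s: str, by: str, n: int) -> Tuple[Optional[str], ...]:
    """Model of split_exact: perform n splits, keep the first n + 1 parts.

    Anything after the (n + 1)-th part is dropped; missing parts become None.
    """
    parts = s.split(by)[: n + 1]
    # Pad so the result always has exactly n + 1 fields.
    return tuple(parts + [None] * (n + 1 - len(parts)))


def splitn_model(s: str, by: str, n: int) -> Tuple[Optional[str], ...]:
    """Model of splitn: return at most n parts, the last one holds the remainder.

    Assumes n >= 1; missing parts become None.
    """
    if n < 1:
        raise ValueError("n must be >= 1 in this sketch")
    parts = s.split(by, n - 1)
    # Pad so the result always has exactly n fields.
    return tuple(parts + [None] * (n - len(parts)))


# Same arguments, different number of fields and different treatment of the tail.
print(split_exact_model("a b c d e", " ", 2))
print(splitn_model("a b c d e", " ", 2))
```

With the same `n=2`, the first helper returns three fields and discards the tail, while the second returns two fields and keeps the tail in the last one, which mirrors the `struct[3]` and `struct[2]` columns in the example.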

Suggested Improvement

  • consolidate splitn and split_exact (more detailed suggestion here #11640 or #13649)
  • change return type of splitn / split_exact to Array

Julian-J-S, May 29 '24 09:05