Improve string split API and DataTypes (`split`, `splitn`, `split_exact`)
Description
I wanted to bring this up once more before "locking in the API" with V1.0 (see also #11640, #13649).
Example
import polars as pl

pl.DataFrame({"str": ["hello world !", "a b c d e"]}).with_columns(
    split=pl.col("str").str.split(" "),
    split_exact=pl.col("str").str.split_exact(" ", n=2),
    splitn=pl.col("str").str.splitn(" ", n=2),
)
# shape: (2, 4)
# ┌───────────────┬───────────────────────────┬───────────────────────┬─────────────────────┐
# │ str           ┆ split                     ┆ split_exact           ┆ splitn              │
# │ ---           ┆ ---                       ┆ ---                   ┆ ---                 │
# │ str           ┆ list[str]                 ┆ struct[3]             ┆ struct[2]           │
# ╞═══════════════╪═══════════════════════════╪═══════════════════════╪═════════════════════╡
# │ hello world ! ┆ ["hello", "world", "!"]   ┆ {"hello","world","!"} ┆ {"hello","world !"} │
# │ a b c d e     ┆ ["a", "b", "c", "d", "e"] ┆ {"a","b","c"}         ┆ {"a","b c d e"}     │
# └───────────────┴───────────────────────────┴───────────────────────┴─────────────────────┘
Problems
- `splitn` and `split_exact` have the same signature, but their parameters behave differently, which is confusing (explained in detail in #11640).
- `splitn` and `split_exact` should probably return an `Array`, which is more appropriate for a fixed-length list of same-typed values with no meaningful field names.
- `splitn` should probably be called `split_n`, in line with the Polars naming convention.
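
To make the first point concrete, here is a minimal pure-Python sketch of how `n` is interpreted by each method, reconstructed from the example output above. The helper names `split_exact_model` and `splitn_model` are made up for illustration and are not part of Polars. The `None` padding for short inputs reflects my understanding of the documented behavior, not something shown in the example.

```python
from typing import List, Optional


def split_exact_model(s: str, by: str, n: int) -> List[Optional[str]]:
    """Hypothetical model of `split_exact`: `n` counts splits, so the result
    has n + 1 fields and anything after them is dropped."""
    parts = s.split(by)[: n + 1]
    # Pad with None when the string has fewer than n + 1 pieces (assumption).
    return parts + [None] * (n + 1 - len(parts))


def splitn_model(s: str, by: str, n: int) -> List[Optional[str]]:
    """Hypothetical model of `splitn`: `n` counts fields (n >= 1 assumed),
    and the last field keeps the unsplit remainder."""
    parts = s.split(by, n - 1)
    # Pad with None when the string has fewer than n pieces (assumption).
    return parts + [None] * (n - len(parts))


print(split_exact_model("a b c d e", " ", 2))  # ['a', 'b', 'c']
print(splitn_model("a b c d e", " ", 2))       # ['a', 'b c d e']
```

With the same `n=2`, one call yields three fields and the other yields two, which is the off-by-one that makes the shared signature confusing.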
Suggested Improvement
- Consolidate `splitn` and `split_exact` (more detailed suggestions in #11640 and #13649).
- Change the return type of `splitn` / `split_exact` to `Array`.
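
As a rough illustration only, one possible shape for a consolidated function is sketched below in plain Python. This is not the proposal from #11640 or #13649; the name `split_fixed` and the parameters `width` and `keep_remainder` are hypothetical. The fixed-length tuple stands in for an `Array` return type: same element type, fixed width, no field names.

```python
from typing import Optional, Tuple


def split_fixed(
    s: str, by: str, width: int, keep_remainder: bool = False
) -> Tuple[Optional[str], ...]:
    """Hypothetical consolidated split: `width` always means the number of
    output fields, regardless of mode."""
    if width < 1:
        raise ValueError("width must be at least 1")
    if keep_remainder:
        # Like the current `splitn`: the last field holds the unsplit rest.
        parts = s.split(by, width - 1)
    else:
        # Like the current `split_exact`: extra pieces are dropped.
        parts = s.split(by)[:width]
    # Pad short results with None so the length is always `width`.
    return tuple(parts) + (None,) * (width - len(parts))


print(split_fixed("a b c d e", " ", 3))                       # ('a', 'b', 'c')
print(split_fixed("a b c d e", " ", 2, keep_remainder=True))  # ('a', 'b c d e')
```

The point of the sketch is that a single parameter with one meaning (output width) would remove the current ambiguity. Whether a flag, two methods, or something else is the right surface is left to the linked discussions.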