`split_exact` suggestions

Open physinet opened this issue 3 years ago • 1 comments

I'm looking for a feature similar to Python's maxsplit option in str.split. split_exact appears to address a similar need, although it looks like it splits on every occurrence of the separator and only returns the first n+1 segments:

>>> "a_b_c_d".split("_", maxsplit=1)
['a', 'b_c_d']
>>> pl.Series(["a_b_c_d"]).str.split_exact("_", 1)
shape: (1,)
Series: '' [struct[2]]
[
	{"a","b"}
]

I'm not sure whether this is the intended behavior of split_exact, but I would find it more useful if the result matched Python's maxsplit behavior. Either way, I would love for this to be the default or a configurable option.
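To make the difference concrete, here is a minimal pure-Python sketch of the two semantics. The helper names `split_maxsplit` and `split_then_truncate` are hypothetical and exist only for illustration; the second one emulates the split_exact output shown above rather than reproducing how polars implements it.

```python
def split_maxsplit(text: str, sep: str, n: int) -> list[str]:
    # Python's behavior: split at most n times, keep the remainder intact.
    return text.split(sep, n)


def split_then_truncate(text: str, sep: str, n: int) -> list[str]:
    # Emulates the observed split_exact result: split on every separator,
    # then keep only the first n + 1 segments and drop the rest.
    return text.split(sep)[: n + 1]


print(split_maxsplit("a_b_c_d", "_", 1))
print(split_then_truncate("a_b_c_d", "_", 1))
```

The first call keeps `b_c_d` together as the last element, while the second discards everything after `b`.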

The struct-type return is also a little awkward, and I'd prefer a list instead. Is that a reasonable change to make?

physinet avatar Aug 10 '22 19:08 physinet

> The struct-type return is also a little awkward, and I'd prefer a list instead. Is that a reasonable change to make?

A list type has a more expensive memory format because it is designed to handle elements of varying length. A struct can be decomposed into its fields at zero cost.

Regarding maxsplit: that looks like a different function from split_exact. Maybe we can add it as well.

ritchie46 avatar Aug 11 '22 08:08 ritchie46

Addressed via #4373

physinet avatar Aug 25 '22 18:08 physinet