Rename `rle()` struct fields to `len` and `value`
Description
Remove the plural from Series.rle() and Expr.rle() field names.
(Similar to what was done for value_counts: https://github.com/pola-rs/polars/issues/11462)
Current:
>>> pl.Series([1, 1, 2, 3]).rle().struct.unnest()
shape: (3, 2)
┌─────────┬────────┐
│ lengths ┆ values │
│ ---     ┆ ---    │
│ i32     ┆ i64    │
╞═════════╪════════╡
│ 2       ┆ 1      │
│ 1       ┆ 2      │
│ 1       ┆ 3      │
└─────────┴────────┘
Desired:
>>> pl.Series([1, 1, 2, 3]).rle().struct.unnest()
shape: (3, 2)
┌─────────┬────────┐
│ len     ┆ value  │
│ ---     ┆ ---    │
│ i32     ┆ i64    │
╞═════════╪════════╡
│ 2       ┆ 1      │
│ 1       ┆ 2      │
│ 1       ┆ 3      │
└─────────┴────────┘
(Choosing len to match up with list.len())
Side note: Would it make sense for rle() to also return the row index? {index, value, len}
The particular use case is wanting the original row index of each run after performing a .filter():
import polars as pl

df = pl.DataFrame({"foo": ["a", "a", "a", "b", "c", "c"]})
df.select(pl.col("foo").rle())
# shape: (3, 1)
# ┌───────────┐
# │ foo       │
# │ ---       │
# │ struct[2] │
# ╞═══════════╡
# │ {3,"a"}   │
# │ {1,"b"}   │
# │ {2,"c"}   │
# └───────────┘
We can calculate it from the length, but it's a little awkward:
(df.select(pl.col("foo").rle())
   .with_columns(
       index = pl.col("foo").struct["lengths"].cum_sum().shift().fill_null(0)
   )
   #.filter(...)
)
# shape: (3, 2)
# ┌───────────┬───────┐
# │ foo       ┆ index │
# │ ---       ┆ ---   │
# │ struct[2] ┆ i32   │
# ╞═══════════╪═══════╡
# │ {3,"a"}   ┆ 0     │
# │ {1,"b"}   ┆ 3     │
# │ {2,"c"}   ┆ 4     │
# └───────────┴───────┘
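For reference, the index calculation above is an exclusive cumulative sum of the run lengths: each run starts where the previous runs end. A minimal pure-Python sketch of the same idea follows. It does not use polars, and `rle_with_index` is a hypothetical helper name chosen for illustration, not an existing or proposed polars API.

```python
from itertools import accumulate, groupby


def rle_with_index(values):
    """Run-length encode `values`, also returning each run's starting row index.

    Hypothetical helper for illustration only; not part of polars.
    """
    # Collapse consecutive equal items into (value, run_length) pairs.
    runs = [(value, sum(1 for _ in group)) for value, group in groupby(values)]
    lengths = [length for _, length in runs]
    # Exclusive cumulative sum: the same idea as cum_sum().shift().fill_null(0).
    starts = [0, *accumulate(lengths)][:-1]
    return [
        {"index": start, "len": length, "value": value}
        for start, (value, length) in zip(starts, runs)
    ]


result = rle_with_index(["a", "a", "a", "b", "c", "c"])
```

For the `foo` column above, this gives start indices 0, 3 and 4, matching the `index` column in the polars output.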
Agreed on the rename.
I don't think the index should be part of the RLE method by default. It is not an essential part of the RLE definition. Though possibly an include_index parameter would make sense - but please open a separate issue for that.
@cmdlineluser I am tempted to also change the field order of the struct to value/len. That way it matches value_counts. What do you think?
EDIT: Never mind, it's probably not a good idea, as the standard RLE notation places the length before the value, e.g. 12W1B12W3B24W1B14W.
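To illustrate the length-before-value convention, here is a minimal pure-Python sketch of a textual run-length encoder. The function name `rle_encode` and the input row are assumptions made up for this example (the row is constructed to reproduce the string quoted above); nothing here is polars code.

```python
from itertools import groupby


def rle_encode(text):
    """Encode each run as '<length><char>', with the length written first.

    Hypothetical helper for illustration only.
    """
    return "".join(
        f"{sum(1 for _ in group)}{char}" for char, group in groupby(text)
    )


# Illustrative input: runs of white (W) and black (B) pixels.
row = "W" * 12 + "B" + "W" * 12 + "B" * 3 + "W" * 24 + "B" + "W" * 14
encoded = rle_encode(row)
print(encoded)  # 12W1B12W3B24W1B14W
```

Keeping `len` ahead of `value` in the struct mirrors this ordering, at the cost of differing from the field order used by value_counts.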