
Rename `rle()` struct fields to `len` and `value`

Open cmdlineluser opened this issue 1 year ago • 3 comments

Description

Remove the plural from Series.rle() and Expr.rle() field names.

(Similar to what was done for value_counts: https://github.com/pola-rs/polars/issues/11462)

Current:

>>> pl.Series([1, 1, 2, 3]).rle().struct.unnest()
shape: (3, 2)
┌─────────┬────────┐
│ lengths ┆ values │
│ ---     ┆ ---    │
│ i32     ┆ i64    │
╞═════════╪════════╡
│ 2       ┆ 1      │
│ 1       ┆ 2      │
│ 1       ┆ 3      │
└─────────┴────────┘

Desired:

>>> pl.Series([1, 1, 2, 3]).rle().struct.unnest()
shape: (3, 2)
┌─────────┬────────┐
│ len     ┆ value  │
│ ---     ┆ ---    │
│ i32     ┆ i64    │
╞═════════╪════════╡
│ 2       ┆ 1      │
│ 1       ┆ 2      │
│ 1       ┆ 3      │
└─────────┴────────┘

(Choosing len to match up with list.len())

cmdlineluser avatar Mar 22 '24 13:03 cmdlineluser

Side note: Would it make sense for rle() to also return the row index? {index, value, len}

The particular use case is wanting the original row index of each run after performing a .filter()

df = pl.DataFrame({"foo": ["a", "a", "a", "b", "c", "c"]})

df.select(pl.col("foo").rle())
# shape: (3, 1)
# ┌───────────┐
# │ foo       │
# │ ---       │
# │ struct[2] │
# ╞═══════════╡
# │ {3,"a"}   │
# │ {1,"b"}   │
# │ {2,"c"}   │
# └───────────┘

We can calculate it from the length, but it's a little awkward:

(df.select(pl.col("foo").rle())
   .with_columns(
      index = pl.col("foo").struct["lengths"].cum_sum().shift().fill_null(0)
   )
   #.filter(...)
)
# shape: (3, 2)
# ┌───────────┬───────┐
# │ foo       ┆ index │
# │ ---       ┆ ---   │
# │ struct[2] ┆ i32   │
# ╞═══════════╪═══════╡
# │ {3,"a"}   ┆ 0     │
# │ {1,"b"}   ┆ 3     │
# │ {2,"c"}   ┆ 4     │
# └───────────┴───────┘
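
For illustration only, the same start-index logic can be sketched in plain Python, independent of Polars. `rle_with_index` is a hypothetical helper name, not part of the Polars API. It shows that the index of each run is the running total of the lengths of the runs before it:

```python
from itertools import groupby


def rle_with_index(values):
    """Return (index, len, value) triples, one per run of equal adjacent items.

    Hypothetical helper for illustration; not a Polars function.
    """
    runs = []
    start = 0  # row index where the current run begins
    for value, group in groupby(values):
        length = sum(1 for _ in group)
        runs.append((start, length, value))
        start += length  # the next run starts right after this one
    return runs


runs = rle_with_index(["a", "a", "a", "b", "c", "c"])
for index, length, value in runs:
    print(index, length, value)
```

Because each index is stored with its run, filtering the list of triples afterwards keeps the original row positions.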

cmdlineluser avatar Mar 22 '24 13:03 cmdlineluser

Agreed on the rename.

I don't think the index should be part of the RLE method by default. It is not an essential part of the RLE definition. Though possibly an include_index parameter would make sense - but please open a separate issue for that.

stinodego avatar Mar 22 '24 14:03 stinodego

@cmdlineluser I am tempted to also change the field order of the struct to value/len. That way it matches value_counts. What do you think?

EDIT: Never mind, it's probably not a good idea, as the standard RLE notation places len before value, e.g. 12W1B12W3B24W1B14W.
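
As a rough sketch of that convention (plain Python, with `rle_encode` as a hypothetical helper name and the input string built to match the example above), each run is written as its length followed by its value:

```python
from itertools import groupby


def rle_encode(text):
    """Encode a string as length-then-value pairs, e.g. 'WWB' -> '2W1B'.

    Hypothetical helper for illustration; not a Polars function.
    """
    return "".join(f"{sum(1 for _ in group)}{char}" for char, group in groupby(text))


# Input assembled so the runs are 12 W, 1 B, 12 W, 3 B, 24 W, 1 B, 14 W.
pixels = "W" * 12 + "B" + "W" * 12 + "B" * 3 + "W" * 24 + "B" + "W" * 14
encoded = rle_encode(pixels)
print(encoded)
```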

stinodego avatar Mar 23 '24 07:03 stinodego