polars
Keep original order of rows for polars.cut()
It would be very useful to be able to keep the order of rows from the original series/column when using pl.cut()
Let me see if I can help. Are you suggesting something like a maintain_order keyword on cut?
For demonstration, let's wrap the existing polars.cut to add a maintain_order keyword:
from typing import Optional

import polars as polars


def cut(
    s: polars.internals.series.Series,
    bins: list[float],
    labels: Optional[list[str]] = None,
    break_point_label: str = "break_point",
    category_label: str = "category",
    maintain_order: bool = False,
) -> polars.internals.frame.DataFrame:
    if maintain_order:
        _arg_sort = polars.Series(name="_arg_sort", values=s.argsort())

    result = polars.cut(s, bins, labels, break_point_label, category_label)

    if maintain_order:
        result = (
            result
            .select([
                polars.all(),
                _arg_sort,
            ])
            .sort('_arg_sort')
            .drop('_arg_sort')
        )

    return result
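To make the reasoning behind the wrapper explicit: it assumes polars.cut returns its rows sorted by value (as the outputs in this thread show), so row i of the result came from position argsort[i] of the input. Below is a minimal, library-independent sketch of that idea in plain Python. The helper name restore_order is hypothetical, and the sketch scatters rows back by index rather than sorting on a helper column, which has the same effect.

```python
def restore_order(sorted_rows, arg_sort):
    """Undo a sort: sorted_rows[i] originally sat at position arg_sort[i]."""
    restored = [None] * len(sorted_rows)
    for row, original_position in zip(sorted_rows, arg_sort):
        restored[original_position] = row
    return restored


values = [4.0, 1.0, 3.0, 4.0, 4.0, 1.0]

# Indices that would sort `values` (a stable arg-sort).
arg_sort = sorted(range(len(values)), key=values.__getitem__)

# What a sorting operation would hand back.
sorted_values = [values[i] for i in arg_sort]

# Putting each row back at its recorded position recovers the input order.
original_again = restore_order(sorted_values, arg_sort)
```

Ties do not matter here: rows with equal values get equal categories, so any consistent tie-breaking gives the same restored columns.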
Now, if we start with a series like this:
my_series = polars.Series(
name="my_series",
values=[4.0, 1, 3, 4, 4, 1],
)
my_series
shape: (6,)
Series: 'my_series' [f64]
[
4.0
1.0
3.0
4.0
4.0
1.0
]
We could maintain the original order of the Series with:
cut(my_series, [2, 4], maintain_order=True)
shape: (6, 3)
┌───────────┬─────────────┬─────────────┐
│ my_series ┆ break_point ┆ category │
│ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ cat │
╞═══════════╪═════════════╪═════════════╡
│ 4.0 ┆ 4.0 ┆ (2.0, 4.0] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1.0 ┆ 2.0 ┆ (-inf, 2.0] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3.0 ┆ 4.0 ┆ (2.0, 4.0] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4.0 ┆ 4.0 ┆ (2.0, 4.0] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4.0 ┆ 4.0 ┆ (2.0, 4.0] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1.0 ┆ 2.0 ┆ (-inf, 2.0] │
└───────────┴─────────────┴─────────────┘
I can see the above being helpful when the Series is derived from a large DataFrame. If cut can restore the original order, then hstack can be used to add the categorical variable created by cut directly back to the original DataFrame.
(Another workaround is to sort the original DataFrame by the series used in cut, and then hstack the results of the existing polars.cut ... but that potentially means sorting a large DataFrame with many columns.)
And for those who don't want the additional overhead of restoring the original order:
cut(my_series, [2, 4])
┌───────────┬─────────────┬─────────────┐
│ my_series ┆ break_point ┆ category │
│ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ cat │
╞═══════════╪═════════════╪═════════════╡
│ 1.0 ┆ 2.0 ┆ (-inf, 2.0] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1.0 ┆ 2.0 ┆ (-inf, 2.0] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3.0 ┆ 4.0 ┆ (2.0, 4.0] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4.0 ┆ 4.0 ┆ (2.0, 4.0] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4.0 ┆ 4.0 ┆ (2.0, 4.0] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4.0 ┆ 4.0 ┆ (2.0, 4.0] │
└───────────┴─────────────┴─────────────┘
If the above is suitable, I would politely recommend that maintain_order=False be the default, due to the additional overhead of restoring the original order to the data. As an example, polars.cut is being used to create histograms for exploratory data analysis (#4240).
(
    cut(my_series, [2, 4])
    .groupby('category')
    .count()
    .sort('category')
)
shape: (2, 2)
┌─────────────┬───────┐
│ category    ┆ count │
│ ---         ┆ ---   │
│ cat         ┆ u32   │
╞═════════════╪═══════╡
│ (-inf, 2.0] ┆ 2     │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ (2.0, 4.0]  ┆ 4     │
└─────────────┴───────┘
In the above case, restoring the original order does not help with the histogram; it only adds a performance penalty.
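As a rough illustration of why row order is irrelevant to the histogram use case, here is a small plain-Python sketch (not the polars implementation). The names bin_labels and histogram are hypothetical, and the right-closed interval labels are only meant to mimic the category strings shown above.

```python
from bisect import bisect_left
from collections import Counter


def bin_labels(bins):
    """Build right-closed interval labels such as '(2.0, 4.0]'."""
    edges = [float("-inf"), *map(float, bins), float("inf")]
    return [f"({lo}, {hi}]" for lo, hi in zip(edges, edges[1:])]


def histogram(values, bins):
    """Count how many values fall into each right-closed bin."""
    labels = bin_labels(bins)
    # bisect_left places a value equal to a break point in the bin
    # that ends at that break point, i.e. intervals are closed on the right.
    return Counter(labels[bisect_left(bins, v)] for v in values)


values = [4.0, 1.0, 3.0, 4.0, 4.0, 1.0]
counts_original_order = histogram(values, [2, 4])
counts_sorted_order = histogram(sorted(values), [2, 4])
```

Both calls produce the same counts, so a caller who only aggregates by category gains nothing from paying to restore the input order.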
Thanks for the answer! Yes, my suggestion/feature request is to include an argument to cut to preserve the original order for the exact reason you mentioned (stacking a column to a pre-existing data frame).
I agree that having maintain_order=False as the default makes sense in order to have the most performant variation of the function on top, especially when invoked as a standalone function (e.g. pl.cut()), but I am not so sure in the context of an expression (if at some point cut gets to that level), e.g.:
df.with_column(pl.col('whatever_column').cut(bins=[1.1, 2, 10, 100]))
And well, even in this scenario the default option could still be maintain_order=False, just like in groupby.
In any case, a keyword option to maintain order would be really useful.
When this is implemented, please let me know and I'll update the Rust version of cut() to match the behavior.
> When this is implemented, please let me know and I'll update the Rust version of cut() to match the behavior.
I think we can now actually replace the python function with your work. ;)
> Thanks for the answer! Yes, my suggestion/feature request is to include an argument to cut to preserve the original order for the exact reason you mentioned (stacking a column to a pre-existing data frame).
>
> I agree that having maintain_order=False as a default makes sense in order to have the most performant variation of the function on top, especially when invoked as a standalone function (e.g. pl.cut()), but I am not so sure in the context of an expression (if at some point cut gets to that level), e.g.:
>
> df.with_column(pl.col('whatever_column').cut(bins=[1.1, 2, 10, 100]))
>
> And well, even in this scenario the default option could still be maintain_order=False, just like in groupby.
>
> In any case, a keyword option to maintain order would be really useful.
This is exactly what I was trying to achieve, and it took me some time before landing here. I agree that it would be a really nice behavior, so we can "binarize" a column and stack the result.
> (Another workaround is to sort the original DataFrame by the series used in cut, and then hstack the results of the existing polars.cut ... but that potentially means sorting a large DataFrame with many columns.)
However, doing this worked perfectly in my use case (yes, it's slower, but for my work that was OK).
Maybe add a note to the documentation indicating that the function does not keep the order (which beginners like me unfortunately assume)?
Thanks for your work!
Just hit the same issue! pl.col().cut() would be highly appreciated :)
I would like to add that it would be nice to have an option to "autocut" as well, where we just say how many bins we want and it decides the break points using a uniform distribution based on count.
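One possible reading of this "autocut" request is quantile-based break points, so that each bin receives roughly the same number of values. The sketch below uses only the Python standard library and is an interpretation, not an existing polars feature; the name auto_breaks is hypothetical.

```python
from statistics import quantiles


def auto_breaks(values, n_bins):
    """Return n_bins - 1 break points that split `values` into
    bins holding roughly equal numbers of observations."""
    if n_bins < 2:
        raise ValueError("n_bins must be at least 2")
    # method="inclusive" treats the data as the full population, so the
    # break points always lie between the minimum and maximum values.
    return quantiles(values, n=n_bins, method="inclusive")


data = list(range(1, 101))
breaks = auto_breaks(data, 4)  # three break points for four bins
```

The resulting list could then be passed as the bins argument of a cut-style function.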
I think aside from the main topic of having a feature to maintain order, it’s also important to make clear in the docs that the current version does not. It’s all too easy to assume wrong and get incorrect results without noticing, like I did.
Be aware of #7058
I just discovered that maintain_order=False is the default.
If the order is not maintained, the only way to use the cut column is to join it back onto the original dataframe.
In that case, cut should only return unique rows.
The current behavior doesn't make any sense.
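To illustrate the join-back argument, here is a minimal plain-Python sketch. It assumes the unordered cut result is available as (value, category) pairs; the variable names are hypothetical. Deduplicating those pairs into a lookup table is what makes the join safe, because joining on a key with repeated rows would multiply the rows of the original frame.

```python
# Unordered (value, category) pairs, as in the sorted cut output above.
cut_rows = [
    (1.0, "(-inf, 2.0]"),
    (1.0, "(-inf, 2.0]"),
    (3.0, "(2.0, 4.0]"),
    (4.0, "(2.0, 4.0]"),
    (4.0, "(2.0, 4.0]"),
    (4.0, "(2.0, 4.0]"),
]

# Keep one row per distinct value: this is the "unique rows" table.
lookup = dict(cut_rows)

# The original column, in its original order.
original = [4.0, 1.0, 3.0, 4.0, 4.0, 1.0]

# "Join" the category back by looking up each original value.
categories = [lookup[v] for v in original]
```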