polars Keep original order of rows for polars.cut()

It would be very useful to be able to keep the row order of the original series/column when using pl.cut().

Aug 06 '22 07:08 tzeitim

Let me see if I can help. Are you suggesting something like a maintain_order keyword on cut?

For demonstration, let's wrap the existing polars.cut to add a maintain_order keyword:

from typing import Optional

import polars

def cut(
    s: polars.internals.series.Series,
    bins: list[float],
    labels: Optional[list[str]] = None,
    break_point_label: str = "break_point",
    category_label: str = "category",
    maintain_order: bool = False,
) -> polars.internals.frame.DataFrame:

    if maintain_order:
        # argsort()[i] is the original position of the i-th smallest value,
        # i.e. of row i in the sorted output of polars.cut.
        _arg_sort = polars.Series(name="_arg_sort", values=s.argsort())

    result = polars.cut(s, bins, labels, break_point_label, category_label)

    if maintain_order:
        # Attach the original positions, sort by them to restore the
        # original row order, then drop the helper column.
        result = (
            result
            .select([
                polars.all(),
                _arg_sort,
            ])
            .sort('_arg_sort')
            .drop('_arg_sort')
        )

    return result
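
The order-restoring step in the wrapper above can be checked without polars. The following is only a plain-Python sketch of the same idea: cut_sorted is a hypothetical stand-in for a cut that returns its rows sorted by value, and bin indices stand in for the category labels. It scatters each sorted row back to its original position, which should be equivalent to sorting by the _arg_sort column.

```python
from bisect import bisect_left

def cut_sorted(values, bins):
    """Hypothetical stand-in for a cut that returns rows sorted by value.

    Each row is (value, bin_index); bins are right-closed, as in (2.0, 4.0].
    """
    return [(v, bisect_left(bins, v)) for v in sorted(values)]

def argsort(values):
    """Indices that would sort `values` (stable for ties)."""
    return sorted(range(len(values)), key=values.__getitem__)

def cut_keep_order(values, bins):
    # order[i] is the original position of the i-th row of the sorted output.
    order = argsort(values)
    sorted_rows = cut_sorted(values, bins)
    restored = [None] * len(values)
    for original_pos, row in zip(order, sorted_rows):
        restored[original_pos] = row  # scatter back to the original position
    return restored

values = [4.0, 1, 3, 4, 4, 1]
result = cut_keep_order(values, [2, 4])
# result == [(4.0, 1), (1, 0), (3, 1), (4, 1), (4, 1), (1, 0)]
```

Ties do not cause trouble here: rows with equal values also get equal bins, so it does not matter which of them lands in which tied position.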

Now, if we start with a series like this:

my_series = polars.Series(
    name="my_series",
    values=[4.0, 1, 3, 4, 4, 1],
)
my_series

shape: (6,)
Series: 'my_series' [f64]
[
        4.0
        1.0
        3.0
        4.0
        4.0
        1.0
]

We could maintain the original order of the Series with:

cut(my_series, [2, 4], maintain_order=True)

shape: (6, 3)
┌───────────┬─────────────┬─────────────┐
│ my_series ┆ break_point ┆ category    │
│ ---       ┆ ---         ┆ ---         │
│ f64       ┆ f64         ┆ cat         │
╞═══════════╪═════════════╪═════════════╡
│ 4.0       ┆ 4.0         ┆ (2.0, 4.0]  │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1.0       ┆ 2.0         ┆ (-inf, 2.0] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3.0       ┆ 4.0         ┆ (2.0, 4.0]  │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4.0       ┆ 4.0         ┆ (2.0, 4.0]  │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4.0       ┆ 4.0         ┆ (2.0, 4.0]  │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1.0       ┆ 2.0         ┆ (-inf, 2.0] │
└───────────┴─────────────┴─────────────┘

I could see where the above would be helpful when the Series was derived from a large DataFrame. If cut can restore the original order, then hstack can be used to add the categorical variable created by cut directly back to the original DataFrame.

(Another workaround is to sort the original DataFrame by the series used in cut, and then hstack the results of the existing polars.cut ... but that potentially means sorting a large DataFrame with many columns.)

And for those who don't want the additional overhead of restoring the original order:

cut(my_series, [2, 4])

┌───────────┬─────────────┬─────────────┐
│ my_series ┆ break_point ┆ category    │
│ ---       ┆ ---         ┆ ---         │
│ f64       ┆ f64         ┆ cat         │
╞═══════════╪═════════════╪═════════════╡
│ 1.0       ┆ 2.0         ┆ (-inf, 2.0] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1.0       ┆ 2.0         ┆ (-inf, 2.0] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3.0       ┆ 4.0         ┆ (2.0, 4.0]  │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4.0       ┆ 4.0         ┆ (2.0, 4.0]  │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4.0       ┆ 4.0         ┆ (2.0, 4.0]  │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4.0       ┆ 4.0         ┆ (2.0, 4.0]  │
└───────────┴─────────────┴─────────────┘

If the above is suitable, I would politely recommend that maintain_order=False be the default, due to the additional overhead of restoring the original order of the data. As an example, polars.cut is being used to create histograms for exploratory data analysis (#4240).

(
    cut(my_series, [2, 4])
    .groupby('category')
    .count()
    .sort('category')
)

shape: (2, 2)
┌─────────────┬───────┐
│ category    ┆ count │
│ ---         ┆ ---   │
│ cat         ┆ u32   │
╞═════════════╪═══════╡
│ (-inf, 2.0] ┆ 2     │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ (2.0, 4.0]  ┆ 4     │
└─────────────┴───────┘

In the above case, restoring the original order does not help with the histogram, but represents a performance penalty.
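
To make the point concrete, here is a small plain-Python sketch (no polars; bin_counts is a hypothetical helper, and bin indices stand in for the category labels) showing that the per-bin counts are the same whether or not the rows are in their original order:

```python
from bisect import bisect_left
from collections import Counter

def bin_counts(values, bins):
    """Count values per right-closed bin; keys are bin indices."""
    return Counter(bisect_left(bins, v) for v in values)

original = [4.0, 1, 3, 4, 4, 1]
counts_original = bin_counts(original, [2, 4])
counts_sorted = bin_counts(sorted(original), [2, 4])
# Both are Counter({1: 4, 0: 2}): 2 values in (-inf, 2], 4 values in (2, 4].
```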

Aug 06 '22 23:08 cbilot

Thanks for the answer! Yes, my suggestion/feature request is to include an argument to cut to preserve the original order for the exact reason you mentioned (stacking a column to a pre-existing data frame).

I agree that having maintain_order=False as the default makes sense, so that the most performant variant of the function is what you get out of the box, especially when it is invoked as a standalone function (e.g. pl.cut()). I am not so sure in the context of an expression, though (if at some point cut gets to that level), e.g.:

df.with_column(pl.col('whatever_column').cut(bins=[1.1, 2, 10, 100]))

And well, even in this scenario the default option could still be maintain_order=False, just like in groupby.

In any case, a keyword option to maintain order would be really useful.

Aug 07 '22 09:08 tzeitim

When this is implemented, please let me know and I'll update the Rust version of cut() to match the behavior.

Sep 04 '22 15:09 hpux735

> When this is implemented, please let me know and I'll update the Rust version of cut() to match the behavior.

I think we can now actually replace the python function with your work. ;)

Sep 04 '22 18:09 ritchie46

> Thanks for the answer! Yes, my suggestion/feature request is to include an argument to cut to preserve the original order for the exact reason you mentioned (stacking a column to a pre-existing data frame).
>
> I agree that having maintain_order=False as a default makes sense in order to have the most performant variation of the function on top, especially when invoked as a standalone function (e.g. pl.cut()) but I am not so sure in the context of an expression (if at some point cut gets to that level), e.g.:
>
> df.with_column(pl.col('whatever_column').cut(bins=[1.1, 2, 10, 100]))
>
> And well, even in this scenario the default option could still be maintain_order=False, just like in groupby.
>
> In any case, a keyword option to maintain order would be really useful.

This is exactly what I was trying to achieve, and it took me some time before landing here. I agree that it would be really nice behavior, so we can "binarize" a column and stack the result.

> (Another workaround is to sort the original DataFrame by the series used in cut, and then hstack the results of the existing polars.cut ... but that potentially means sorting a large DataFrame with many columns.)

However, doing this worked perfectly in my use case (yes, it's slower, but for my work that was OK).

Maybe add a note to the documentation saying that the function does not keep the original order (which beginners like me unfortunately assume)?

Thanks for your work!

Nov 03 '22 14:11 PierreSnell

Just hit the same issue! pl.col().cut() would be highly appreciated :)

Nov 12 '22 00:11 Hoeze

I would like to add that it would be nice to have an option to "autocut" as well, where we just specify how many bins we want and it decides the break points using a uniform distribution based on count.
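
One possible reading of this request is equal-count (quantile-based) bins. The sketch below is plain Python using only the standard library, not an existing polars feature; quantile_breaks is a hypothetical helper name, and the "inclusive" quantile method is just one reasonable choice.

```python
from bisect import bisect_left
from collections import Counter
from statistics import quantiles

def quantile_breaks(values, n_bins):
    """Hypothetical helper: n_bins - 1 break points giving roughly equal-count bins."""
    if n_bins < 2:
        return []
    # statistics.quantiles returns the n - 1 cut points between n groups.
    return quantiles(values, n=n_bins, method="inclusive")

data = list(range(1, 101))
breaks = quantile_breaks(data, 4)

# Assign each value to a right-closed bin and count the bin sizes.
counts = Counter(bisect_left(breaks, v) for v in data)
```

The resulting break points could then be passed as the bins argument of cut. With heavily tied data the bins will not be exactly equal in size.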

Dec 04 '22 11:12 ArthurJ

I think aside from the main topic of having a feature to maintain order, it’s also important to make clear in the docs that the current version does not. It’s all too easy to assume wrong and get incorrect results without noticing, like I did.
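
As a rough illustration of how this can go wrong silently, here is a plain-Python sketch (no polars; bin indices stand in for the category labels). It pairs labels produced in sorted order with rows that are still in their original order. Nothing raises an error, but some rows end up with the wrong label.

```python
from bisect import bisect_left

values = [4.0, 1, 3, 4, 4, 1]
bins = [2, 4]

# Labels in the order a sorting cut would return them.
labels_sorted = [bisect_left(bins, v) for v in sorted(values)]
# Labels correctly aligned with the original row order.
labels_aligned = [bisect_left(bins, v) for v in values]

# Rows that would be mislabeled if the sorted labels were stacked as-is.
mismatches = [
    i for i, (a, b) in enumerate(zip(labels_sorted, labels_aligned)) if a != b
]
# mismatches == [0, 5]
```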

Jan 26 '23 03:01 a-reich

Be aware of #7058

Feb 28 '23 12:02 tzeitim

I just discovered that maintain_order=False is the default. If the order is not maintained, the only way to use the cut column is to join it back onto the original dataframe. In that case, cut should only return unique rows. The current behavior doesn't make any sense.

Jun 26 '23 20:06 s-banach
