
Support `ORDER BY` in conjunction with `over` method

CameronBieganek opened this issue 1 year ago · 33 comments

Suppose I have the following dataframe:

df = pl.DataFrame(dict(
    g = [1, 1, 1, 1, 2, 2, 2, 2],
    t = [1, 2, 3, 4, 4, 3, 2, 1],
    x = [10, 20, 30, 40, 10, 20, 30, 40]
))
shape: (8, 3)
┌─────┬─────┬─────┐
│ g   ┆ t   ┆ x   │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 1   ┆ 1   ┆ 10  │
│ 1   ┆ 2   ┆ 20  │
│ 1   ┆ 3   ┆ 30  │
│ 1   ┆ 4   ┆ 40  │
│ 2   ┆ 4   ┆ 10  │
│ 2   ┆ 3   ┆ 20  │
│ 2   ┆ 2   ┆ 30  │
│ 2   ┆ 1   ┆ 40  │
└─────┴─────┴─────┘

I would like to perform the equivalent of the following SQL query containing a window function:

SELECT
    g, t, x,
    LAG(x) OVER (
        PARTITION BY g
        ORDER BY t
    )
FROM
    df
ORDER BY
    g, t

Here's a DB Fiddle example that demonstrates this SQL query.

A naive attempt to use sort_by does not produce the expected result:

In [53]: df.with_columns(
    ...:     x_lag = pl.col("x").sort_by("t").shift(1).over("g")
    ...: )
Out[53]:
shape: (8, 4)
┌─────┬─────┬─────┬───────┐
│ g   ┆ t   ┆ x   ┆ x_lag │
│ --- ┆ --- ┆ --- ┆ ---   │
│ i64 ┆ i64 ┆ i64 ┆ i64   │
╞═════╪═════╪═════╪═══════╡
│ 1   ┆ 1   ┆ 10  ┆ null  │
│ 1   ┆ 2   ┆ 20  ┆ 10    │
│ 1   ┆ 3   ┆ 30  ┆ 20    │
│ 1   ┆ 4   ┆ 40  ┆ 30    │
│ 2   ┆ 4   ┆ 10  ┆ null  │
│ 2   ┆ 3   ┆ 20  ┆ 40    │
│ 2   ┆ 2   ┆ 30  ┆ 30    │
│ 2   ┆ 1   ┆ 40  ┆ 20    │
└─────┴─────┴─────┴───────┘

As far as I can tell, the only way to get the desired output is to ensure that the dataframe is sorted before the window function is applied. This produces the desired result:

In [52]: df.sort("g", "t").with_columns(
    ...:     x_lag = pl.col("x").shift(1).over("g")
    ...: )
Out[52]:
shape: (8, 4)
┌─────┬─────┬─────┬───────┐
│ g   ┆ t   ┆ x   ┆ x_lag │
│ --- ┆ --- ┆ --- ┆ ---   │
│ i64 ┆ i64 ┆ i64 ┆ i64   │
╞═════╪═════╪═════╪═══════╡
│ 1   ┆ 1   ┆ 10  ┆ null  │
│ 1   ┆ 2   ┆ 20  ┆ 10    │
│ 1   ┆ 3   ┆ 30  ┆ 20    │
│ 1   ┆ 4   ┆ 40  ┆ 30    │
│ 2   ┆ 1   ┆ 40  ┆ null  │
│ 2   ┆ 2   ┆ 30  ┆ 40    │
│ 2   ┆ 3   ┆ 20  ┆ 30    │
│ 2   ┆ 4   ┆ 10  ┆ 20    │
└─────┴─────┴─────┴───────┘

However, there are some issues with this. One problem is that procedural semantics are leaking into a declarative API. The dataframe must first be sorted, then the window function must be applied. Another problem is that it makes it risky to factor out column expressions into separate functions. I often factor out Polars expressions into their own functions, like this:

def x_lag():
    return pl.col("x").shift(1).over("g")

But this is very brittle, because the x_lag() expression will produce the intended result only if the dataframe has already been sorted by g and t. In my opinion, there ought to be an API that allows a column expression to specify the ORDER BY clause for a window function so that a window function column expression is self-contained and does not rely on the prior state of the dataframe.
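
For illustration, here is a sketch of the kind of self-contained expression that an ORDER BY-aware over would enable (the order_by argument shown is hypothetical at this point, not an existing parameter):

import polars as pl

def x_lag():
    # Hypothetical API: the expression carries its own ordering, so it no
    # longer depends on how the dataframe happens to be sorted.
    return pl.col("x").shift(1).over("g", order_by="t")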

CameronBieganek avatar May 03 '23 16:05 CameronBieganek

And then the other columns would stay as is? Or do you want the ORDER BY to sort the other columns iff a window function has an order by clause?

ritchie46 avatar May 03 '23 18:05 ritchie46

Yes, the other columns would stay as is. I'm not actually requesting the output dataframe to be sorted in any way. So, the ideal output would be this:

shape: (8, 4)
┌─────┬─────┬─────┬───────┐
│ g   ┆ t   ┆ x   ┆ x_lag │
│ --- ┆ --- ┆ --- ┆ ---   │
│ i64 ┆ i64 ┆ i64 ┆ i64   │
╞═════╪═════╪═════╪═══════╡
│ 1   ┆ 1   ┆ 10  ┆ null  │
│ 1   ┆ 2   ┆ 20  ┆ 10    │
│ 1   ┆ 3   ┆ 30  ┆ 20    │
│ 1   ┆ 4   ┆ 40  ┆ 30    │
│ 2   ┆ 4   ┆ 10  ┆ 20    │
│ 2   ┆ 3   ┆ 20  ┆ 30    │
│ 2   ┆ 2   ┆ 30  ┆ 40    │
│ 2   ┆ 1   ┆ 40  ┆ null  │
└─────┴─────┴─────┴───────┘

CameronBieganek avatar May 03 '23 18:05 CameronBieganek

Right, if the sorting remains within that column, it would adhere to our expression rules and I would be willing to add this.

If I have capacity, I can look into this. Or someone is free to make a PR. But if you do, let's discuss first, I am opinionated on window functions and their performance.

ritchie46 avatar May 03 '23 18:05 ritchie46

Hang on, your example where it doesn't work only sorts by t. If you add in both g and t you get this:

df.with_columns(x_lag = pl.col('x').sort_by(['g', 't']).shift(1).over('g'))
shape: (8, 4)
┌─────┬─────┬─────┬───────┐
│ g   ┆ t   ┆ x   ┆ x_lag │
│ --- ┆ --- ┆ --- ┆ ---   │
│ i64 ┆ i64 ┆ i64 ┆ i64   │
╞═════╪═════╪═════╪═══════╡
│ 1   ┆ 1   ┆ 10  ┆ null  │
│ 1   ┆ 2   ┆ 20  ┆ 10    │
│ 1   ┆ 3   ┆ 30  ┆ 20    │
│ 1   ┆ 4   ┆ 40  ┆ 30    │
│ 2   ┆ 4   ┆ 10  ┆ null  │
│ 2   ┆ 3   ┆ 20  ┆ 40    │
│ 2   ┆ 2   ┆ 30  ┆ 30    │
│ 2   ┆ 1   ┆ 40  ┆ 20    │
└─────┴─────┴─────┴───────┘

Which looks correct to me. Columns g, t, and x remain in their initial ordering, and x_lag is computed as if the dataframe were first sorted on g and t and then lagged.

It sounds like you're asking that a single column's sort_by affect other columns' ordering too; in other words, you want the column's expression to perform a dataframe-level sort. To me this sounds both dangerous and super messy. What if two columns request a dataframe-wide sort with different orderings? Who wins?

mcrumiller avatar May 03 '23 20:05 mcrumiller

@CameronBieganek I realize I didn't give enough thought to your original request (I apologize), which was to replicate the query:

SELECT
    g, t, x,
    LAG(x) OVER (
        PARTITION BY g
        ORDER BY t
    )
FROM
    df
ORDER BY
    g, t

There's a big disconnect between SQL and polars here. In SQL, we are returned records which are tuples of the form (g, t, x, lag(x)), whereas in polars we get columns g, t, x, lag(x) that are independent. A SQL SELECT statement returns a sequence of rows, whereas polars expressions return a sequence of columns. The issue here stems from the fact that, in SQL, LAG(x) OVER .... is tupled with specific g, t, and x values, whereas in polars, it's only tupled with those values during the context of the expression, and then those values are discarded.

This is a much more general issue than just this one example.
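
As a concrete illustration of that column independence (a sketch using the df defined at the top of this thread): each expression below is evaluated on its own, so the sorted x is never re-attached to the g and t values it originally sat next to.

df.select(
    pl.col("g"),
    pl.col("t"),
    pl.col("x").sort(),  # sorted independently; g and t keep their original order
)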

mcrumiller avatar May 04 '23 01:05 mcrumiller

One additional thought on top of the point of @mcrumiller.

SQL and Polars use different paradigms, and I like that Polars treats each column as independent from the others: the calculations can be run in parallel, and we don't need to worry that a new column could change (for example, reorder) the existing ones.

That said, I believe that when Polars supports SQL it makes sense to support the SQL paradigm too.

So if SQL engines typically reorder prior columns, then (in my opinion) Polars should do it too.

why?

  • ease of moving SQL code across different engines
  • ease for people to reuse their SQL skills with the Polars engine

lucazanna avatar May 04 '23 13:05 lucazanna

Those are distinct things. Polars with a SQL front end should behave as that SQL flavor.

However, the Polars DSL is different, and expressions may never have side effects. Expressions can serve as building blocks, together with other primitives, to get such behavior. We could decide this requires a new method, or that we have a function that behaves like that in SQL only.

However, we will not change our expression rules because SQL works differently.

The DSL and SQL differ. That doesn't matter as the DSL has all the building blocks required to build an SQL front-end that behaves as expected.

ritchie46 avatar May 04 '23 16:05 ritchie46

Which looks correct to me. Columns g, t, and x remain in their initial ordering, and x_lag is computed as if the dataframe were first sorted on g and t and then lagged.

Just to make sure we're on the same page, the output you show when using sort_by(["g", "t"]) is not the output I'm looking for. The output I'm looking for is described in my second comment above, in response to Ritchie's initial question. (Although it sounds like you've already realized this.) Also, to be clear, I'm not requesting any sorting of the other columns in the dataframe. I'm only requesting to be able to do an ... OVER (PARTITION BY ... ORDER BY ...) to create a new column. (I clarified this in the second comment above.)

The ... OVER (PARTITION BY ... ORDER BY ...) pattern that I describe in my original post is a very common pattern when you are working with grouped time-series data, so it would be nice if Polars had a way of expressing this as a column expression.

In SQL, we are returned records which are tuples of the form (g, t, x, lag(x)), whereas in polars we get columns g, t, x, lag(x) that are independent.

It's funny you should mention that. I was thinking about opening an issue to discuss the independent-columns behavior of Polars. So I've gone ahead and done that now: #8680. The gist of that post is that the independent-columns semantics allows the user to write a wide range of non-sensical queries. However, I acknowledge that this design is pretty deeply embedded in Polars, so it's not likely to change.

CameronBieganek avatar May 04 '23 16:05 CameronBieganek

I'm only requesting to be able to do an ... OVER (PARTITION BY ... ORDER BY ...) to create a new column

but you did create that column, x_lag, and it's identical in both your "incorrect" and "correct" examples. The calculation worked just fine.

mcrumiller avatar May 04 '23 17:05 mcrumiller

The calculation worked just fine.

No. The correct result set (which you can get by first sorting the whole data frame, and then applying the window function) for g==2 is this:

(2, 4, 10, 20)
(2, 3, 20, 30)
(2, 2, 30, 40)
(2, 1, 40, null)

The result set for g==2 returned by using pl.col("x").sort_by(["g", "t"]).shift(1).over("g") is

(2, 4, 10, null)
(2, 3, 20, 40)
(2, 2, 30, 30)
(2, 1, 40, 20)

That is not the same result set.

Aside from sorting the whole dataframe first and then applying the window function, the other way to get the correct result set is to do something goofy like this:

In [20]: df.select(
    ...:     pl.col("g", "t", "x").sort_by(["g", "t"]),
    ...:     x_lag = pl.col("x").sort_by(["g", "t"]).shift(1).over("g")
    ...: )
Out[20]:
shape: (8, 4)
┌─────┬─────┬─────┬───────┐
│ g   ┆ t   ┆ x   ┆ x_lag │
│ --- ┆ --- ┆ --- ┆ ---   │
│ i64 ┆ i64 ┆ i64 ┆ i64   │
╞═════╪═════╪═════╪═══════╡
│ 1   ┆ 1   ┆ 10  ┆ null  │
│ 1   ┆ 2   ┆ 20  ┆ 10    │
│ 1   ┆ 3   ┆ 30  ┆ 20    │
│ 1   ┆ 4   ┆ 40  ┆ 30    │
│ 2   ┆ 1   ┆ 40  ┆ null  │
│ 2   ┆ 2   ┆ 30  ┆ 40    │
│ 2   ┆ 3   ┆ 20  ┆ 30    │
│ 2   ┆ 4   ┆ 10  ┆ 20    │
└─────┴─────┴─────┴───────┘

CameronBieganek avatar May 04 '23 18:05 CameronBieganek

Right, but we're not talking about a result set here, we're talking about column x_lag; that's what polars returns. I understand the rest of the dataframe is different than the SQL approach, but that's what we both expect and want. Otherwise, it would mean that the calculation on one column affects the other columns. And if that's what you want, then what you're asking for is a dataframe operation. Why doesn't the dataframe approach work then? Your final "goofy" query is basically exactly the same thing as the dataframe sort first.

mcrumiller avatar May 04 '23 18:05 mcrumiller

Right, but we're not talking about a result set here

If we're not allowed to talk about result sets, then it becomes quite difficult to talk about table operations in a logical and consistent manner.

Otherwise, it would mean that the calculation on one column affects the other columns. And if that's what you want

For the n-th time, that is not what I want! I don't see why it shouldn't be possible to start with this data frame,

┌─────┬─────┬─────┐
│ g   ┆ t   ┆ x   │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 1   ┆ 1   ┆ 10  │
│ 1   ┆ 2   ┆ 20  │
│ 1   ┆ 3   ┆ 30  │
│ 1   ┆ 4   ┆ 40  │
│ 2   ┆ 4   ┆ 10  │
│ 2   ┆ 3   ┆ 20  │
│ 2   ┆ 2   ┆ 30  │
│ 2   ┆ 1   ┆ 40  │
└─────┴─────┴─────┘

and then tack on a calculated column like this:

┌─────┬─────┬─────┬───────┐
│ g   ┆ t   ┆ x   ┆ x_lag │
│ --- ┆ --- ┆ --- ┆ ---   │
│ i64 ┆ i64 ┆ i64 ┆ i64   │
╞═════╪═════╪═════╪═══════╡
│ 1   ┆ 1   ┆ 10  ┆ null  │
│ 1   ┆ 2   ┆ 20  ┆ 10    │
│ 1   ┆ 3   ┆ 30  ┆ 20    │
│ 1   ┆ 4   ┆ 40  ┆ 30    │
│ 2   ┆ 4   ┆ 10  ┆ 20    │
│ 2   ┆ 3   ┆ 20  ┆ 30    │
│ 2   ┆ 2   ┆ 30  ┆ 40    │
│ 2   ┆ 1   ┆ 40  ┆ null  │
└─────┴─────┴─────┴───────┘

Notice that the original columns are completely unchanged! There are no side-effects of calculating the x_lag column.

CameronBieganek avatar May 04 '23 19:05 CameronBieganek

Sorry, I'm not trying to be antagonistic here; I think some information is getting lost in the back-and-forth. In your most recent example, the original columns are unchanged, but that is not the same frame as your "desired result" above. The "desired result" in your original post changed the ordering of column x.

mcrumiller avatar May 04 '23 19:05 mcrumiller

Unless I'm taking crazy pills, I think you were just missing the second sort_by:

df = pl.DataFrame(dict(
    g = [1, 1, 1, 1, 2, 2, 2, 2],
    t = [1, 2, 3, 4, 4, 3, 2, 1],
    x = [10, 20, 30, 40, 10, 20, 30, 40]
))

df.with_columns(
    x_lag = pl.col("x").sort_by("t").shift(1).over("g").sort_by("g","t")
)
# ┌─────┬─────┬─────┬───────┐
# │ g   ┆ t   ┆ x   ┆ x_lag │
# │ --- ┆ --- ┆ --- ┆ ---   │
# │ i64 ┆ i64 ┆ i64 ┆ i64   │
# ╞═════╪═════╪═════╪═══════╡
# │ 1   ┆ 1   ┆ 10  ┆ null  │
# │ 1   ┆ 2   ┆ 20  ┆ 10    │
# │ 1   ┆ 3   ┆ 30  ┆ 20    │
# │ 1   ┆ 4   ┆ 40  ┆ 30    │
# │ 2   ┆ 4   ┆ 10  ┆ 20    │
# │ 2   ┆ 3   ┆ 20  ┆ 30    │
# │ 2   ┆ 2   ┆ 30  ┆ 40    │
# │ 2   ┆ 1   ┆ 40  ┆ null  │
# └─────┴─────┴─────┴───────┘

And you could then apply a final frame-level sort to this result if you wanted, eg: .sort(by=["g","t"]).
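
Putting that together, a sketch of the full chain with the optional frame-level sort added (mirroring the ORDER BY g, t in the original SQL query):

(
    df.with_columns(
        x_lag=pl.col("x").sort_by("t").shift(1).over("g").sort_by("g", "t")
    )
    .sort("g", "t")  # optional: also order the frame itself by g, t
)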

alexander-beedie avatar May 04 '23 20:05 alexander-beedie

In your most recent example, the original columns are unchanged, but that is not the same frame as your "desired result" above.

Yes, I was somewhat imprecise in the original post. When I said "desired result" in the original post, I meant that sorting the dataframe first and then applying the window function produces the correct result set. The change to the row order in that "desired result" was a side effect that I didn't necessarily want, but at least the result set was correct. However, I clarified in my second comment (the third comment overall in this thread) that the output dataframe I actually would expect to see with proper ORDER BY support is

┌─────┬─────┬─────┬───────┐
│ g   ┆ t   ┆ x   ┆ x_lag │
│ --- ┆ --- ┆ --- ┆ ---   │
│ i64 ┆ i64 ┆ i64 ┆ i64   │
╞═════╪═════╪═════╪═══════╡
│ 1   ┆ 1   ┆ 10  ┆ null  │
│ 1   ┆ 2   ┆ 20  ┆ 10    │
│ 1   ┆ 3   ┆ 30  ┆ 20    │
│ 1   ┆ 4   ┆ 40  ┆ 30    │
│ 2   ┆ 4   ┆ 10  ┆ 20    │
│ 2   ┆ 3   ┆ 20  ┆ 30    │
│ 2   ┆ 2   ┆ 30  ┆ 40    │
│ 2   ┆ 1   ┆ 40  ┆ null  │
└─────┴─────┴─────┴───────┘

CameronBieganek avatar May 04 '23 20:05 CameronBieganek

I think you were just missing the second sort_by

That is interesting! I don't quite get it though... It's definitely not the first thing I would reach for.

CameronBieganek avatar May 04 '23 20:05 CameronBieganek

@CameronBieganek ahh, I see what you're saying now. It looks like we have @alexander-beedie to the rescue.

mcrumiller avatar May 04 '23 20:05 mcrumiller

I have a feeling I could come up with a counter-example to @alexander-beedie's approach. Let me think about it...

CameronBieganek avatar May 04 '23 20:05 CameronBieganek

@CameronBieganek Here's my attempt at an explanation. We start with:

┌─────┬─────┬─────┐
│ g   ┆ t   ┆ x   │
╞═════╪═════╪═════╡
│ 1   ┆ 1   ┆ 10  │
│ 1   ┆ 2   ┆ 20  │
│ 1   ┆ 3   ┆ 30  │
│ 1   ┆ 4   ┆ 40  │
│ 2   ┆ 4   ┆ 10  │
│ 2   ┆ 3   ┆ 20  │
│ 2   ┆ 2   ┆ 30  │
│ 2   ┆ 1   ┆ 40  │
└─────┴─────┴─────┘

The frames shown from this point onward reflect the state inside the with_columns expression context:

df.with_columns(
    x_lag = pl.col("x").sort_by("t")
)

so the df internal to the expression context is:

┌─────┬─────┬─────┐
│ g   ┆ t   ┆ x   │
╞═════╪═════╪═════╡
│ 1   ┆ 1   ┆ 10  │
│ 2   ┆ 1   ┆ 40  │
│ 1   ┆ 2   ┆ 20  │
│ 2   ┆ 2   ┆ 30  │
│ 1   ┆ 3   ┆ 30  │
│ 2   ┆ 3   ┆ 20  │
│ 1   ┆ 4   ┆ 40  │
│ 2   ┆ 4   ┆ 10  │
└─────┴─────┴─────┘

Add the shift with over:

df.with_columns(
    x_lag = pl.col("x").sort_by("t").shift(1).over("g")
)
┌─────┬─────┬──────┐
│ g   ┆ t   ┆ x    │
╞═════╪═════╪══════╡
│ 1   ┆ 1   ┆ null │ *
│ 2   ┆ 1   ┆ null │
│ 1   ┆ 2   ┆ 10   │ *
│ 2   ┆ 2   ┆ 40   │
│ 1   ┆ 3   ┆ 20   │ *
│ 2   ┆ 3   ┆ 30   │
│ 1   ┆ 4   ┆ 30   │ *
│ 2   ┆ 4   ┆ 20   │
└─────┴─────┴──────┘

I marked the g==1 rows so we can view the shift a little more easily. Note that the g that's used in the over() is the g from the original dataframe, not the g from inside the expression context. Finally, re-sort:

df.with_columns(
    x_lag = pl.col("x").sort_by("t").shift(1).over("g").sort_by("g","t")
)
┌─────┬─────┬──────┐
│ g   ┆ t   ┆ x    │
╞═════╪═════╪══════╡
│ 1   ┆ 1   ┆ null │
│ 1   ┆ 2   ┆ 10   │
│ 1   ┆ 3   ┆ 20   │
│ 1   ┆ 4   ┆ 30   │
│ 2   ┆ 1   ┆ 20   │     <--- NOTE: our column `x` is being sorted
│ 2   ┆ 2   ┆ 30   │          by the ACTUAL column g/t in the original df, not by
│ 2   ┆ 3   ┆ 40   │          the df internal to the expression.
│ 2   ┆ 4   ┆ null │
└─────┴─────┴──────┘

Now finally, the expression context returns the modified output column x as x_lag, which gets appended to our original dataframe:

┌─────┬─────┬─────┬───────┐
│ g   ┆ t   ┆ x   ┆ x_lag │
╞═════╪═════╪═════╪═══════╡
│ 1   ┆ 1   ┆ 10  ┆ null  │
│ 1   ┆ 2   ┆ 20  ┆ 10    │
│ 1   ┆ 3   ┆ 30  ┆ 20    │
│ 1   ┆ 4   ┆ 40  ┆ 30    │
│ 2   ┆ 4   ┆ 10  ┆ 20    │
│ 2   ┆ 3   ┆ 20  ┆ 30    │
│ 2   ┆ 2   ┆ 30  ┆ 40    │
│ 2   ┆ 1   ┆ 40  ┆ null  │
└─────┴─────┴─────┴───────┘

Does this make sense?

mcrumiller avatar May 04 '23 20:05 mcrumiller

I updated my above explanation, which was incorrect before. I think I've got it right now. Here is a piece of code that helps illustrate the key point: when an expression references other columns, it uses the columns from the actual external dataframe, not the column as it exists inside the expression chain. Consider the following example:

import polars as pl
from polars import col

df = pl.DataFrame({
    'a': ['a', 'b', 'c', 'd', 'e'],
    'b': [2, 3, 4, 5, 1]
})

df.select(col('a').sort_by('b'))
┌─────┐
│ a   │
╞═════╡
│ e   │
│ a   │
│ b   │
│ c   │
│ d   │
└─────┘

Makes sense. What if we sort twice?

df.select(col('a').sort_by('b').sort_by('b'))
┌─────┐
│ a   │
╞═════╡
│ d   │
│ e   │
│ a   │
│ b   │
│ c   │
└─────┘

It's different! The reason is that both sort_by('b') calls use the original b to sort, not the internally-represented b. Does this help?

mcrumiller avatar May 04 '23 20:05 mcrumiller

Thanks for the explanation. However, I think I found an example where the double sort_by trick doesn't work. Here's the input dataframe:

df = pl.DataFrame(dict(
    g = [1, 1, 1, 1, 2, 2, 2, 2],
    t = [1, 2, 3, 4, 4, 1, 2, 3],
    x = [10, 20, 30, 40, 10, 20, 30, 40]
))
┌─────┬─────┬─────┐
│ g   ┆ t   ┆ x   │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 1   ┆ 1   ┆ 10  │
│ 1   ┆ 2   ┆ 20  │
│ 1   ┆ 3   ┆ 30  │
│ 1   ┆ 4   ┆ 40  │
│ 2   ┆ 4   ┆ 10  │
│ 2   ┆ 1   ┆ 20  │
│ 2   ┆ 2   ┆ 30  │
│ 2   ┆ 3   ┆ 40  │
└─────┴─────┴─────┘

Here's the expected output:

 expected = pl.DataFrame(dict(
    g = [1, 1, 1, 1, 2, 2, 2, 2],
    t = [1, 2, 3, 4, 4, 1, 2, 3],
    x = [10, 20, 30, 40, 10, 20, 30, 40],
    x_lag = [None, 10, 20, 30, 40, None, 20, 30]
))
┌─────┬─────┬─────┬───────┐
│ g   ┆ t   ┆ x   ┆ x_lag │
│ --- ┆ --- ┆ --- ┆ ---   │
│ i64 ┆ i64 ┆ i64 ┆ i64   │
╞═════╪═════╪═════╪═══════╡
│ 1   ┆ 1   ┆ 10  ┆ null  │
│ 1   ┆ 2   ┆ 20  ┆ 10    │
│ 1   ┆ 3   ┆ 30  ┆ 20    │
│ 1   ┆ 4   ┆ 40  ┆ 30    │
│ 2   ┆ 4   ┆ 10  ┆ 40    │
│ 2   ┆ 1   ┆ 20  ┆ null  │
│ 2   ┆ 2   ┆ 30  ┆ 20    │
│ 2   ┆ 3   ┆ 40  ┆ 30    │
└─────┴─────┴─────┴───────┘

And here's the output of the double sort_by:

out = df.with_columns(
    x_lag = pl.col("x").sort_by("t").shift(1).over("g").sort_by("g", "t")
)
┌─────┬─────┬─────┬───────┐
│ g   ┆ t   ┆ x   ┆ x_lag │
│ --- ┆ --- ┆ --- ┆ ---   │
│ i64 ┆ i64 ┆ i64 ┆ i64   │
╞═════╪═════╪═════╪═══════╡
│ 1   ┆ 1   ┆ 10  ┆ null  │
│ 1   ┆ 2   ┆ 20  ┆ 10    │
│ 1   ┆ 3   ┆ 30  ┆ 20    │
│ 1   ┆ 4   ┆ 40  ┆ 30    │
│ 2   ┆ 4   ┆ 10  ┆ 20    │
│ 2   ┆ 1   ┆ 20  ┆ 30    │
│ 2   ┆ 2   ┆ 30  ┆ 40    │
│ 2   ┆ 3   ┆ 40  ┆ null  │
└─────┴─────┴─────┴───────┘

So we see that out is not equal to expected:

In [53]: out == expected
Out[53]:
shape: (8, 4)
┌──────┬──────┬──────┬───────┐
│ g    ┆ t    ┆ x    ┆ x_lag │
│ ---  ┆ ---  ┆ ---  ┆ ---   │
│ bool ┆ bool ┆ bool ┆ bool  │
╞══════╪══════╪══════╪═══════╡
│ true ┆ true ┆ true ┆ true  │
│ true ┆ true ┆ true ┆ true  │
│ true ┆ true ┆ true ┆ true  │
│ true ┆ true ┆ true ┆ true  │
│ true ┆ true ┆ true ┆ false │
│ true ┆ true ┆ true ┆ false │
│ true ┆ true ┆ true ┆ false │
│ true ┆ true ┆ true ┆ false │
└──────┴──────┴──────┴───────┘

I think what we would need to make the double-sorting approach work in general is a method that produces the inverse of the permutation produced by sort_by("t"). Note that (I think) the inverse of the sort_by("t") permutation is not sort_by("t", descending=True). If we had an inverse_sort_by method, then I think the following query would return the correct results for arbitrary dataframes with columns g, t, and x:

df.with_columns(
    x_lag = (
        pl.col("x")
        .sort_by("t")
        .shift(1)
        .inverse_sort_by("t")
        .over("g")
    )
)

Note that I've placed the inverse_sort_by before the over, since we want to ensure that all the sorting and shifting happens on a per partition basis.

I think this explains why the double sort_by approach worked on the previous example. For that example, sort_by("t") and sort_by("g", "t") on the g==2 partition happen to be equivalent to a reverse on the g==2 partition of x. Since reverse is its own inverse permutation, we get the correct result.

At any rate, I think it would be more intuitive and more discoverable to have available something like over("g", order_by="t").

CameronBieganek avatar May 04 '23 23:05 CameronBieganek

I see what you mean here. You do a lot of sorting within your expression, and you want the modified end result to be un-sorted back into the original dataframe's row order. Here is a trick to do the inverse sort: if you add a column of indices 0...N-1, then however crazily you sort your frame, you can always use that column to un-sort, as long as you make sure to sort that column in the same way.

Here's a working version with "un-sort":

def unsort(*args):
    # this gives a column of index 0...N, sorted the same way as we did for x
    return pl.arange(0, pl.count()).sort_by(*args)

df.with_columns(
    col('x')
    .sort_by('g', 't')
    .shift(1).over('g')
    .sort_by(unsort('g', 't'))
    .alias("x_lag")
)
┌─────┬─────┬─────┬───────┐
│ g   ┆ t   ┆ x   ┆ x_lag │
╞═════╪═════╪═════╪═══════╡
│ 1   ┆ 1   ┆ 10  ┆ null  │
│ 1   ┆ 2   ┆ 20  ┆ 10    │
│ 1   ┆ 3   ┆ 30  ┆ 20    │
│ 1   ┆ 4   ┆ 40  ┆ 30    │
│ 2   ┆ 4   ┆ 10  ┆ 40    │
│ 2   ┆ 1   ┆ 20  ┆ null  │
│ 2   ┆ 2   ┆ 30  ┆ 20    │
│ 2   ┆ 3   ┆ 40  ┆ 30    │
└─────┴─────┴─────┴───────┘
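
To see why this works, here's a quick sanity check (a sketch, keeping the pl.arange/pl.count spelling used above; newer Polars spells these pl.int_range/pl.len): the unsort key is the inverse of the sort_by('g', 't') permutation, so sorting an already-sorted column by it restores the original frame order.

idx = pl.arange(0, pl.count())
df.select(
    original=idx,
    # sort forward, then "unsort": this round-trips back to 0..N-1
    round_trip=idx.sort_by("g", "t").sort_by(unsort("g", "t")),
)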

mcrumiller avatar May 04 '23 23:05 mcrumiller

Nice!

CameronBieganek avatar May 05 '23 03:05 CameronBieganek

@mcrumiller: Nice one indeed - while I had a good night's sleep, you have done a great job explaining it!

The TLDR is, yes, the result you initially get back is in the requested calculation order (according to the given partition/sort), and what you want is to map that back into frame order (hence the second/final sort_by). Your unsort looks like a great generalisation for this purpose... kudos :)

alexander-beedie avatar May 05 '23 05:05 alexander-beedie

@CameronBieganek If you wanted to go a step further, you can use polars' ability to register namespaces and create a class that tracks all sort operations, so that unsort works like magic. Whenever you would call sort_by, call .save.sort_by(...) instead, and at the end just call .save.unsort():

import polars as pl
from polars import col

@pl.api.register_expr_namespace("save")
class Unsort:
    # this is a class-level variable so that it lives for the duration of the chained expression
    # it is not thread-safe
    _sort_args = []
    
    def __init__(self, expr):
        self._expr = expr
        
    def sort_by(self, *args, **kwargs):
        # save the sort arguments as a tuple of (args, kwargs)
        self._sort_args.append((args, kwargs))
        
        return self._expr.sort_by(*args,  **kwargs)
    
    def unsort(self):
        """Undo sorting"""
        # our ID vector that we will sort the same way the column has so far been sorted
        expr = pl.arange(0, pl.count())
        
        # apply to our ID vector the sorting so far done to the column
        for args in self._sort_args:
            expr = expr.sort_by(*args[0], **args[1])
            
        # reset for future use
        self._sort_args.clear()  # mutate the shared class-level list so it is actually emptied
        
        # sort the expression by the ID column
        return self._expr.sort_by(expr)
        

df = pl.DataFrame(dict(
    g = [1, 1, 1, 1, 2, 2, 2, 2],
    t = [1, 2, 3, 4, 4, 1, 2, 3],
    x = [10, 20, 30, 40, 10, 20, 30, 40]
))


# print original
print(df)

# sort a bunch, then unsort. It should be the same.
print(
    df.with_columns(col('g').save.sort_by(['g', 't']).save.sort_by('t').save.sort_by('x')
    .save.unsort())
)
shape: (8, 3)
┌─────┬─────┬─────┐
│ g   ┆ t   ┆ x   │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 1   ┆ 1   ┆ 10  │
│ 1   ┆ 2   ┆ 20  │
│ 1   ┆ 3   ┆ 30  │
│ 1   ┆ 4   ┆ 40  │
│ 2   ┆ 4   ┆ 10  │
│ 2   ┆ 1   ┆ 20  │
│ 2   ┆ 2   ┆ 30  │
│ 2   ┆ 3   ┆ 40  │
└─────┴─────┴─────┘
shape: (8, 3)
┌─────┬─────┬─────┐
│ g   ┆ t   ┆ x   │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 1   ┆ 1   ┆ 10  │
│ 1   ┆ 2   ┆ 20  │
│ 1   ┆ 3   ┆ 30  │
│ 1   ┆ 4   ┆ 40  │
│ 2   ┆ 4   ┆ 10  │
│ 2   ┆ 1   ┆ 20  │
│ 2   ┆ 2   ┆ 30  │
│ 2   ┆ 3   ┆ 40  │
└─────┴─────┴─────┘

@ritchie46 or @alexander-beedie , while writing this, I realized I needed to make _sort_args class-level, as every call to .save creates a new class instance, which makes this not thread-safe. Is there a way to identify an "expression chain" by some sort of ID? Is this something that would be easy to implement? In other words, when an expression is chained from a prior expression, use that expression's ID, so that we can identify a chain.

mcrumiller avatar May 05 '23 12:05 mcrumiller

The Unsort class is cool, but I think it would be more intuitive and discoverable to have a syntax like over("g", order_by="t") (or some variant on that). ... OVER (PARTITION BY ... ORDER BY ...) is a very commonly needed idiom when working with grouped time series data, so it seems worthwhile to have a dedicated syntax for it.

CameronBieganek avatar May 05 '23 13:05 CameronBieganek

I agree with @CameronBieganek and I also think that having the window function accept a parameter such as order_by would be quite useful.

On this topic - what about adding range_between and rows_between as parameters in addition to order_by?

That way, the window function would be aligned with PySpark in terms of functionality: https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.Window.html

I know that rows_between is possible using shift and a list comprehension, but having it as a parameter would be a nice way to simplify the syntax.
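
For instance, here is a minimal sketch of that shift-based workaround (the column names and window size are made up for illustration, and pl.sum_horizontal assumes a reasonably recent Polars):

import polars as pl

df = pl.DataFrame({"g": [1, 1, 1, 2, 2], "x": [1, 2, 3, 4, 5]})

# Emulate ROWS BETWEEN 2 PRECEDING AND CURRENT ROW per group using
# shift + a list comprehension; the first rows of each group simply
# see fewer than `window` values.
window = 3
df.with_columns(
    x_win_sum=pl.sum_horizontal(
        [pl.col("x").shift(i).over("g") for i in range(window)]
    )
)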

This would make the window function even more powerful, especially when combined with the other expressions that Polars offers.

lucazanna avatar May 05 '23 13:05 lucazanna

Edit: Looks like this doesn't quite work, disregard.


@CameronBieganek yeah, I'm realizing what the main request here is. over(g) depends on the ordering of g. You can do this directly by sorting g in the over call:

df.with_columns(
    # note we have to include column g in the sort_by as well, that's ok
    x_lag = col('x').shift(1).over(col('g').sort_by(['g', 't']))
)
┌─────┬─────┬─────┬───────┐
│ g   ┆ t   ┆ x   ┆ x_lag │
╞═════╪═════╪═════╪═══════╡
│ 1   ┆ 1   ┆ 10  ┆ null  │
│ 1   ┆ 2   ┆ 20  ┆ 10    │
│ 1   ┆ 3   ┆ 30  ┆ 20    │
│ 1   ┆ 4   ┆ 40  ┆ 30    │
│ 2   ┆ 4   ┆ 10  ┆ null  │
│ 2   ┆ 1   ┆ 20  ┆ 10    │
│ 2   ┆ 2   ┆ 30  ┆ 20    │
│ 2   ┆ 3   ┆ 40  ┆ 30    │
└─────┴─────┴─────┴───────┘

This should be quite a bit simpler.

mcrumiller avatar May 05 '23 13:05 mcrumiller

Bump. Supporting an ORDER BY concept in .over() is essential to being able to write arbitrary window functions in Polars. Right now the only option is to sort the entire dataframe before applying the window function, which is a suboptimal solution.

CameronBieganek avatar Oct 30 '23 19:10 CameronBieganek

In order to work around this issue, I've been sorting the entire dataframe first, and then running my over expressions, like this:

In [1]: import polars as pl

In [2]: df = pl.DataFrame(
   ...:     dict(
   ...:         group = [1, 1, 1, 2, 2, 2],
   ...:         time = [3, 2, 1, 1, 2, 3],
   ...:         x = [11, 12, 13, 14, 15, 16]
   ...:     )
   ...: ).lazy()

In [3]: df.collect()
Out[3]: 
shape: (6, 3)
┌───────┬──────┬─────┐
│ group ┆ time ┆ x   │
│ ---   ┆ ---  ┆ --- │
│ i64   ┆ i64  ┆ i64 │
╞═══════╪══════╪═════╡
│ 1     ┆ 3    ┆ 11  │
│ 1     ┆ 2    ┆ 12  │
│ 1     ┆ 1    ┆ 13  │
│ 2     ┆ 1    ┆ 14  │
│ 2     ┆ 2    ┆ 15  │
│ 2     ┆ 3    ┆ 16  │
└───────┴──────┴─────┘

In [4]: (
   ...:     df
   ...:     .sort("group", "time")
   ...:     .with_columns(
   ...:         x_cum_sum = pl.col("x").cum_sum().over("group")
   ...:     )
   ...:     .collect()
   ...: )
Out[4]:
shape: (6, 4)
┌───────┬──────┬─────┬───────────┐
│ group ┆ time ┆ x   ┆ x_cum_sum │
│ ---   ┆ ---  ┆ --- ┆ ---       │
│ i64   ┆ i64  ┆ i64 ┆ i64       │
╞═══════╪══════╪═════╪═══════════╡
│ 1     ┆ 1    ┆ 13  ┆ 13        │
│ 1     ┆ 2    ┆ 12  ┆ 25        │
│ 1     ┆ 3    ┆ 11  ┆ 36        │
│ 2     ┆ 1    ┆ 14  ┆ 14        │
│ 2     ┆ 2    ┆ 15  ┆ 29        │
│ 2     ┆ 3    ┆ 16  ┆ 45        │
└───────┴──────┴─────┴───────────┘

However, I don't know if that is safe. Is the optimizer allowed to move the sort to after the with_columns? This wouldn't be a concern if we had an order_by keyword for over. So, I have two requests:

  • Can we please add an order_by keyword to over? It is essential for many grouped, time-series operations.
  • In the meantime, can someone tell me whether the above code is safe? In other words, is it guaranteed that the optimizer won't move the sort to after the with_columns?

CameronBieganek avatar Jan 31 '24 17:01 CameronBieganek