ibis
ibis copied to clipboard
bug: make `order_by()` have an effect after `group_by()` when aggregating
What happened?
Started in https://github.com/ibis-project/ibis/issues/9710, but this issue is for one specific fix.
import ibis
from ibis import _
ibis.options.interactive = True
orders = ibis.memtable({"cust": [1, 1, 1, 2, 2], "date": [3, 2, 1, 9, 8], "amount": [1, 2, 3, 4, 5]})
e2 = orders.group_by("cust").order_by("date").agg(earliest_first=_.amount.collect())
print(ibis.to_sql(e2))
yields (note there is no ORDER BY)
SELECT
"t0"."cust",
ARRAY_AGG("t0"."amount") FILTER(WHERE
"t0"."amount" IS NOT NULL) AS "earliest_first"
FROM "ibis_pandas_memtable_c6coi73fkzerleojlnpmmwq2va" AS "t0"
GROUP BY
1
I want the agg to be ordered.
Question: Where should the ORDER BY be inserted? Do we need to push this down into individual aggs, ie
SELECT
"t0"."cust",
ARRAY_AGG("t0"."amount" ORDER BY "t0"."date" ASC) FILTER(WHERE
"t0"."amount" IS NOT NULL) AS "earliest_first"
FROM "ibis_pandas_memtable_c6coi73fkzerleojlnpmmwq2va" AS "t0"
GROUP BY
1
?
If so, then this might have to depend upon https://github.com/ibis-project/ibis/issues/9170 to land first?
If I understand #9710 correctly, if we did the ORDER BY in a subquery before the groupby, ie
SELECT
"t1"."cust",
ARRAY_AGG("t1"."amount") FILTER(WHERE
"t1"."amount" IS NOT NULL) AS "earliest_first"
FROM (
SELECT
*
FROM "ibis_pandas_memtable_c6coi73fkzerleojlnpmmwq2va" AS "t0"
ORDER BY
"t0"."date" ASC
) AS "t1"
GROUP BY
1
then some backends don't give any guarantees that this works as expected.
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
Thanks for raising the issue.
Personally, I prefer to the option 1: insert the orderings into the individual aggs for the following reasons:
- it represents the users intent (wants the rows in
aggordered) and guarantees the ordering for the aggs - it aligns with the
mutateandselectforGroupedTable
We need to wait #9170 merged.
How do you think @cpcloud @gforsyth
I think that makes sense. Aligning with the "projected" form of this construction for consistency seems like a good thing to have.
The tricky bits here are going to be ignoring aggregates that don't support ordering.
One approach to make that easier might be to allow every aggregate operation to be ordered, i.e., give the aggregation base class an order_by field, but only expose an order_by argument where it makes sense, e.g. collect would have it, sum wouldn't. This would make the op changes relatively minor, which the main changes being in the compilers.
Make sure to look at how ordering key operations are spelled as inputs to the Sort operation. That should give some idea of how to spell them for aggregations.
Also make sure to look at what sqlglot does, with sg.parse_one('a query that contains order by in the agg') to see how to construct the various ordering objects it requires.
I think I disagree here, but I also don't really understand why our GroupedTable object exposes an order_by method at all. Looking through the code it's not clear to me when the orderings attribute on the expr comes into effect. What's the current use of this method (probably a question for @cpcloud)?
In my mental model, an order_by on a table-like tells how to order the output table. In contrast, the ordering in an aggregate like collect describes how to perform the aggregate internally, but has no effect on the output ordering in the larger expression. This means that you can order an aggregate like collect by something completely different than the output table ordering (this example based on a WIP branch for #9170):
In [1]: import ibis
In [2]: t = ibis.table({"x": "string", "y": "string", "z": "string"})
In [3]: from ibis import _
In [4]: expr = (
...: t
...: .group_by("x")
...: .agg(ys=_.y.collect(order_by="z")) # order the collect agg by z
...: .order_by("x") # order the output table by x
...: )
In [5]: ibis.to_sql(expr)
Out[5]:
SELECT
*
FROM (
SELECT
"t0"."x",
ARRAY_AGG("t0"."y" ORDER BY "t0"."z" ASC) FILTER(WHERE
"t0"."y" IS NOT NULL) AS "ys"
FROM "unbound_table_0" AS "t0"
GROUP BY
1
) AS "t1"
ORDER BY
"t1"."x" ASC
If I understand the semantics @cpcloud is describing above, the above query would be equivalent to:
In [6]: expr = (
...: t
...: .group_by("x")
...: .order_by("z") # order the aggregation by z
...: .agg(ys=_.y.collect()) # collect now implicitly ordered by z?
...: .order_by("x") # order the output table by x
...: )
I personally would find this confusing. We don't support implicit persistent ordering in the non-grouped context (t.order_by("z").agg(ys=_.y.collect()) won't implicitly use the z ordering), doing so in the grouped context would feel inconsistent and surprising. I also find explicitly passing the ordering that an aggregate should use when performing it clearer that the ordering affects the aggregation result, but not the output ordering of the table itself.
Have similar feeling, Not sure if it is good to remove this order_by() method and orderings attribute from the GroupedTable because:
- It is confusing where the ordering should be performed (inside or outside aggregation), if we have to keep the
order_by, I prefer to keep it inside the aggs. - Now we'll explicitly support
order_byinagg.
but I also don't really understand why our GroupedTable object exposes an order_by method at all
The current use case of
t.group_by(g).order_by(o)
is automatic windowization (broadcasting in numpy lingo) of any aggregates in a subsequent mutate call.
t.group_by(g).order_by(o).mutate(t.x.sum())
desugars to
t.mutate(t.x.sum().over(group_by=g, order_by=o))
This is taken directly from dplyr's mutate + arrange semantics.
... This is taken directly from dplyr's
mutate+arrangesemantics.
I personally find the desugared version easier to read and understand, but :shrug:.
If we want to keep this method around, one "fix" would be to error on a method that converts a GroupedTable to a Table if there are orderings specfied but not used. So t.group_by(g).order_by(o).mutate(...) would still work, but t.group_by(g).order_by(o).agg(...) would error in an informative way telling the user what they probably wanted to do. This is easy to do, keeps our API explicit, and lets the user know that the order_by method they called isn't doing anything.
IMO it seems odd to have two nearly identical constructions, one that has well-defined behavior, and one that errors even though it could have well-defined behavior.
I think I disagree that it has well defined/consistent behavior. There are many implications from having an order_by(...).agg(...) propagate the previous ordering into the aggregates:
- We'd need to do it in both the grouped and ungrouped aggregation contexts (if
t.group_by("c").order_by("a").agg(bs=_.b.collect())makes the aggregate ordered bya, thent.order_by("a").agg(bs=_.b.collect())should as well). - What about passing an aggregation to a
mutatecall (t.order_by("a").mutate(bs=_.b.collect()))? Is that collect ordered bya? - What about calls between the
order_byandagg? Does the order still propagate? We're entering pandas/polars implicit ordering semantics, which is something we've avoided so far.
could we do this
t.group_by("c").mutate(bs=_.b.collect(order_by=_.a))
e2 = orders.group_by("cust").mutate(earliest_first=_.amount.collect(order_by=_.date))
loh, it is no
SELECT
"t0"."cust",
"t0"."date",
"t0"."amount",
ARRAY_AGG("t0"."amount") FILTER(WHERE
"t0"."amount" IS NOT NULL) OVER (PARTITION BY "t0"."cust" ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS "earliest_first"
FROM "ibis_pandas_memtable_45dx6ahfyfbithx2xczy7lqmc4" AS "t0"