feat(api): support `rollup`, `cube` and `grouping_sets` APIs
Description of changes
WIP PR implementing support for rollup, cube and grouping sets.
For background on what these are used for and how they work, see https://www.postgresql.org/docs/current/queries-table-expressions.html#QUERIES-GROUPING-SETS.
In discussing the implementation, I'm going to refer to groupings sets only,
but everything applies to rollups and cubes as well (rollups and cubes are
shorthand for a longer form GROUPING SET).
These concepts represent a new kind of "thing" for Ibis. They are neither column nor scalar expressions that can be evaluated, nor are they tables really.
The approach I took here was to add a non-Value operation for each construct.
These are desugared into tuples of tuples of Value expressions (because
multiple grouping sets specifications are allowed) wherever grouping sets are
supported.
One tricky thing is that I had to partition the key specifically requested in group by, as distinct from those unique value expressions in grouping sets, to support constructs like this:
SELECT a, b
FROM t
GROUP BY ROLLUP (a, b)
which is not equivalent to
SELECT a, b
FROM t
GROUP BY a, b, ROLLUP (a, b)
Right now, I'm looking for feedback on the approach before fleshing out the test suite.
Issues closed
This seems like a reasonable approach to me.
Small feedback that can always be addressed later:
I think it would be nice to allow passing in bare strings, values, or deferreds to grouping_sets, along with tuples of them, e.g.
ibis.grouping_sets(("a", "b"), "a", "b")
this would better mimic the SQL syntax (god, i can't believe I'm suggesting that...) and also avoids the awkward spelling of
ibis.grouping_sets(("a", "b"), ("a",), ("b",))
And on the naming front, I thing we should take a pointer from DuckDB and rename grouping to grouping_id -- I think grouping is going to cause confusion.
but overall seems solid!
All good points, definitely changing grouping -> grouping_id. I'm half tempted to call it group_id, those extra three characters are killing me.
For a full featured and correct implementation, grouping sets will depend on the earliest version of sqlglot contains https://github.com/tobymao/sqlglot/pull/3985.
Ok, now that I think about this some more, I think having separate ibis.rollup/grouping_sets/cube objects adds unnecessary complication.
What do people think about making these keyword arguments to group_by? E.g.,
t.group_by(
...,
rollup=(_.a, "b"),
grouping_sets=("a", "b", ("a", "b")),
cube=(_.c, _.d)
)
The separate objects would make sense if there were some way to use this functionality outside of a group by, but AFAICT they are strictly only valid in that case.
This would also eliminate having to do the partitioning, since the API requires the user spell it out.
This would break users who having columns named rollup, cube, or grouping_sets but that seems okay?
What do people think about making these keyword arguments to
group_by?This would break users who having columns named
rollup,cube, orgrouping_setsbut that seems okay?
I slightly dislike that. Not because it might break existing users (seems unlikely), but because it's harder to visually distinguish keyword-arguments serving as aliases in group_by from those specific kwargs.
t.group_by(one=t.a, cube=(t.b,), three=t.c) # intentionally terrible formatting
Another potential downside is it doesn't let users as easily use multiple of these constructs in the same query. Following this line from the docs:
If multiple grouping items are specified in a single GROUP BY clause, then the final list of grouping sets is the cross product of the individual items.
With separate objects, I'd to be able to pass in multiple of the same kind of object, while kwargs would force the user to handle the composition of them themselves (confession - I'm new to this whole concept, so apologies if this is a dumb example):
t.group_by(ibis.cube("a", "b"), ibis.cube("c", "d"))
Individual objects also may be easier for programmatic usage - a system would only need a single list of things to group by, rather than splitting out the various grouping set kind of things.
None of these are blockers of course, and all can be worked around. Just noting a few potential downsides.
Hm, all good points. I think I'll stick with the objects for now.
Keeping this as draft until I can at least add some less sophomoric tests.
I wish I hadn't learned this while working through reviewing this, but I guess it's valid to pass a ROLLUP or a CUBE as an element of a GROUPING SET? I don't think that works right now, and maybe we leave that for a follow-up, but I thought we should note it down for ourselves.
Which of course means you can nest grouping sets arbitrarily.
D create table t (a int, b int);
D select 1 from t group by grouping sets (rollup (a, b), a, grouping sets (cube (a, b)));
┌───────┐
│ 1 │
│ int32 │
├───────┤
│ 1 │
│ 1 │
└───────┘
This is a special level of hell.
Another "fun" thing with these constructs is that some backends implement rollup/cube/grouping sets on an empty table such that the "grand total" group (usually spelled ()) return zero rows, and some return one row.
For example:
create table t (c int);
select * from t group by rollup (c);
I think I'm going to cut this from the 10.0 milestone since it's not a breaking change and we need to get a release out.