ibis feat(api): support `rollup`, `cube` and `grouping

Description of changes

WIP PR implementing support for rollup, cube and grouping sets.

For background on what these are used for and how they work, see https://www.postgresql.org/docs/current/queries-table-expressions.html#QUERIES-GROUPING-SETS.

In discussing the implementation, I'm going to refer to groupings sets only, but everything applies to rollups and cubes as well (rollups and cubes are shorthand for a longer form GROUPING SET).

These concepts represent a new kind of "thing" for Ibis. They are neither column nor scalar expressions that can be evaluated, nor are they tables really.

The approach I took here was to add a non-Value operation for each construct. These are desugared into tuples of tuples of Value expressions (because multiple grouping sets specifications are allowed) wherever grouping sets are supported.

One tricky thing is that I had to partition the key specifically requested in group by, as distinct from those unique value expressions in grouping sets, to support constructs like this:

SELECT a, b
FROM t
GROUP BY ROLLUP (a, b)

which is not equivalent to

SELECT a, b
FROM t
GROUP BY a, b, ROLLUP (a, b)

Right now, I'm looking for feedback on the approach before fleshing out the test suite.

Issues closed

Aug 27 '24 16:08 cpcloud

This seems like a reasonable approach to me.
Small feedback that can always be addressed later:

I think it would be nice to allow passing in bare strings, values, or deferreds to grouping_sets, along with tuples of them, e.g.

ibis.grouping_sets(("a", "b"), "a", "b")

this would better mimic the SQL syntax (god, i can't believe I'm suggesting that...) and also avoids the awkward spelling of

ibis.grouping_sets(("a", "b"), ("a",), ("b",))

And on the naming front, I thing we should take a pointer from DuckDB and rename grouping to grouping_id -- I think grouping is going to cause confusion.

but overall seems solid!

Aug 27 '24 21:08 gforsyth

All good points, definitely changing grouping -> grouping_id. I'm half tempted to call it group_id, those extra three characters are killing me.

Aug 28 '24 01:08 cpcloud

For a full featured and correct implementation, grouping sets will depend on the earliest version of sqlglot contains https://github.com/tobymao/sqlglot/pull/3985.

Aug 28 '24 15:08 cpcloud

Ok, now that I think about this some more, I think having separate ibis.rollup/grouping_sets/cube objects adds unnecessary complication.

What do people think about making these keyword arguments to group_by? E.g.,

t.group_by(
    ...,
    rollup=(_.a, "b"),
    grouping_sets=("a", "b", ("a", "b")),
    cube=(_.c, _.d)
)

The separate objects would make sense if there were some way to use this functionality outside of a group by, but AFAICT they are strictly only valid in that case.

This would also eliminate having to do the partitioning, since the API requires the user spell it out.

Aug 28 '24 20:08 cpcloud

This would break users who having columns named rollup, cube, or grouping_sets but that seems okay?

Aug 28 '24 20:08 cpcloud

What do people think about making these keyword arguments to group_by?

This would break users who having columns named rollup, cube, or grouping_sets but that seems okay?

I slightly dislike that. Not because it might break existing users (seems unlikely), but because it's harder to visually distinguish keyword-arguments serving as aliases in group_by from those specific kwargs.

t.group_by(one=t.a, cube=(t.b,), three=t.c)  # intentionally terrible formatting

Another potential downside is it doesn't let users as easily use multiple of these constructs in the same query. Following this line from the docs:

If multiple grouping items are specified in a single GROUP BY clause, then the final list of grouping sets is the cross product of the individual items.

With separate objects, I'd to be able to pass in multiple of the same kind of object, while kwargs would force the user to handle the composition of them themselves (confession - I'm new to this whole concept, so apologies if this is a dumb example):

t.group_by(ibis.cube("a", "b"), ibis.cube("c", "d"))

Individual objects also may be easier for programmatic usage - a system would only need a single list of things to group by, rather than splitting out the various grouping set kind of things.

None of these are blockers of course, and all can be worked around. Just noting a few potential downsides.

Aug 28 '24 20:08 jcrist

Hm, all good points. I think I'll stick with the objects for now.

Aug 28 '24 20:08 cpcloud

Keeping this as draft until I can at least add some less sophomoric tests.

Sep 05 '24 12:09 cpcloud

I wish I hadn't learned this while working through reviewing this, but I guess it's valid to pass a ROLLUP or a CUBE as an element of a GROUPING SET? I don't think that works right now, and maybe we leave that for a follow-up, but I thought we should note it down for ourselves.

Sep 20 '24 18:09 gforsyth

Which of course means you can nest grouping sets arbitrarily.

D create table t (a int, b int);
D select 1 from t group by grouping sets (rollup (a, b), a, grouping sets (cube (a, b)));
┌───────┐
│   1   │
│ int32 │
├───────┤
│     1 │
│     1 │
└───────┘

Sep 20 '24 18:09 cpcloud

This is a special level of hell.

Sep 20 '24 18:09 cpcloud

Another "fun" thing with these constructs is that some backends implement rollup/cube/grouping sets on an empty table such that the "grand total" group (usually spelled ()) return zero rows, and some return one row.

For example:

create table t (c int);
select * from t group by rollup (c);

Jan 02 '25 16:01 cpcloud

I think I'm going to cut this from the 10.0 milestone since it's not a breaking change and we need to get a release out.

Jan 22 '25 09:01 cpcloud

feat(api): support `rollup`, `cube` and `grouping_sets` APIs

Description of changes

Issues closed