Daft Global Expressions: improved Aggregation syntax

Is your feature request related to a problem? Please describe.

Currently, Daft aggregation syntax is a little loose and is modelled after PyArrow.

df.agg([(col("a"), "agg-string")])

This has a few issues:

Hard to document - I believe we actually don't have documentation anywhere at the moment about the full list of aggregation options available here.
Hard/unintuitive to assign new names to the aggregated column: currently users have to do (col("a").alias("a_count"), "count") in order to avoid errors if grouping by column "a". Not doing so yields a pretty confusing error message: Expressions must all have unique names; saw a twice
Misspelling the aggregation string also yields a confusing error: NotImplementedError: LogicalPlan construction for operation not implemented: foo. A full list of options is not presented to the user.

Proposal

# Syntax: df.agg(<aggregated_colname>=<agg_expression>)
df.agg(
    a_count=col("a").agg.count(),
    b_min=col("b").agg.min(),
)

Documentation can be under the Expression.agg namespace
Users assign a new name to the aggregated column using the kwarg's name
No misspelling the aggregation string, since they are now method names and not arbitary strings

Jun 05 '23 13:06 jaychia

cc @@skrawcz as well

Jan 23 '24 21:01 jaychia

@jaychia for rolling/window support you could build an API like this:

df.agg( # if rows are weeks...
    a_3wk_rolling_mean=col("a").agg.mean().over(3),
    a_7wk_rolling_mean=col("a").agg.mean().over(7),
)

Related to aggregations -- being able to use them in an expression would be nice, so something like:

df.select(
   a=col("a"), 
   a_zero_mean=(col("a") - col("a").agg.mean())
) # or with cols, and then a select...

Jan 23 '24 22:01 skrawcz

@kevinzwang @jaychia Do we think we can close this now?

Aug 29 '24 00:08 samster25

Yep, this has been resolved by #2000. Closing now!

Aug 29 '24 01:08 kevinzwang