Daft
Global Expressions: improved Aggregation syntax
Is your feature request related to a problem? Please describe.
Currently, Daft aggregation syntax is a little loose and is modelled after PyArrow.
df.agg([(col("a"), "agg-string")])
This has a few issues:
- Hard to document: I believe we currently don't have documentation anywhere listing the full set of aggregation options available here.
- Hard/unintuitive to assign new names to the aggregated column: currently users have to do (col("a").alias("a_count"), "count") in order to avoid errors if grouping by column "a". Not doing so yields a pretty confusing error message: Expressions must all have unique names; saw a twice
- Misspelling the aggregation string also yields a confusing error: NotImplementedError: LogicalPlan construction for operation not implemented: foo. A full list of options is not presented to the user.
Proposal
# Syntax: df.agg(<aggregated_colname>=<agg_expression>)
df.agg(
    a_count=col("a").agg.count(),
    b_min=col("b").agg.min(),
)
- Documentation can be under the Expression.agg namespace
- Users assign a new name to the aggregated column using the kwarg's name
- No misspelling the aggregation string, since they are now method names and not arbitrary strings
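To make the proposal concrete, here is a minimal toy sketch of how a kwarg-based agg with an .agg namespace could behave. It is not Daft's implementation: Col, AggNamespace, AggExpr, and the standalone agg function over a plain dict of lists are all hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class AggExpr:
    """An aggregation bound to one column (toy stand-in, not a Daft class)."""
    column: str
    fn: Callable[[List[float]], float]


class AggNamespace:
    """Hypothetical `.agg` namespace: every aggregation is a real method."""

    def __init__(self, column: str) -> None:
        self._column = column

    def count(self) -> AggExpr:
        return AggExpr(self._column, len)

    def min(self) -> AggExpr:
        return AggExpr(self._column, min)

    def mean(self) -> AggExpr:
        return AggExpr(self._column, lambda xs: sum(xs) / len(xs))


class Col:
    """Toy column expression exposing the `.agg` namespace."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.agg = AggNamespace(name)


def col(name: str) -> Col:
    return Col(name)


def agg(table: Dict[str, list], **named_aggs: AggExpr) -> Dict[str, float]:
    """Toy version of df.agg: each keyword becomes the output column name."""
    return {out: expr.fn(table[expr.column]) for out, expr in named_aggs.items()}


table = {"a": [1, 2, 3], "b": [5.0, 4.0, 6.0]}
result = agg(table, a_count=col("a").agg.count(), b_min=col("b").agg.min())
print(result)  # {'a_count': 3, 'b_min': 4.0}
```

In this sketch a misspelled aggregation such as col("a").agg.cuont() fails immediately with an AttributeError naming the missing method, and the output name can never collide with the input column unless the user picks the same keyword.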
cc @skrawcz as well
@jaychia for rolling/window support you could build an API like this:
df.agg(  # if rows are weeks...
    a_3wk_rolling_mean=col("a").agg.mean().over(3),
    a_7wk_rolling_mean=col("a").agg.mean().over(7),
)
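For reference, one possible reading of what .over(3) would compute, sketched on a plain Python list. The semantics here are assumptions, since the comment does not specify them: a trailing window over the current and previous rows, with None for rows that do not yet have a full window. rolling_mean is a hypothetical helper, not a Daft function.

```python
from typing import List, Optional


def rolling_mean(values: List[float], window: int) -> List[Optional[float]]:
    """Trailing rolling mean; rows without a full window yield None (assumed)."""
    if window <= 0:
        raise ValueError("window must be a positive integer")
    out: List[Optional[float]] = []
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            # Drop the value that just slid out of the window.
            running -= values[i - window]
        out.append(running / window if i >= window - 1 else None)
    return out


weekly = [1.0, 2.0, 3.0, 4.0, 5.0]  # one row per week
print(rolling_mean(weekly, 3))  # [None, None, 2.0, 3.0, 4.0]
```

Unlike a plain aggregation, this returns one value per input row, so an .over(n) API would need to produce a column rather than a single aggregated row.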
Related to aggregations -- being able to use them in an expression would be nice, so something like:
df.select(
    a=col("a"),
    a_zero_mean=(col("a") - col("a").agg.mean())
)  # or with cols, and then a select...
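The intended result of that select, as I read it, is that the aggregate is computed once over the whole column and then broadcast back to every row. A rough sketch on a dict of lists, where select_zero_mean is a hypothetical helper and not part of any Daft API:

```python
from typing import Dict, List


def select_zero_mean(table: Dict[str, List[float]], column: str) -> Dict[str, List[float]]:
    """Return the column plus a mean-centred copy of it."""
    values = table[column]
    mean = sum(values) / len(values)  # scalar aggregate over the full column
    return {
        column: list(values),
        # The scalar mean is broadcast against every row of the column.
        f"{column}_zero_mean": [v - mean for v in values],
    }


table = {"a": [1.0, 2.0, 6.0]}
print(select_zero_mean(table, "a"))
# {'a': [1.0, 2.0, 6.0], 'a_zero_mean': [-2.0, -1.0, 3.0]}
```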
@kevinzwang @jaychia Do we think we can close this now?
Yep, this has been resolved by #2000. Closing now!