RFC: Make it easier to call window functions via expression API (and add example)
before polishing this PR up I wanted to get some feedback on the approach
Which issue does this PR close?
Related to https://github.com/apache/arrow-datafusion/issues/5781 Closes: https://github.com/apache/arrow-datafusion/issues/6747
Rationale for this change
@mustafasrepo suggested adding an example for DataFrame::window here: https://github.com/apache/arrow-datafusion/pull/6703#discussion_r1238403292
When I tried to do so it turns out that creating an Expr::Window is quite challenging, so I took a step back and tried to clean up the API a little
What changes are included in this PR?
- Change
WindowExprto be a builder style interface and thus making it easier to constructExpr::WindowFunctionvariants. This is modeled after theCaseBuilder - Add the
lead,lag, etc. inexpr_fnsto follow the model of the other types of functions
Are these changes tested?
Yes
TOOD: I need to write tests for all the code in expr_fns which I will add to https://github.com/apache/arrow-datafusion/blob/main/datafusion/core/tests/dataframe/dataframe_functions.rs if people like this basic approach
Are there any user-facing changes?
yes, the APIs to create window functions are improved
As a new user, I like this change very much. I found creating lead and lag functions to be difficult to use. Suppose I wanted to call lead but provide the other two optional arguments - shift offset and default value. Would the best way to do that with this proposal to create the WindowFunction directly?
let single_lead = lead(col("mycol"));
let two_lead_with_default = expr::WindowFunction::new(BuiltInWindowFunction::Lead, vec![col("mycol"), lit(2), lit(mydefault)]);
As a new user, I like this change very much. I found creating lead and lag functions to be difficult to use. Suppose I wanted to call lead but provide the other two optional arguments - shift offset and default value. Would the best way to do that with this proposal to create the
WindowFunctiondirectly?
I think it would look something like this
let single_lead = lead(col("mycol"))
.
let two_lead_with_default = lead.call(col("mycol"))
.offset(lit(2))
.partition_by(lit(mydefault)]))
.build()
I'm looking at the arguments we end up passing. I think cume_dist takes no arguments. Lead and lag take anywhere between 1 and 3. nth_value takes two. I think it might be easier to use with something like
pub fn lead(arg: Expr, offset: Option<i64>, default: Option<ScalarValue>) -> expr::WindowFunction {
let offset = lit(ScalarValue::Int64(offset));
let default_value = match default {
Some(v) => lit(v),
None => lit(ScalarValue::Null),
};
expr::WindowFunction::new(BuiltInWindowFunction::Lead, vec![arg, offset, default_value])
}
and for nth value it's easier
pub fn nth_value(arg: Expr, n: i64) -> expr::WindowFunction {
expr::WindowFunction::new(BuiltInWindowFunction::NthValue, vec![arg, lit(n)])
}
I think it might be easier to use with something like
That makes sense to me @timsaucer -- thank you for the feedback.
I don't think I will have time to work on this PR / issue for a while, but I tried to give it some visibility in https://github.com/apache/datafusion/issues/10395 and hopefully we can find someone else who is interested in helping make it work
(or of course if you have the time it would be an amazing contribution 🙏 )
I'm willing to work on this, but I'd like to wrap up the examples I have going into datafusion-python first. I think those will have more impact. I only have an hour or so a day to work on this, so I can't promise how fast I'll get to it.
I'm willing to work on this, but I'd like to wrap up the examples I have going into datafusion-python first. I think those will have more impact. I only have an hour or so a day to work on this, so I can't promise how fast I'll get to it.
Thank you very much 🙏
I totally understand re time constraints. Maybe we'll find some other people to help too. One can hope
Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.
We can revive this PR / its API when someone has time to work on it
In case anyone is following along, @jayzhan211 added a really nice trait for working with aggregate functions. Maybe we can do something similar for window functions eventually
https://github.com/apache/datafusion/blob/e693ed7a3c3b36405f0a34887e6f8b49d4e97152/datafusion/expr/src/udaf.rs#L614-L653
I should have time to work on this near the end of the week or early next week. I'm in nearing the end of a large PR for datafusion-python and can dive back on this topic in once that's up for review.
Thanks @timsaucer -- looking forward to it.
I think the most useful part of this PR are the doc examples