datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

RFC: Make it easier to call window functions via expression API (and add example)

Open alamb opened this issue 2 years ago • 7 comments

before polishing this PR up I wanted to get some feedback on the approach

Which issue does this PR close?

Related to https://github.com/apache/arrow-datafusion/issues/5781 Closes: https://github.com/apache/arrow-datafusion/issues/6747

Rationale for this change

@mustafasrepo suggested adding an example for DataFrame::window here: https://github.com/apache/arrow-datafusion/pull/6703#discussion_r1238403292

When I tried to do so it turns out that creating an Expr::Window is quite challenging, so I took a step back and tried to clean up the API a little

What changes are included in this PR?

  1. Change WindowExpr to be a builder style interface and thus making it easier to construct Expr::WindowFunction variants. This is modeled after the CaseBuilder
  2. Add the lead, lag, etc. in expr_fns to follow the model of the other types of functions

Are these changes tested?

Yes

TOOD: I need to write tests for all the code in expr_fns which I will add to https://github.com/apache/arrow-datafusion/blob/main/datafusion/core/tests/dataframe/dataframe_functions.rs if people like this basic approach

Are there any user-facing changes?

yes, the APIs to create window functions are improved

alamb avatar Jun 22 '23 17:06 alamb

As a new user, I like this change very much. I found creating lead and lag functions to be difficult to use. Suppose I wanted to call lead but provide the other two optional arguments - shift offset and default value. Would the best way to do that with this proposal to create the WindowFunction directly?

let single_lead = lead(col("mycol"));
let two_lead_with_default = expr::WindowFunction::new(BuiltInWindowFunction::Lead, vec![col("mycol"), lit(2), lit(mydefault)]);

timsaucer avatar May 02 '24 13:05 timsaucer

As a new user, I like this change very much. I found creating lead and lag functions to be difficult to use. Suppose I wanted to call lead but provide the other two optional arguments - shift offset and default value. Would the best way to do that with this proposal to create the WindowFunction directly?

I think it would look something like this

let single_lead = lead(col("mycol"))
  .
let two_lead_with_default = lead.call(col("mycol"))
  .offset(lit(2))
  .partition_by(lit(mydefault)]))
  .build()

alamb avatar May 03 '24 09:05 alamb

I'm looking at the arguments we end up passing. I think cume_dist takes no arguments. Lead and lag take anywhere between 1 and 3. nth_value takes two. I think it might be easier to use with something like

pub fn lead(arg: Expr, offset: Option<i64>, default: Option<ScalarValue>) -> expr::WindowFunction {
    let offset = lit(ScalarValue::Int64(offset));
    let default_value = match default {
        Some(v) => lit(v),
        None => lit(ScalarValue::Null),
    };
    expr::WindowFunction::new(BuiltInWindowFunction::Lead, vec![arg, offset, default_value])
}

and for nth value it's easier

pub fn nth_value(arg: Expr, n: i64) -> expr::WindowFunction {
    expr::WindowFunction::new(BuiltInWindowFunction::NthValue, vec![arg, lit(n)])
}

timsaucer avatar May 03 '24 12:05 timsaucer

I think it might be easier to use with something like

That makes sense to me @timsaucer -- thank you for the feedback.

I don't think I will have time to work on this PR / issue for a while, but I tried to give it some visibility in https://github.com/apache/datafusion/issues/10395 and hopefully we can find someone else who is interested in helping make it work

alamb avatar May 06 '24 16:05 alamb

(or of course if you have the time it would be an amazing contribution 🙏 )

alamb avatar May 06 '24 16:05 alamb

I'm willing to work on this, but I'd like to wrap up the examples I have going into datafusion-python first. I think those will have more impact. I only have an hour or so a day to work on this, so I can't promise how fast I'll get to it.

timsaucer avatar May 07 '24 16:05 timsaucer

I'm willing to work on this, but I'd like to wrap up the examples I have going into datafusion-python first. I think those will have more impact. I only have an hour or so a day to work on this, so I can't promise how fast I'll get to it.

Thank you very much 🙏

I totally understand re time constraints. Maybe we'll find some other people to help too. One can hope

alamb avatar May 07 '24 19:05 alamb

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

github-actions[bot] avatar Jul 07 '24 01:07 github-actions[bot]

We can revive this PR / its API when someone has time to work on it

In case anyone is following along, @jayzhan211 added a really nice trait for working with aggregate functions. Maybe we can do something similar for window functions eventually

https://github.com/apache/datafusion/blob/e693ed7a3c3b36405f0a34887e6f8b49d4e97152/datafusion/expr/src/udaf.rs#L614-L653

alamb avatar Jul 07 '24 11:07 alamb

I should have time to work on this near the end of the week or early next week. I'm in nearing the end of a large PR for datafusion-python and can dive back on this topic in once that's up for review.

timsaucer avatar Jul 08 '24 12:07 timsaucer

Thanks @timsaucer -- looking forward to it.

I think the most useful part of this PR are the doc examples

alamb avatar Jul 09 '24 21:07 alamb