ux: what are the semantics of window functions that occur after a group by?
What happened?
Not sure if this is a bug or I'm not understanding window functions properly, but I thought that when you group by and then mutate, it will allow you to use window functions (without ibis.window and without grouping again). Here is an example:
test = ibis.memtable({'a': [1, 2, 2, 3],
'b': [4, 5, 6, 7]})
test \
.group_by('a') \
.order_by(['a', 'b']) \
.mutate(n=ibis.row_number()) \
.mutate(c = _.n.last())
The output for column c did not get the last n column for each a group, but only the global last n
here is how I solve this (but I thought it would automatically be applying the window function:
test \
.group_by('a') \
.order_by(['a', 'b']) \
.mutate(n=ibis.row_number()) \
.mutate(c = _.n.last().over(ibis.window(group_by='a')))
or I could do this
test \
.group_by('a') \
.order_by(['a', 'b']) \
.mutate(n=ibis.row_number()) \
.group_by('a') \
.order_by(['a', 'b']) \
.mutate(c = _.n.last())
so maybe I'm answering my own question in that the groupby / mutate method allows window functions only after the first mutate. After that you need to group_by again or apply ibis.window? Maybe a feature request is to retain the group_by for subsequent mutates (which seems more intuitive, rather than this silent error from misunderstanding)
What version of ibis are you using?
7.0
What backend(s) are you using, if any?
DuckDB
Relevant log output
initial output - the last mutate to get column c is not what i expect
┏━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━┓
┃ a ┃ b ┃ n ┃ c ┃
┡━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━┩
│ int64 │ int64 │ int64 │ int64 │
├───────┼───────┼───────┼───────┤
│ 2 │ 5 │ 0 │ 0 │
│ 2 │ 6 │ 1 │ 0 │
│ 1 │ 4 │ 0 │ 0 │
│ 3 │ 7 │ 0 │ 0 │
└───────┴───────┴───────┴───────┘
Expecting this output (which is solved by using ibis.window, or the other technique of grouping by a again)
┏━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━┓
┃ a ┃ b ┃ n ┃ c ┃
┡━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━┩
│ int64 │ int64 │ int64 │ int64 │
├───────┼───────┼───────┼───────┤
│ 2 │ 5 │ 0 │ 1 │
│ 2 │ 6 │ 1 │ 1 │
│ 1 │ 4 │ 0 │ 0 │
│ 3 │ 7 │ 0 │ 0 │
└───────┴───────┴───────┴───────┘
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
Hi @wleescor 👋🏻!
Thanks for opening an issue about this.
I agree that it's surprising to only have the first group/order chain apply to subsequent window functions.
There's at least one unanswered question about how to address this issue in a way that makes sense with other use cases:
What is the behavior in cases where there's a group by + a window function call that contains a group by in the over call? Are those grouping keys combined? Does the latter take precedence over the former? Same question for order by as well.
These cases are expressions like:
t.group_by("a").mutate(x=_.col.sum().over(group_by="b")) # is the sum partitioned by a, b?
thanks for your answer! I would think this is a bit like local scope vs global scope, where local scope gets priority over global scope. so in this example, sum is partitioned by b as 'local scope'
maybe we need an 'ungroup' method to make clear when we group and ungroup?