ibis icon indicating copy to clipboard operation
ibis copied to clipboard

ux: what are the semantics of window functions that occur after a group by?

Open wleescor opened this issue 2 years ago • 2 comments

What happened?

Not sure if this is a bug or I'm not understanding window functions properly, but I thought that when you group by and then mutate, it will allow you to use window functions (without ibis.window and without grouping again). Here is an example:

test = ibis.memtable({'a': [1, 2, 2, 3],
                          'b': [4, 5, 6, 7]})
test \
.group_by('a') \
.order_by(['a', 'b']) \
.mutate(n=ibis.row_number()) \
.mutate(c = _.n.last())

The output for column c did not get the last n column for each a group, but only the global last n

here is how I solve this (but I thought it would automatically be applying the window function:

test \
.group_by('a') \
.order_by(['a', 'b']) \
.mutate(n=ibis.row_number()) \
.mutate(c = _.n.last().over(ibis.window(group_by='a')))

or I could do this

test \
    .group_by('a') \
    .order_by(['a', 'b']) \
    .mutate(n=ibis.row_number()) \
    .group_by('a') \
    .order_by(['a', 'b']) \
    .mutate(c = _.n.last())

so maybe I'm answering my own question in that the groupby / mutate method allows window functions only after the first mutate. After that you need to group_by again or apply ibis.window? Maybe a feature request is to retain the group_by for subsequent mutates (which seems more intuitive, rather than this silent error from misunderstanding)

What version of ibis are you using?

7.0

What backend(s) are you using, if any?

DuckDB

Relevant log output

initial output - the last mutate to get column c is not what i expect
┏━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━┓
┃ a     ┃ b     ┃ n     ┃ c     ┃
┡━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━┩
│ int64 │ int64 │ int64 │ int64 │
├───────┼───────┼───────┼───────┤
│     2 │     5 │     0 │     0 │
│     2 │     6 │     1 │     0 │
│     1 │     4 │     0 │     0 │
│     3 │     7 │     0 │     0 │
└───────┴───────┴───────┴───────┘

Expecting this output (which is solved by using ibis.window, or the other technique of grouping by a again)
┏━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━┓
┃ a     ┃ b     ┃ n     ┃ c     ┃
┡━━━━━━━╇━━━━━━━╇━━━━━━━╇━━━━━━━┩
│ int64 │ int64 │ int64 │ int64 │
├───────┼───────┼───────┼───────┤
│     2 │     5 │     0 │     1 │
│     2 │     6 │     1 │     1 │
│     1 │     4 │     0 │     0 │
│     3 │     7 │     0 │     0 │
└───────┴───────┴───────┴───────┘

Code of Conduct

  • [X] I agree to follow this project's Code of Conduct

wleescor avatar Dec 20 '23 13:12 wleescor

Hi @wleescor 👋🏻!

Thanks for opening an issue about this.

I agree that it's surprising to only have the first group/order chain apply to subsequent window functions.

There's at least one unanswered question about how to address this issue in a way that makes sense with other use cases:

What is the behavior in cases where there's a group by + a window function call that contains a group by in the over call? Are those grouping keys combined? Does the latter take precedence over the former? Same question for order by as well.

These cases are expressions like:

t.group_by("a").mutate(x=_.col.sum().over(group_by="b"))  # is the sum partitioned by a, b?

cpcloud avatar Dec 20 '23 13:12 cpcloud

thanks for your answer! I would think this is a bit like local scope vs global scope, where local scope gets priority over global scope. so in this example, sum is partitioned by b as 'local scope'

maybe we need an 'ungroup' method to make clear when we group and ungroup?

wleescor avatar Dec 31 '23 04:12 wleescor