dbplyr icon indicating copy to clipboard operation
dbplyr copied to clipboard

dplyr grouped distinct does not split-apply-combine calculations, but dbplyr does

Open machow opened this issue 3 years ago • 6 comments

Hello--it appears that dplyr's grouped distinct applies calculations to the whole frame, while dbplyr uses a partition to apply calculations within groups.

reprex using dplyr v1.0.10:

library(dplyr, warn.conflicts = FALSE)
library(dbplyr, warn.conflicts = FALSE)

df <- tibble(g = c("a", "a", "b"), x = c(1, 2, 3))
df %>% group_by(g) %>% distinct(avg = mean(x))
#> # A tibble: 2 × 2
#> # Groups:   g [2]
#>   g       avg
#>   <chr> <dbl>
#> 1 a         2
#> 2 b         2

memdb_frame(df) %>% group_by(g) %>% distinct(avg = mean(x, na.rm=TRUE)) %>% show_query()
#> <SQL>
#> SELECT DISTINCT `g`, AVG(`x`) OVER (PARTITION BY `g`) AS `avg`
#> FROM `dbplyr_001`

Created on 2022-10-06 by the reprex package (v2.0.1)

machow avatar Oct 06 '22 16:10 machow

This seems like a dplyr bug to me.

hadley avatar Oct 06 '22 21:10 hadley

There are 4 (ish) verbs that have data-masking semantics when I sort of feel they probably should have had tidy-select semantics, those are:

  • distinct()
  • count()
  • arrange()
  • group_by()

In all of those, you can use the ... to select variables and to create new ones on the fly. Also, in all of these we ungroup before applying the ..., which I think further solidifies that these maybe should be tidy-select only, because there was clearly some confusion there (at the very least, any change here would have to be applied to all 4 to be consistent).

The 99.9% case of all of these is to select existing variables to arrange/count by, which means users are just supplying bare symbols. The .1% case is supplying some kind of actual expression to create on the fly, which is where the question of whether or not to respect existing groups comes into play.

I imagine that in the past we always ungrouped for performance, like if the 99% case is just supplying bare symbols, we wouldn't want that to be slowed down by having a grouped data frame and having to evaluate the column per group. I think we have some optimizations for the bare symbol case now, so this wouldn't necessarily hold today.

Nevertheless, I think we are probably too far down the current path to change the behavior for all 4 of these functions now. And more importantly, with tidyverse/dplyr#6528 we are likely to see less grouped data frames floating around in general, which would mean this would come up less in practice anyways because verbs like distinct() and count() would be less likely to get a <grouped_df> as an input.

DavisVaughan avatar Nov 17 '22 17:11 DavisVaughan

I'm a bit confused. Doesn't across confer tidy-select semantics on data-masking functions? What's the rationale for giving something tidy-select only semantics (if I understand, this also means prohibiting data-masking)?

Is the idea that people want to tidy-select with these functions more often than data-mask?

FWIW DRob's corpus of analyses uses data-masking in group_by often, and it's mentioned in his 10 tremendous tricks of the tidyverse talk. I could see how tidy-select in group_by makes life easier for a whole other set of activities though (by not requiring across), just trying to wrap my head around this!

machow avatar Nov 18 '22 16:11 machow

Is the idea that people want to tidy-select with these functions more often than data-mask?

Yes definitely this. Making these functions data-masking (i.e. allowing pretty much any arbitrary expression) makes them extremely complicated with a lot of edge case bugs that have popped up over the years - especially related to what to do when the input is already a grouped-df.

A tidy-select interface would be much simpler and more robust, and would capture the 99% case of each of these functions, and I think there is something pretty powerful about that

DavisVaughan avatar Nov 18 '22 16:11 DavisVaughan

To be clear, we're not proposing to change the syntax, we're just looking back at the design choices we made in dplyr and wishing we'd done it differently.

But overall that nets out to changing this behaviour in dbplyr, not dplyr.

hadley avatar Dec 11 '22 21:12 hadley

Moved to dbplyr.

hadley avatar Dec 15 '22 20:12 hadley