`.by` argument as alternative to `group_by`

Open markfairbanks opened this issue 2 years ago • 19 comments

Edit: Updated to reflect @DavisVaughan's suggestion here.

Any thoughts on implementing a .by arg so that functions can operate by group without returning a grouped_df?

Basically this:

library(dplyr)

df <- tibble(x = 1:3, y = c("a", "a", "b"))

df %>%
  mutate(pct = x/sum(x), .by = y)

would be equivalent to this:

df %>%
  group_by(y) %>%
  mutate(pct = x/sum(x)) %>%
  ungroup()

Mar 16 '22 14:03 markfairbanks

I have a feeling this would be in the verb itself if we did this, like:

df %>%
  mutate(pct = x/sum(x), .groups = y)

Mar 16 '22 15:03 DavisVaughan

Makes sense - I'll update the original request.

Mar 16 '22 15:03 markfairbanks

One thing to note - I think there would have to be a different name because summarize() already has a .groups arg. Or maybe since it's experimental it can be repurposed?

Mar 16 '22 15:03 markfairbanks

Isn't this with_groups()?

Mar 21 '22 09:03 romainfrancois
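
For reference, a minimal sketch of what the with_groups() version of the original example might look like, reusing the df from the first post. It assumes only that dplyr (1.0.0 or later) is attached; the name res is just for illustration.

```r
library(dplyr)

df <- tibble(x = 1:3, y = c("a", "a", "b"))

# Temporarily group by y, apply mutate() within those groups,
# then restore the grouping the input had (none, in this case).
res <- df %>%
  with_groups(y, mutate, pct = x / sum(x))

res
```

The grouping columns and the verb are passed as arguments to a wrapper function, which is the syntax the proposed argument would replace.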

I think a .groups arg would be much more intuitive to use. If implemented it would probably supersede with_groups().

Mar 21 '22 12:03 markfairbanks

An alternative argument name would be .by.

Mar 21 '22 13:03 DavisVaughan

Perhaps also a .by/.groups argument for filter()? I use filter() after group_by() quite often, when I need to keep/discard entire groups. For example:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

iris %>%
  group_by(Species) %>%
  filter(mean(Petal.Length) >= 2)
#> # A tibble: 100 × 5
#> # Groups:   Species [2]
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species   
#>           <dbl>       <dbl>        <dbl>       <dbl> <fct>     
#>  1          7           3.2          4.7         1.4 versicolor
#>  2          6.4         3.2          4.5         1.5 versicolor
#>  3          6.9         3.1          4.9         1.5 versicolor
#>  4          5.5         2.3          4           1.3 versicolor
#>  5          6.5         2.8          4.6         1.5 versicolor
#>  6          5.7         2.8          4.5         1.3 versicolor
#>  7          6.3         3.3          4.7         1.6 versicolor
#>  8          4.9         2.4          3.3         1   versicolor
#>  9          6.6         2.9          4.6         1.3 versicolor
#> 10          5.2         2.7          3.9         1.4 versicolor
#> # … with 90 more rows

Created on 2022-03-21 by the reprex package (v2.0.0)

Mar 21 '22 14:03 zenggyu

In my brain, if we were to add it to one of the major verbs, then we would add it to all of them

(Oops accidentally closed sorry)

Mar 21 '22 14:03 DavisVaughan

We've discussed this idea a number of times (I think we'd call it .by) but it would be a major change to the way that dplyr works.

(One other nice feature is that the presence of this argument would clearly advertise which functions are affected by grouping; select() etc would lack it.)

Apr 15 '22 17:04 hadley

@hadley do you have any idea how it would support rowwise? Maybe it wouldn’t? It’s a little tricky because .by would either be NULL or a tidyselect expression of columns to group by, so it feels like it either wouldn’t support it or we’d have to special case some expression like .by = .row (which may be hard to get right in all cases, not sure)

We’d always be able to recommend an explicit rowwise() call if we decide it shouldn’t support it

Apr 15 '22 21:04 DavisVaughan

@DavisVaughan yeah, we'd either need some special sentinel, or declare it's out of scope.

Apr 15 '22 23:04 hadley

An additional .rowwise = FALSE arg that cannot be TRUE if .by is set would work too.

FWIW I would love this feature. I'm seeing clients use a lot of code with grouped tibbles that should be ungrouped, but they don't bother because they'll regroup before their next summarise call. As a consequence, the mutate and filter calls are dangerous and inefficient, and the code is cluttered with messages about those groupings that are unfortunately not acted upon. It would also make simple aggregations for data exploration much more compact.

Apr 29 '22 11:04 moodymudskipper

@moodymudskipper supporting different forms of grouping by adding extra arguments to every dplyr verb is not a viable strategy.

Apr 29 '22 12:04 hadley

I'm also not sure how to support dplyr::group_by(.drop = FALSE) with .by. Like this? Adding .drop everywhere feels very clunky.

df %>% summarise(avg = mean(x), .by = fct, .drop = FALSE)

How useful is .drop?

Jun 27 '22 19:06 DavisVaughan
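
For context on what .drop controls today: with group_by(.drop = FALSE), factor levels that have no rows are kept as empty groups. A small sketch with a made-up data frame whose factor has an unused level "c" (the data and the names dropped and kept are purely illustrative):

```r
library(dplyr)

# fct has three levels, but only "a" and "b" occur in the data
df <- tibble(
  fct = factor(c("a", "a", "b"), levels = c("a", "b", "c")),
  x = 1:3
)

# Default (.drop = TRUE): the empty level "c" produces no group
dropped <- df %>%
  group_by(fct) %>%
  summarise(n = n())

# .drop = FALSE: "c" is kept as a group with zero rows
kept <- df %>%
  group_by(fct, .drop = FALSE) %>%
  summarise(n = n())
```

Any .by design would need either an equivalent switch or a documented decision to always drop empty levels.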

I'd like to suggest using .group_by instead of .by.

You're not "mutating by" or "summarizing by". You're "grouping by".
It's consistent with the group_by() name, so it'd be familiar to people who already use it.
Some may conflate the .by for joins with this argument.

One change to consider with this approach is that it takes the sequential calling style that's pretty central to tidyverse style and makes it nested. Maybe coding style can alleviate that issue though?

data %>%
    group_by(var1, var2, subject_id) %>%
    summarize(mu = mean(x), sigma = sd(x)) %>%
    ungroup()

versus

data %>%
    summarize(
        .group_by = c(var1, var2, subject_id),
        mu = mean(x), sigma = sd(x)
    )

Jul 15 '22 14:07 steveharoz

Also, what about autocomplete in RStudio? Will it be able to autocomplete column names for the .by parameter like it does for group_by()?

Autocomplete is pretty critical for those of us who can never remember the names of columns or how to spell them :)

Jul 15 '22 14:07 steveharoz

I'm not sure what is so bad about

data %>%
  summarize(
    mu = mean(x), 
    sigma = sd(x),
    .group_by = c(var1, var2, subject_id)
  )

I feel like it nicely encapsulates the full operation, and doesn't look overly nested to me. You aren't adding any more nesting than what was already there from the actual summary expressions

Jul 15 '22 14:07 DavisVaughan

@DavisVaughan I agree, but even though I've been using R for 6+ years, I still register c( ) as a function call. That's why I think of it as nested. But that's probably tangential to this thread ;)

Jul 15 '22 14:07 steveharoz

Oh I didn't realize you meant that c() was the extra level of nesting. Yea that looks "flat" to me, not as another layer of nesting.

Also FWIW the IDE will still autocomplete variable names if you pipe into it with this setup

I've opened an RStudio issue to see if we can get autocomplete for the .by argument itself, which probably wouldn't be part of the generic itself (like .keep) https://github.com/rstudio/rstudio/issues/11627

Jul 15 '22 14:07 DavisVaughan
