dplyr
`.by` argument as alternative to `group_by`
Edit: Updated to reflect @DavisVaughan's suggestion here.
Any thoughts on implementing a `.by` arg so that functions can operate by group without returning a `grouped_df`?
Basically this:
df <- tibble(x = 1:3, y = c("a", "a", "b"))
df %>%
  mutate(pct = x / sum(x), .by = y)
would be equivalent to this:
df %>%
  group_by(y) %>%
  mutate(pct = x / sum(x)) %>%
  ungroup()
I have a feeling this would be in the verb itself if we did this, like:
df %>%
  mutate(pct = x / sum(x), .groups = y)
Makes sense - I'll update the original request.
One thing to note - I think there would have to be a different name because `summarize()` already has a `.groups` arg. Or maybe since it's experimental it can be repurposed?
Isn't this `with_groups()`?
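For reference, a minimal sketch of how `with_groups()` (available in dplyr 1.0.0 and later) expresses the example from the original request. The data frame is the one from the top of the thread; the name `res` is only illustrative.

```r
library(dplyr)

df <- tibble(x = 1:3, y = c("a", "a", "b"))

# Group by y only for the duration of the mutate() call;
# the result comes back ungrouped.
res <- df %>%
  with_groups(y, mutate, pct = x / sum(x))

is_grouped_df(res)
```

The grouping variable and the verb are passed as separate arguments, which is the part some commenters in this thread find less intuitive than an argument on the verb itself.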
I think a `.groups` arg would be much more intuitive to use. If implemented it would probably supersede `with_groups()`.
Alternative argument name would be `.by`
Perhaps also a `.by`/`.groups` argument for `filter()`? I use `filter()` after `group_by()` quite often, when I need to keep/discard entire groups. For example:
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
iris %>%
  group_by(Species) %>%
  filter(mean(Petal.Length) >= 2)
#> # A tibble: 100 × 5
#> # Groups: Species [2]
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 7 3.2 4.7 1.4 versicolor
#> 2 6.4 3.2 4.5 1.5 versicolor
#> 3 6.9 3.1 4.9 1.5 versicolor
#> 4 5.5 2.3 4 1.3 versicolor
#> 5 6.5 2.8 4.6 1.5 versicolor
#> 6 5.7 2.8 4.5 1.3 versicolor
#> 7 6.3 3.3 4.7 1.6 versicolor
#> 8 4.9 2.4 3.3 1 versicolor
#> 9 6.6 2.9 4.6 1.3 versicolor
#> 10 5.2 2.7 3.9 1.4 versicolor
#> # … with 90 more rows
Created on 2022-03-21 by the reprex package (v2.0.0)
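Note that the result above is still grouped by `Species`. A sketch of the same filter with the grouping dropped explicitly, which is what a per-call grouping argument would presumably do implicitly (the name `kept` is only illustrative):

```r
library(dplyr)

# Keep only species whose mean petal length is at least 2,
# then drop the grouping so later verbs are not affected by it.
kept <- iris %>%
  group_by(Species) %>%
  filter(mean(Petal.Length) >= 2) %>%
  ungroup()

is_grouped_df(kept)
```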
In my brain, if we were to add it to one of the major verbs, then we would add it to all of them
(Oops accidentally closed sorry)
We've discussed this idea a number of times (I think we'd call it `.by`) but it would be a major change to the way that dplyr works.
(One other nice feature is that the presence of this argument would clearly advertise which functions are affected by grouping; `select()` etc would lack it.)
@hadley do you have any idea how it would support rowwise? Maybe it wouldn't? It's a little tricky because `.by` would either be `NULL` or a tidyselect expression of columns to group by, so it feels like it either wouldn't support it or we'd have to special case some expression like `.by = .row` (which may be hard to get right in all cases, not sure)
We'd always be able to recommend an explicit `rowwise()` call if we decide it shouldn't support it
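A minimal sketch of what that explicit `rowwise()` route looks like today. The `scores` data and column names are made up for illustration.

```r
library(dplyr)

scores <- tibble(id = 1:3, a = c(1, 5, 2), b = c(4, 2, 9))

# rowwise() makes each row its own group, so max() is computed per row;
# ungroup() removes the rowwise grouping again.
res <- scores %>%
  rowwise() %>%
  mutate(best = max(a, b)) %>%
  ungroup()

res$best
```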
@DavisVaughan yeah, we'd either need some special sentinel, or declare it's out of scope.
An additional `.rowwise = FALSE` arg that cannot be `TRUE` if `.by` is set would work too.
FWIW I would love the feature. I'm seeing clients use a lot of code with grouped tibbles that should be ungrouped, but they don't bother because they'll regroup before their next summarise call. As a consequence, the mutate and filter calls are dangerous and inefficient, and the code is cluttered with messages about those groupings that are unfortunately not acted upon. It would also make simple aggregations for data exploration much more compact.
@moodymudskipper supporting different forms of grouping by adding extra arguments to every dplyr verb is not a viable strategy.
I'm also not sure how to support `dplyr::group_by(.drop = FALSE)` with `.by`. Like this? Adding `.drop` everywhere feels very clunky.
df %>% summarise(avg = mean(x), .by = fct, .drop = FALSE)
How useful is `.drop`?
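For context, a small sketch of what `.drop = FALSE` changes in current `group_by()` behaviour: factor levels with no rows are kept as empty groups. The data frame and the unused level `"c"` are made up for illustration.

```r
library(dplyr)

df <- tibble(
  x   = c(1, 2, 3),
  fct = factor(c("a", "a", "b"), levels = c("a", "b", "c"))
)

# Default: the unused level "c" is dropped from the result.
dropped <- df %>%
  group_by(fct) %>%
  summarise(n = n())

# .drop = FALSE: "c" is kept as an empty group with n = 0.
kept_all <- df %>%
  group_by(fct, .drop = FALSE) %>%
  summarise(n = n())

nrow(dropped)
nrow(kept_all)
```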
I'd like to suggest using `.group_by` instead of `.by`.
- You're not "mutating by" or "summarizing by". You're "grouping by".
- It's consistent with the `group_by()` name, so it'd be familiar to people who already use it.
- Some may conflate the `.by` for joins with this argument.
One change to consider with this approach is that it takes the sequential calling style that's pretty central to tidyverse style and makes it nested. Maybe coding style can alleviate that issue though?
data %>%
  group_by(var1, var2, subject_id) %>%
  summarize(mu = mean(x), sigma = sd(x)) %>%
  ungroup()
versus
data %>%
  summarize(
    .group_by = c(var1, var2, subject_id),
    mu = mean(x), sigma = sd(x)
  )
Also, what about autocomplete in RStudio? Will it be able to autocomplete column names for the `.by` parameter like it does for `group_by()`? Autocomplete is pretty critical for those of us who can never remember the names of columns or how to spell them :)
I'm not sure what is so bad about
data %>%
  summarize(
    mu = mean(x),
    sigma = sd(x),
    .group_by = c(var1, var2, subject_id)
  )
I feel like it nicely encapsulates the full operation, and doesn't look overly nested to me. You aren't adding any more nesting than what was already there from the actual summary expressions.
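To experiment with this calling style before anything is decided, here is a rough sketch of a user-level wrapper. `summarise_by()` is a hypothetical helper, not part of dplyr; it assumes dplyr 1.0.0 or later for `across()` and uses `mtcars` purely as stand-in data.

```r
library(dplyr)

# Hypothetical helper: group by a tidyselect expression for one
# summarise() call and always return an ungrouped result.
summarise_by <- function(.data, ..., .by) {
  .data %>%
    group_by(across({{ .by }})) %>%
    summarise(..., .groups = "drop")
}

res <- summarise_by(
  mtcars,
  mu = mean(mpg),
  sigma = sd(mpg),
  .by = c(cyl, am)
)

is_grouped_df(res)
```

Because `.by` comes after `...`, it has to be passed by name, which matches how the argument is written in the examples in this thread.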
@DavisVaughan I agree, but even though I've been using R for 6+ years, I still register `c()` as a function call. That's why I think of it as nested. But that's probably tangential to this thread ;)
Oh I didn't realize you meant that `c()` was the extra level of nesting. Yea that looks "flat" to me, not as another layer of nesting.
Also FWIW the IDE will still autocomplete variable names if you pipe into it with this setup
I've opened an RStudio issue to see if we can get autocomplete for the `.by` argument itself, which probably wouldn't be part of the generic itself (like `.keep`): https://github.com/rstudio/rstudio/issues/11627