Implement `mutate(.when =)`
Closes #4050. Closes #6304.

This is a fully tested implementation of `mutate(.when =)`.
Implementation details

A few notes on how it works:

- `.when` must evaluate to a logical vector the same size as `.data`. It isn't recycled.
- Groups are ignored when computing `.when`. This might be a little controversial, but I think it makes the most sense:
  - It matches how SQL's WHERE statement works when combined with GROUP BY. The WHERE is applied first and then the groups are computed on the remaining rows. This will be clearer if we have `mutate(df, .when =, .by =)`.
  - You generally don't want `.when` to be evaluated per group (it rarely needs a per-group mean or something like that), but sometimes you want your expressions in `...` to still be evaluated per group after applying a global `.when`. You save a lot of performance in this case by evaluating `.when` on the ungrouped data.
  - If you really need a per-group `.when`, you should just use `if_else()` instead, since expressions in the `...` are evaluated per group.
- It is hooked into the data mask to be performant. Only columns that are referenced in `...` are sliced to the locations referred to by `.when`.
- `.when` is mainly useful for updating existing columns. Because of this, you can't modify the type of the columns you are updating, i.e. if `x` is an integer column then you can't do `mutate(df, x = x + 1.5, .when = y > 2)`.
  - This type stability is an extremely nice property. It makes it more valuable than `if_else()` for updates, because that takes the common type, i.e. `x = if_else(y > 2, x + 1.5, x)` would not be type stable on `x` (see the sketch after this list).
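For concreteness, a small sketch of the type-stability point above. The `if_else()` call is current dplyr behavior; the commented-out `.when` call is only the behavior proposed in this PR.

library(dplyr)

df <- tibble(x = 1:5, y = 1:5)

# if_else() takes the common type of its branches, so x silently becomes a double
mutate(df, x = if_else(y > 2, x + 1.5, x))

# With the proposed .when, x must stay an integer, so this would error instead:
# mutate(df, x = x + 1.5, .when = y > 2)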
Outstanding questions

- Should `.when` allow `if_any()` and `if_all()`? It seems like they might be useful. Right now it requires `.when` to evaluate to a single logical vector. I don't think it should allow `across()` though. (A hypothetical sketch follows this list.)
- Are we ok with this single condition interface? I am. We had a lot of discussion about alternative interfaces that might allow case-when style updates like `mutate(when(x < 2, x = NA, y = 3), when(x < 5, x = 99))`, but I think:
  - That is overly complicated for the 99% case that this is likely to be used for (1 condition, >=1 columns to update).
  - It introduces a clunky `when()` helper that wouldn't be used anywhere else.
  - It opens up a lot of hard to answer questions like: Can you mix `when()` and normal expressions in the same `mutate()` call? Do you have to recompute groups before each `when()`?
  - Keeping it to 1 condition should allow us to nicely translate this to a SQL UPDATE statement for dbplyr, I think.
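For illustration, a hypothetical sketch of what an `if_any()`-style condition could look like if `.when` allowed it. `if_any()` is a real dplyr helper, but the commented-out call is only imagined syntax, and `.when` itself only exists on this PR branch.

library(dplyr)

df <- tibble(x = c(1, NA, 3), y = c(NA, 2, 3))

# Hypothetical: update both columns in rows where any column is missing
# mutate(df, x = 0, y = 0, .when = if_any(everything(), is.na))

# The same condition written out by hand, which this PR's .when already accepts
mutate(df, x = 0, y = 0, .when = is.na(x) | is.na(y))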
Outstanding actions

- [ ] Export `vec_locate_runs()` from vctrs if we end up using it here in the final version of this PR
Examples
library(dplyr)
# From issue 4050
df <- tibble(a = 1:5)
mutate(df, a = a + 1, .when = a > 3)
#> # A tibble: 5 × 1
#> a
#> <int>
#> 1 1
#> 2 2
#> 3 3
#> 4 5
#> 5 6
df <- tibble(x = 1:4, y = 5:8)
mutate(df, x = -x, y = -y, .when = x < 4)
#> # A tibble: 4 × 2
#> x y
#> <int> <int>
#> 1 -1 -5
#> 2 -2 -6
#> 3 -3 -7
#> 4 4 8
# More practical. Replace missing values.
df <- tibble(x = c(1, 2, NA, 4, NA))
mutate(df, x = 0, .when = is.na(x))
#> # A tibble: 5 × 1
#> x
#> <dbl>
#> 1 1
#> 2 2
#> 3 0
#> 4 4
#> 5 0
# Standardize missings across variables
df <- tibble(x = c(1, NA, 2, 4), y = c(1, 2, NA, 4), z = c(NA, "b", "c", "d"))
mutate(df, x = NA, y = NA, z = NA, .when = is.na(x) | is.na(y) | is.na(z))
#> # A tibble: 4 × 3
#> x y z
#> <dbl> <dbl> <chr>
#> 1 NA NA <NA>
#> 2 NA NA <NA>
#> 3 NA NA <NA>
#> 4 4 4 d
Performance

I think performance is generally pretty good. My bar was mainly to be faster than what you could do with an `if_else()`. It is a lot more compact than an `if_else()` too.
library(nycflights13)
library(vctrs)
library(dplyr)
big_flights <- vec_rep(flights, 100)
big_grouped_flights <- group_by(big_flights, year, month)
# Small data, one condition, no groups
bench::mark(
slow = mutate(
flights,
dep_delay = if_else(dep_time > 500, -dep_delay, dep_delay, missing = dep_delay)
),
fast = mutate(
flights,
dep_delay = -dep_delay,
.when = dep_time > 500
),
iterations = 5
)
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 slow 25.3ms 25.5ms 39.0 41.7MB 0
#> 2 fast 12.5ms 12.8ms 78.4 14.1MB 0
# Small data, two conditions, no groups
bench::mark(
slow = mutate(
flights,
dep_delay = if_else(dep_time > 500, -dep_delay, dep_delay, missing = dep_delay),
sched_dep_time = if_else(dep_time > 500, -sched_dep_time, sched_dep_time, missing = sched_dep_time)
),
fast = mutate(
flights,
dep_delay = -dep_delay,
sched_dep_time = -sched_dep_time,
.when = dep_time > 500
),
iterations = 5
)
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 slow 25.7ms 36.2ms 27.9 72.1MB 0
#> 2 fast 13.9ms 15ms 67.7 17.8MB 0
# Big data, two conditions, no groups
bench::mark(
slow = mutate(
big_flights,
dep_delay = if_else(dep_time > 500, -dep_delay, dep_delay, missing = dep_delay),
sched_dep_time = if_else(dep_time > 500, -sched_dep_time, sched_dep_time, missing = sched_dep_time)
),
fast = mutate(
big_flights,
dep_delay = -dep_delay,
sched_dep_time = -sched_dep_time,
.when = dep_time > 500
),
iterations = 5
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 slow 2.54s 2.72s 0.363 7.04GB 1.31
#> 2 fast 806.87ms 972.97ms 1.01 1.73GB 0.805
# Big data, two conditions, with groups
bench::mark(
slow = mutate(
big_grouped_flights,
dep_delay = if_else(dep_time > 500, -dep_delay, dep_delay, missing = dep_delay),
sched_dep_time = if_else(dep_time > 500, -sched_dep_time, sched_dep_time, missing = sched_dep_time)
),
fast = mutate(
big_grouped_flights,
dep_delay = -dep_delay,
sched_dep_time = -sched_dep_time,
.when = dep_time > 500
),
iterations = 5
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 slow 5s 5.45s 0.183 7.92GB 0.623
#> 2 fast 2.33s 2.86s 0.339 3.71GB 0.679
Random lurker here, but just my general thoughts. Having the ability to apply `.when` non-globally would be the best drop-in replacement for `if_else()` IMO. I understand that probably isn't the overall intent of this PR, but it would be much more useful in cases where there are many conditions and writing a global condition would be cumbersome or unreadable.

As I understand it, the alternative with the currently proposed API is to have multiple calls to `mutate()`, each with a separate condition. I think that could be a good performance benchmark to do.
tibble(a = c(1, 2, NA, 3, 5),
       b = c("car", "train", "bus", "truck", "motorcycle"),
       c = c(100, 20, 30, 40, 500)) |>
  mutate(a = if_else(is.na(a), 0, a),
         b = if_else(b == "truck", "pick-up", b),
         c = if_else(c < 100, c * 10, c))

# Example of .when non-global
tibble(a = c(1, 2, NA, 3, 5),
       b = c("car", "train", "bus", "truck", "motorcycle"),
       c = c(100, 20, 30, 40, 500)) |>
  mutate(a = 0: .when = is.na(a),
         b = 'pick-up': .when = b == "truck",
         c = c * 10: .when = c < 100)

# Current API?
# Can get quite cumbersome with many conditions
tibble(a = c(1, 2, NA, 3, 5),
       b = c("car", "train", "bus", "truck", "motorcycle"),
       c = c(100, 20, 30, 40, 500)) |>
  mutate(a = 0,
         b = 'pick-up',
         c = c * 10,
         .when = is.na(a) | b == "truck" | c < 100)

# Compare performance to a single mutate with multiple if_else statements
tibble(a = c(1, 2, NA, 3, 5),
       b = c("car", "train", "bus", "truck", "motorcycle"),
       c = c(100, 20, 30, 40, 500)) |>
  mutate(a = 0, .when = is.na(a)) |>
  mutate(b = "pick-up", .when = b == "truck") |>
  mutate(c = c * 10, .when = c < 100)
Your third example with `.when = is.na(a) | b == "truck" | c < 100` does something very different from the other three, right?
Ah I see, I misunderstood your example at first. Though, it would be interesting to see the performance comparison between these.
tibble(a = c(1, 2, NA, 3, 5),
       b = c("car", "train", "bus", "truck", "motorcycle"),
       c = c(100, 20, 30, 40, 500)) |>
  mutate(a = if_else(is.na(a), 0, a),
         b = if_else(b == "truck", "pick-up", b),
         c = if_else(c < 100, c * 10, c))

tibble(a = c(1, 2, NA, 3, 5),
       b = c("car", "train", "bus", "truck", "motorcycle"),
       c = c(100, 20, 30, 40, 500)) |>
  mutate(a = 0, .when = is.na(a)) |>
  mutate(b = "pick-up", .when = b == "truck") |>
  mutate(c = c * 10, .when = c < 100)
If you sample that up to 1 million rows (to actually get useful benchmarks), then it is pretty clear that repeated mutates are much faster than `if_else()` (which I expected). The cost to set up the `mutate()` call isn't that high.

I also think the `.when =` approach looks much cleaner.
library(dplyr)
df <- tibble(a = c(1,2,NA,3,5),
b = c("car", "train", "bus", "truck", "motorcycle"),
c = c(100,20,30,40,500)
)
# sample up to 1 mil rows
df <- tibble::new_tibble(lapply(df, sample, size = 1e6, replace = TRUE))
bench::mark(
ifelse = df |>
mutate(
a = if_else(is.na(a), 0, a),
b = if_else(b == "truck", "pick-up", b),
c = if_else(c < 100, c * 10, c)
),
when = df |>
mutate(a = 0, .when = is.na(a)) |>
mutate(b = "pick-up", .when = b == "truck") |>
mutate(c = c * 10, .when = c <100)
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 ifelse 167.3ms 235ms 4.37 237.9MB 18.9
#> 2 when 48.7ms 57.6ms 15.6 55.1MB 13.7
> Groups are ignored when computing `.when`

I'm sure there was a lot of discussion around this, but I am going to throw in a vote for this being confusing. The idea that it matches SQL seems less relevant than the fact that `filter` and `mutate` both work by group. And having an argument inside `mutate` that mimics `filter`/`if_else`/`case_when` conditions but doesn't work by group is a bit odd (IMO).
The performance loss seems worth it to keep things consistent with other grouping behaviors.
I really like the rest of the API btw. Seems very useful.
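To make the consistency argument concrete, here is a small sketch (not from the thread) of the existing group-aware `filter()` behavior being referenced:

library(dplyr)

df <- tibble(g = c(1, 1, 2, 2), x = c(1, 5, 10, 50))

# filter() evaluates its condition within each group:
# group 1 keeps x > 3, group 2 keeps x > 30
df |> group_by(g) |> filter(x > mean(x))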
@markfairbanks do you think that this should apply the `.when` expression before or after groups are computed?

df %>% mutate(x = mean(y), .when = is.na(x), .by = g)

For reference, equivalent data.table syntax applies `.when` (`i`) first on all of `df` and then computes the groups.

I was optimizing for this potential syntax, assuming that `.when` would be computed before `.by`. But maybe that would be unexpected there too?
I did just learn that apparently you can't combine UPDATE with GROUP BY in SQL, so this probably isn't that common of an operation. I honestly can't think of a super compelling example of a "grouped update", so if anyone else can then I'd love to see it.
In my head `mutate(.by =)` is shorthand for `df %>% group_by() %>% mutate() %>% ungroup()`, so `.when` would be calculated after `.by`, just like the `...` in `mutate()` would be calculated after `.by`. This is the main reason why I think `.when` should operate by group. Especially after a `group_by()`, I would be surprised if `.when` didn't operate by group when everything else in my pipe chain does.
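As an aside, assuming the `.by` argument as it later shipped in dplyr 1.1.0, the shorthand described above looks like this:

library(dplyr)

df <- tibble(g = c(1, 1, 2), x = c(1, 2, 10))

# Per-operation grouping with .by ...
mutate(df, m = mean(x), .by = g)

# ... is shorthand for the explicit group_by()/ungroup() round trip
df |> group_by(g) |> mutate(m = mean(x)) |> ungroup()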
As for `data.table` - it's a bit of an interesting example because `i` never operates by group but `j` does. You have to do an extra step to filter by group in `data.table`. However, `filter()` always operates by group in dplyr. So `.when` seems different to me than `i` in `data.table`.

Also worth mentioning that `data.table` errors if you try and filter/mutate by group, so I guess it behaves like SQL in that way. Definitely lends to your thought that it's an uncommon operation. But is an ungrouped condition with a grouped update that much more common? Maybe it is 🤷♂️.
library(data.table)
df <- data.table(group = c("a", "a", "b"), val = as.double(1:3))
copy(df)[, .SD[val <= mean(val), val := mean(val)], by = group][]
#> Error: .SD is locked. Using := in .SD's j is reserved for possible future use; a tortuously flexible way to modify by group. Use := in j directly to modify by group by reference.
Note (mainly to self): If we apply `.when` by group then we might end up with groups that don't have any rows left in them, i.e.:
x = [7, 7, 9] [4, 5, 6]
g = [1, 1, 1] [2, 2, 2]
.when = x < 6
# would this "filter" groups, retaining the number of groups?
x = [double()] [4, 5]
g = [integer()] [2, 2]
# or drop empty groups?
x = [4, 5]
g = [2, 2]
The current implementation "drops" groups, but it is clear that that is intended, because theoretically it currently filters with `.when` before computing the groups with `.by`, so there weren't really groups to "drop" in the first place because those rows didn't exist when the groups were computed.

I guess this is the `filter(.preserve = TRUE/FALSE)` problem. I really don't want to add another argument for this case if we don't have to.
library(dplyr)
df <- tibble(
x = c(7, 7, 9, 4, 5, 6),
g = c(1, 1, 1, 2, 2, 2)
) %>%
group_by(g)
df %>%
filter(x < 6) %>%
group_data()
#> # A tibble: 1 × 2
#> g .rows
#> <dbl> <list<int>>
#> 1 2 [2]
df %>%
filter(x < 6, .preserve = TRUE) %>%
group_data()
#> # A tibble: 2 × 2
#> g .rows
#> <dbl> <list<int>>
#> 1 1 [0]
#> 2 2 [2]
Created on 2022-06-28 by the reprex package (v2.0.1)
Examples of grouped mutates with `.when` that come to mind would be group-wise mean imputation of missing values or group-wise outlier trimming.
dat |> mutate(x = mean(x, na.rm = TRUE), .when = is.na(x), .by = g)
dat |> mutate(x = mean(x, na.rm = TRUE) + 2 * sd(x, na.rm = TRUE), .when = x > mean(x, na.rm = TRUE) + 2 * sd(x, na.rm = TRUE), .by = g)
@bwiernik it doesn't work that way, `x = mean(x, na.rm = TRUE)` is only evaluated on the slice of `dat` where `.when = is.na(x)` is `TRUE`, so you'd be computing the mean of a vector of missing values.
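For comparison, a sketch (assuming dplyr >= 1.1.0 for `.by`) of how this group-wise imputation can be written today with `if_else()`, which does evaluate its arguments on the full group:

library(dplyr)

dat <- tibble(g = c(1, 1, 1, 2, 2), x = c(1, NA, 3, 10, NA))

# if_else() evaluates its arguments on the whole group, so mean() sees all of x
# and only the NA positions are replaced
dat |> mutate(x = if_else(is.na(x), mean(x, na.rm = TRUE), x), .by = g)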
Just out of curiosity (and not understanding the underlying mechanics): why is this implemented as an argument `.when` rather than a dplyr function `when(...)`, more akin to `across()`?

IMO, the syntactic form of `.when` here feels quite unfamiliar compared to how I'd expect `mutate()` to work, especially in the cases where there are non-global whens. The colon is not used often (at all?) in other dplyr functions. At the same time, you'd want people to be able to have several changes in a single mutate, as we are used to.
So could this:
tibble(
  a = c(1, 2, NA, 3, 5),
  b = c("car", "train", "bus", "truck", "motorcycle"),
  c = c(100, 20, 30, 40, 500)
) |>
  mutate(
    a = 0: .when = is.na(a),
    b = 'pick-up': .when = b == "truck",
    c = c * 10: .when = c < 100
  )
have instead looked like this:
tibble(
  a = c(1, 2, NA, 3, 5),
  b = c("car", "train", "bus", "truck", "motorcycle"),
  c = c(100, 20, 30, 40, 500)
) |>
  mutate(
    when(is.na(a), a = 0),
    when(b == "truck", b = 'pick-up'),
    when(c < 100, c = c * 10)
  )
and arguably be a more familiar way of working with dplyr? Then again, this might already have been discussed, with good reasons not to go down this path.
The colon is not the actual API, that was just a random proposal. I think your syntax is good, and something they were considering as well.
> it doesn't work that way, `x = mean(x, na.rm = TRUE)` is only evaluated on the slice of `dat` where `.when = is.na(x)` is `TRUE`
@DavisVaughan I somehow missed this as well. To be honest I thought `.when` was more or less an alternate syntax for `if_else()`, but I guess this covers a slightly different use case? I had to go check, and that's how `data.table` works as well.

If that's the intended use of `.when`, I might have to retract my earlier vote 😅. It does seem highly unlikely to need this to work by slice and by group.
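For anyone else checking, a minimal sketch of the `data.table` behavior being compared to: the right-hand side of `:=` in `dt[i, col := value]` is evaluated only on the rows selected by `i`.

library(data.table)

dt <- data.table(x = c(1, NA, 3))

# The right-hand side of := is evaluated only on the rows selected by i,
# and the column is updated in place without changing its type
dt[is.na(x), x := 0]
dt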
I think we are going to close this for now. We aren't entirely convinced that this will benefit a large part of the user base, as we have struggled to come up with many examples where this is useful outside of replacing missing values.

The addition of `case_match()` and the `.ptype` argument to `case_when()` (for type stability) also makes this feature less impactful.
We may return to this idea in the future.
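For reference, a sketch of the type-stable replacement pattern those newer tools enable, assuming dplyr >= 1.1.0:

library(dplyr)

df <- tibble(x = c(1L, NA, 4L))

# case_when() with .ptype keeps x an integer instead of taking a common type
mutate(df, x = case_when(is.na(x) ~ 0L, .default = x, .ptype = integer()))

# case_match() covers the simple value-remapping case
mutate(df, x = case_match(x, NA ~ 0L, .default = x))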