dplyr
Conditionally mutate selected rows
This would allow supporting an efficient mutate_if_row() verb here or elsewhere (assuming there's also a nice way to set the group data, as implemented in update_group_data() here). I remember a discussion about using the group data for other exciting things such as bootstrapping?
In the example below, the first three rows should remain unchanged.
library(tidyverse)
df <- tibble(a = 1:5)
df
#> # A tibble: 5 x 1
#> a
#> <int>
#> 1 1
#> 2 2
#> 3 3
#> 4 4
#> 5 5
update_group_data <- function(.data, group_data) {
attr(.data, "groups") <- group_data
.data
}
group_filter <- function(.data, ...) {
new_group_data <-
.data %>%
group_data() %>%
filter(...)
.data %>%
update_group_data(new_group_data)
}
mutate_if_row <- function(.data, cond, ...) {
cond <- rlang::enquo(cond)
.data %>%
group_by(.flag = !!cond) %>%
group_filter(.flag) %>%
mutate(...) %>%
ungroup() %>%
select(-.flag)
}
df %>%
mutate_if_row(a > 3, a = a + 1L)
#> # A tibble: 5 x 1
#> a
#> <int>
#> 1 NA
#> 2 NA
#> 3 NA
#> 4 5
#> 5 6
Created on 2018-12-21 by the reprex package (v0.2.1.9000)
We can fake it already, but overwriting would be a tad faster:
library(tidyverse)
df <- tibble(a = 1:5)
df
#> # A tibble: 5 x 1
#> a
#> <int>
#> 1 1
#> 2 2
#> 3 3
#> 4 4
#> 5 5
if_flag <- function(quo, name) {
rlang::quo_set_expr(
quo,
expr(if (.flag[1]) !!rlang::quo_get_expr(quo) else !!rlang::sym(name))
)
}
mutate_if_row <- function(.data, cond, ...) {
cond <- rlang::enquo(cond)
quos <- rlang::quos(...)
quos <- map2(quos, names(quos), if_flag)
.data %>%
group_by(.flag = !!cond) %>%
mutate(!!!quos) %>%
ungroup() %>%
select(-.flag)
}
df %>%
mutate_if_row(a > 3, a = a + 1L)
#> # A tibble: 5 x 1
#> a
#> <int>
#> 1 1
#> 2 2
#> 3 3
#> 4 5
#> 5 6
Created on 2018-12-21 by the reprex package (v0.2.1.9000)
Also not that trivial to implement. We can only realistically do that when R says the object has only one reference.
This, to me, looks like modify by reference, à la data.table, and is out of scope for dplyr.
This sounds like a use case for case_when:
library(dplyr)
df <- tibble(a = 1:5)
df %>%
mutate(a = case_when(
a > 3 ~ a + 1L,
TRUE ~ a
))
#> # A tibble: 5 x 1
#> a
#> <int>
#> 1 1
#> 2 2
#> 3 3
#> 4 5
#> 5 6
Created on 2018-12-21 by the reprex package (v0.2.1.9000)
mutate_if_row() is better, less noise and works for updating multiple columns at once. I've heard this question now multiple times in workshops.
I see your point, we need to copy anyway, even if R says it has only one copy. Copying via memcpy() or R's duplication mechanism still will be faster.
Maybe something to consider for 0.9.0?
I see, I think I was confused by the group_by(). See also mutate_when() from https://gist.github.com/romainfrancois/eeeed972d6734bcad3ec3dcf872df7ea
library(rlang)
library(dplyr)
library(purrr)
mutate_when <- function(data, condition, ...){
condition <- enquo(condition)
dots <- exprs(...)
expressions <- map2( dots, syms(names(dots)), ~{
quo( case_when(..condition.. ~ !!.x , TRUE ~ !!.y ) )
})
data %>%
mutate( ..condition.. = !!condition ) %>%
mutate( !!!expressions ) %>%
select( -..condition..)
}
d <- tibble( x = 1:4, y = 1:4)
mutate_when( d, x < 3,
x = -x,
y = -y
)
#> # A tibble: 4 x 2
#> x y
#> <int> <int>
#> 1 -1 -1
#> 2 -2 -2
#> 3 3 3
#> 4 4 4
Created on 2019-01-29 by the reprex package (v0.2.1.9000)
Here are some approaches using data frame returns:
- "manually":
library(dplyr)
d <- tibble( x = 1:4, y = 1:4)
# using data frame returns
d %>%
mutate({
test <- x < 4
x[test] <- -x[test]
y[test] <- -y[test]
data.frame(x = x, y = y)
})
#> # A tibble: 4 x 2
#> x y
#> <int> <int>
#> 1 -1 -1
#> 2 -2 -2
#> 3 -3 -3
#> 4 4 4
if we want to do the same thing to a selected set of columns, we can use across() and a bit of code around:
# using across()
d %>%
mutate({
test <- x < 4
across(c(x, y), ~ {.x[test] <- -.x[test]; .x })
})
#> # A tibble: 4 x 2
#> x y
#> <int> <int>
#> 1 -1 -1
#> 2 -2 -2
#> 3 -3 -3
#> 4 4 4
and we can further abstract, e.g.
negate_if <- function(condition, cols) {
across({{ cols }}, ~ {
.x[condition] <- -.x[condition]
.x
})
}
d %>%
mutate(negate_if(x < 4, c(x, y)))
#> # A tibble: 4 x 2
#> x y
#> <int> <int>
#> 1 -1 -1
#> 2 -2 -2
#> 3 -3 -3
#> 4 4 4
Now if we want to do arbitrary mutations, e.g. mutate_when(d, x < 4, x = -x, y = -y) we can do something like this, with some assumptions:
mutate_when <- function(.data, when, ...) {
dots <- enquos(...)
names <- names(dots)
mutate(.data, {
test <- {{ when }}
changed <- data.frame(!!!dots)
out <- across(all_of(names))
# assuming `changed` and `out` have the same data frame type
out[test, ] <- changed[test, ]
out
})
}
mutate_when(d, x < 4, x = -x, y = -y)
#> # A tibble: 4 x 2
#> x y
#> <int> <int>
#> 1 -1 -1
#> 2 -2 -2
#> 3 -3 -3
#> 4 4 4
Created on 2021-04-21 by the reprex package (v0.3.0)
This all feels like something we can do with the tools available, perhaps in some other package?
I just wanted to mention that mutate_when() is a function I would love to see incorporated. I posted this question on Stack Overflow, essentially asking if there was a simpler syntax for creating variables in a mutate, without a bunch of repetitive case_when() or ifelse() statements.
In my particular use case, I am creating code which creates output based upon a flow chart. My intended end users are less familiar with R, and I don't want them to get overwhelmed by the sheer volume of repetitive code. IMO, this mutate_when() function is intuitive in conjunction with pipe %>%.
Because I am naive and new, I thought something like this might work...
data %>%
group_by(g1, g2) %>%
mutate(
across(
where(
condition
)
),
var1 = "happy",
var2 = var22 / var19 + 3,
var3 = ifelse(
var2 >= 3,
TRUE,
FALSE
),
...more var statements...
)
Thanks @romainfrancois for posting this function.
We gave a serious attempt at this in #6313 for dplyr 1.1.0, but ultimately decided not to add it in that release.
We aren't convinced that it is an operation that would be heavily used, as the main example usage we could come up with was replacing missing values, i.e.:
mutate(df, x = 0, .when = is.na(x))
We can't think of many examples beyond this one where this would be very useful.
Here are a few notes we should consider in the future when thinking about this:
- Should groups be ignored when computing .when? To match SQL and data table, it makes sense to ignore groups. I also can't think of any examples where a grouped application of .when makes sense. But this confused some people, especially because they might be passing a grouped-df in, like group_by() %>% mutate(.when =). This becomes slightly less confusing in the context of .by, i.e. mutate(.when =, .by = ), where we'd just document that .when is applied first.
- To be performant, we have to hook this into the data mask. You have to evaluate .when to get the locations where ... should be applied, and then only the columns referenced in ... should be sliced to the locations referenced by .when. We did this successfully in #6313, but it required a decent chunk of refactoring.
- Would you ever want more than 1 .when call in a single mutate()? Some people proposed an API of mutate(when(is.na(x), x = 0), when(y == 4, a = 5, b = 6)). I don't personally think this would be that useful. If we did this, we might also consider making sequential when() calls work in a case-when like fashion.
- Should .when allow if_any() and if_all() in the expression? It seems like they might be useful as a way to compute a complex when expression based on multiple columns, but is somewhat hard to implement. We didn't do that in #6313.
We have to think about how useful this function is in light of the fact that we now have the ability to create type stable case_when() and case_match() calls. i.e. this handles the most common case of .when:
mutate(
x = case_match(x, NA ~ 0, .ptype = x, .default = x)
)
And that could be wrapped into a replace_match(x, NA ~ 0) helper. Updating multiple columns based on 1 condition is also something if_else() can do now:
mutate(
if_else(
is.na(x) | is.na(y),
tibble(x = 0, y = 0),
tibble(x = x, y = y)
)
)
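As for the replace_match() helper mentioned above, a minimal sketch could be a thin wrapper around case_match(). No such helper exists in dplyr; the name and signature here are just the ones floated in this comment, and the sketch assumes dplyr >= 1.1.0 for case_match().

```r
library(dplyr)

# Hypothetical helper (not exported by dplyr): replace matched values in `x`,
# leave everything else as is, and keep the type of `x`.
replace_match <- function(x, ...) {
  case_match(x, ..., .default = x, .ptype = x)
}

df <- tibble(x = c(1L, NA, 3L))
out <- mutate(df, x = replace_match(x, NA ~ 0))
```

Because .ptype = x is fixed, the replacement 0 is cast to integer here and the column type does not change.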
A nice little alternative to mutate(.when = ) we could consider. Slightly simpler than if_else or case_when equivalents, also type and size stable by default, and takes integer positions for i rather than just logical ones. Also supports data frames for x and value if you want to use 1 condition for multiple columns, like the example just above.
replace_at <- function(x, i, value) {
size <- vctrs::vec_size(x)
i <- vctrs::vec_as_location(i = i, n = size, missing = "remove")
# recycle up to size of x
value <- vctrs::vec_recycle(value, size, x_arg = "value")
# slice down to locations selected by i
value <- vctrs::vec_slice(value, i)
vctrs::vec_assign(x, i, value)
}
# with a vector the same size as x
mutate(
flights,
dep_delay = replace_at(dep_delay, dep_time > 500, -dep_delay)
)
# with a value
mutate(
flights,
dep_delay = replace_at(dep_delay, dep_time > 500, NA)
)
# at integer locations in x
mutate(
flights,
dep_delay = replace_at(dep_delay, c(5, 3), NA)
)
How about
mutate(flights, replace_at(dep_time > 500, dep_delay = -dep_delay))
with replace_at() returning a suitable data frame?
That can't be written as a standalone function IIUC. My hope was that we could figure out something that works outside of dplyr too
I'm thinking about something along the following lines:
options(conflicts.policy = list(warn = FALSE))
library(rlang)
library(vctrs)
library(tibble)
library(dplyr)
library(purrr)
replace_at <- function(where, ..., .envir = parent.frame()) {
replacement <- tibble(...)
orig_names <- names(replacement)
orig_values <- as_tibble(map(set_names(orig_names), get0, .envir))
vec_assign(orig_values, where, replacement)
}
foo <- 1:3
replace_at(2, foo = 5)
#> # A tibble: 3 × 1
#> foo
#> <int>
#> 1 1
#> 2 5
#> 3 3
tibble(foo) |>
mutate(replace_at(2, foo = 5))
#> # A tibble: 3 × 1
#> foo
#> <int>
#> 1 1
#> 2 5
#> 3 3
Created on 2023-11-03 with reprex v2.0.2
tidygraph now has a focus() verb that sort of does this