dplyr

Conditionally mutate selected rows

Open krlmlr opened this issue 6 years ago • 12 comments

This would allow supporting an efficient mutate_if_row() verb here or elsewhere (assuming there's also a nice way to set the group data, as implemented in update_group_data() here). I remember a discussion about using the group data for other exciting things such as bootstrapping?

In the example below, the first three rows should remain unchanged, but they come back as NA.

library(tidyverse)

df <- tibble(a = 1:5)
df
#> # A tibble: 5 x 1
#>       a
#>   <int>
#> 1     1
#> 2     2
#> 3     3
#> 4     4
#> 5     5

update_group_data <- function(.data, group_data) {
  attr(.data, "groups") <- group_data
  .data
}

group_filter <- function(.data, ...) {
  new_group_data <-
    .data %>%
    group_data() %>%
    filter(...)

  .data %>%
    update_group_data(new_group_data)
}

mutate_if_row <- function(.data, cond, ...) {
  cond <- rlang::enquo(cond)

  .data %>%
    group_by(.flag = !!cond) %>%
    group_filter(.flag) %>%
    mutate(...) %>%
    ungroup() %>%
    select(-.flag)
}

df %>%
  mutate_if_row(a > 3, a = a + 1L)
#> # A tibble: 5 x 1
#>       a
#>   <int>
#> 1    NA
#> 2    NA
#> 3    NA
#> 4     5
#> 5     6

Created on 2018-12-21 by the reprex package (v0.2.1.9000)

krlmlr avatar Dec 21 '18 15:12 krlmlr

We can fake it already, but overwriting would be a tad faster:

library(tidyverse)

df <- tibble(a = 1:5)
df
#> # A tibble: 5 x 1
#>       a
#>   <int>
#> 1     1
#> 2     2
#> 3     3
#> 4     4
#> 5     5

if_flag <- function(quo, name) {
  rlang::quo_set_expr(
    quo,
    expr(if (.flag[1]) !!rlang::quo_get_expr(quo) else !!rlang::sym(name))
  )
}

mutate_if_row <- function(.data, cond, ...) {
  cond <- rlang::enquo(cond)
  quos <- rlang::quos(...)

  quos <- map2(quos, names(quos), if_flag)

  .data %>%
    group_by(.flag = !!cond) %>%
    mutate(!!!quos) %>%
    ungroup() %>%
    select(-.flag)
}

df %>%
  mutate_if_row(a > 3, a = a + 1L)
#> # A tibble: 5 x 1
#>       a
#>   <int>
#> 1     1
#> 2     2
#> 3     3
#> 4     5
#> 5     6

Created on 2018-12-21 by the reprex package (v0.2.1.9000)

krlmlr avatar Dec 21 '18 15:12 krlmlr

Also not that trivial to implement. We can only realistically do that when R says the object has only one reference.

This, to me, looks like modify by reference, à la data.table, and is out of scope for dplyr.

This sounds like a use case for case_when:

library(dplyr)

df <- tibble(a = 1:5)

df %>%
  mutate(a = case_when(
    a > 3 ~ a + 1L, 
    TRUE  ~ a
  ))
#> # A tibble: 5 x 1
#>       a
#>   <int>
#> 1     1
#> 2     2
#> 3     3
#> 4     5
#> 5     6

Created on 2018-12-21 by the reprex package (v0.2.1.9000)

romainfrancois avatar Dec 21 '18 16:12 romainfrancois

mutate_if_row() is better: it has less noise and works for updating multiple columns at once. I've now heard this question multiple times in workshops.
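
To make the "noise" concrete, here is a minimal sketch of updating two columns under one condition with plain case_when(). The data frame d and its columns are made up for illustration. The condition has to be repeated for every column, and the column the condition depends on is best updated last, because mutate() evaluates its expressions sequentially.

```r
library(dplyr)

# Illustrative data, not taken from the thread
d <- tibble(x = 1:4, y = 1:4)

res <- d %>%
  mutate(
    # the condition `x < 3` is repeated for every column to update
    y = case_when(x < 3 ~ -y, TRUE ~ y),
    # `x` is updated last: any expression after this one would see
    # the already modified `x` when evaluating the condition
    x = case_when(x < 3 ~ -x, TRUE ~ x)
  )
res
```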

I see your point, we need to copy anyway, even if R says it has only one copy. Copying via memcpy() or R's duplication mechanism will still be faster.

Maybe something to consider for 0.9.0?

krlmlr avatar Dec 21 '18 16:12 krlmlr

I see, I think I was confused by the group_by(). See also mutate_when() from https://gist.github.com/romainfrancois/eeeed972d6734bcad3ec3dcf872df7ea

library(rlang)
library(dplyr)
library(purrr)

mutate_when <- function(data, condition, ...){
  condition <- enquo(condition)
  
  dots <- exprs(...)
  
  expressions <- map2( dots, syms(names(dots)), ~{
    quo( case_when(..condition.. ~ !!.x , TRUE ~ !!.y ) )
  })
  
  data %>%
    mutate( ..condition.. = !!condition ) %>%
    mutate( !!!expressions ) %>%
    select( -..condition..)
}

d <- tibble( x = 1:4, y = 1:4)
mutate_when( d, x < 3, 
  x = -x, 
  y = -y
)
#> # A tibble: 4 x 2
#>       x     y
#>   <int> <int>
#> 1    -1    -1
#> 2    -2    -2
#> 3     3     3
#> 4     4     4

Created on 2019-01-29 by the reprex package (v0.2.1.9000)

romainfrancois avatar Jan 29 '19 11:01 romainfrancois

Here are some approaches using data frame returns:

  • "manually":
library(dplyr)
d <- tibble( x = 1:4, y = 1:4)

# using data frame returns
d %>% 
  mutate({
    test <- x < 4
    x[test] <- -x[test]
    y[test] <- -y[test]
    data.frame(x = x, y = y)
  })
#> # A tibble: 4 x 2
#>       x     y
#>   <int> <int>
#> 1    -1    -1
#> 2    -2    -2
#> 3    -3    -3
#> 4     4     4

If we want to do the same thing to a selected set of columns, we can use across() with a bit of code around it:

# using across()
d %>% 
  mutate({
    test <- x < 4
    across(c(x, y), ~ {.x[test] <- -.x[test]; .x })
  })
#> # A tibble: 4 x 2
#>       x     y
#>   <int> <int>
#> 1    -1    -1
#> 2    -2    -2
#> 3    -3    -3
#> 4     4     4

And we can abstract further, e.g.

negate_if <- function(condition, cols) {
  across({{ cols }}, ~ {
    .x[condition] <- -.x[condition]
    .x
  })
}
d %>% 
  mutate(negate_if(x < 4, c(x, y)))
#> # A tibble: 4 x 2
#>       x     y
#>   <int> <int>
#> 1    -1    -1
#> 2    -2    -2
#> 3    -3    -3
#> 4     4     4

Now if we want to do arbitrary mutations, e.g. mutate_when(d, x < 4, x = -x, y = -y) we can do something like this, with some assumptions:

mutate_when <- function(.data, when, ...) {
  dots <- enquos(...)
  names <- names(dots)
  
  mutate(.data, {
    test <- {{ when }}
    
    changed <- data.frame(!!!dots)
    out <- across(all_of(names))
    # assuming `changed` and `out` have the same data frame type

    out[test, ] <- changed[test, ]
    out
  })
  
}
mutate_when(d, x < 4, x = -x, y = -y)
#> # A tibble: 4 x 2
#>       x     y
#>   <int> <int>
#> 1    -1    -1
#> 2    -2    -2
#> 3    -3    -3
#> 4     4     4

Created on 2021-04-21 by the reprex package (v0.3.0)

This all feels like something we can do with the tools available, perhaps in some other package?

romainfrancois avatar Apr 21 '21 13:04 romainfrancois

I just wanted to mention that mutate_when() is a function I would love to see incorporated. I posted this question on Stack Overflow, essentially asking if there was a simpler syntax for creating variables in a mutate() without a bunch of repetitive case_when() or ifelse() statements.

In my particular use case, I am creating code which creates output based upon a flow chart. My intended end users are less familiar with R, and I don't want them to get overwhelmed by the sheer volume of repetitive code. IMO, this mutate_when() function is intuitive in conjunction with pipe %>%.

Because I am naive and new, I thought something like this might work...

data %>%
group_by(g1, g2) %>%
mutate(
   across(
      where(
         condition
      )
   ),
var1 = "happy",
var2 = var22 / var19 + 3,
var3 = ifelse(
   var2 >= 3,
   TRUE,
   FALSE
   ),
...more var statements...
 )

Thanks @romainfrancois for posting this function.

k6adams avatar Nov 24 '21 17:11 k6adams

We gave a serious attempt at this in #6313 for dplyr 1.1.0, but ultimately decided not to add it in that release.

We aren't convinced that it is an operation that would be heavily used, as the main example usage we could come up with was replacing missing values, i.e.:

mutate(df, x = 0, .when = is.na(x))

We can't think of many examples beyond this one where this would be very useful.

Here are a few notes we should consider in the future when thinking about this:

  • Should groups be ignored when computing .when? To match SQL and data.table, it makes sense to ignore groups. I also can't think of any examples where a grouped application of .when makes sense. But this confused some people, especially because they might be passing a grouped data frame in, like group_by() %>% mutate(.when =). This becomes slightly less confusing in the context of .by, i.e. mutate(.when =, .by = ), where we'd just document that .when is applied first.

  • To be performant, we have to hook this into the data mask. You have to evaluate .when to get the locations where ... should be applied, and then only the columns referenced in ... should be sliced to the locations referenced by .when. We did this successfully in #6313, but it required a decent chunk of refactoring.

  • Would you ever want more than 1 .when call in a single mutate()? Some people proposed an API of mutate(when(is.na(x), x = 0), when(y == 4, a = 5, b = 6)). I don't personally think this would be that useful. If we did this, we might also consider making sequential when() calls work in a case-when like fashion.

  • Should .when allow if_any() and if_all() in the expression? They seem like they might be useful as a way to compute a complex .when expression based on multiple columns, but they are somewhat hard to implement. We didn't do that in #6313.
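
The evaluate, slice, and assign steps from the performance note can be sketched outside the data mask as follows. This is only an illustration under stated assumptions, not the implementation from #6313: mutate_when_sketch() is a hypothetical helper, it assumes an ungrouped data frame, it assumes that ... only overwrites columns that already exist, and it slices every column rather than only the ones referenced in ....

```r
library(dplyr)
library(rlang)
library(vctrs)

# Hypothetical helper, not part of dplyr.
# Assumes an ungrouped data frame and that `...` only overwrites
# columns that already exist in `.data`.
mutate_when_sketch <- function(.data, .when, ...) {
  # 1. evaluate the condition on the full data; which() drops NA
  loc <- which(eval_tidy(enquo(.when), data = .data))

  # 2. evaluate `...` on the selected rows only
  #    (simplification: all columns are sliced, not just the referenced ones)
  changed <- mutate(vec_slice(.data, loc), ..., .keep = "none")

  # 3. write the changed values back at the selected locations
  cols <- names(changed)
  .data[cols] <- vec_assign(.data[cols], loc, changed)
  .data
}

# Illustrative data
df <- tibble(x = c(1L, NA, 3L), y = c(10L, 20L, 30L))
res <- mutate_when_sketch(df, is.na(x), x = 0L, y = -y)
res
```

Because vec_assign() casts the new values to the type of the existing columns, the result keeps the original column types.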

We have to think about how useful this function is in light of the fact that we now have the ability to create type stable case_when() and case_match() calls. i.e. this handles the most common case of .when:

mutate(
  x = case_match(x, NA ~ 0, .ptype = x, .default = x)
)

And that could be wrapped into a replace_match(x, NA ~ 0) helper. Updating multiple columns based on 1 condition is also something if_else() can do now:

mutate(
  if_else(
    is.na(x) | is.na(y),
    tibble(x = 0, y = 0),
    tibble(x = x, y = y)
  )
)
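
A runnable sketch of both ideas, assuming a dplyr version where case_match() is available (1.1.0 or later). replace_match() here is a hypothetical helper written only to illustrate the suggestion above, and the data frame is made up.

```r
library(dplyr)

# Hypothetical helper named after the suggestion above; not a dplyr function.
# Unmatched values fall back to `x`, and the result keeps the type of `x`.
replace_match <- function(x, ...) {
  case_match(x, ..., .default = x, .ptype = x)
}

# Illustrative data
df <- tibble(x = c(1, NA, 3), y = c(1, 2, NA))

# replace missing values in a single column
res1 <- mutate(df, x = replace_match(x, NA ~ 0))

# one condition, two columns: the unnamed data frame returned by
# if_else() is unpacked into the columns `x` and `y`
res2 <- mutate(
  df,
  if_else(
    is.na(x) | is.na(y),
    tibble(x = 0, y = 0),
    tibble(x = x, y = y)
  )
)
```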

DavisVaughan avatar Sep 13 '22 18:09 DavisVaughan

A nice little alternative to mutate(.when = ) we could consider. Slightly simpler than if_else or case_when equivalents, also type and size stable by default, and takes integer positions for i rather than just logical ones. Also supports data frames for x and value if you want to use 1 condition for multiple columns, like the example just above.

replace_at <- function(x, i, value) {
  size <- vctrs::vec_size(x)
  
  i <- vctrs::vec_as_location(i = i, n = size, missing = "remove")
  
  # recycle up to size of x
  value <- vctrs::vec_recycle(value, size, x_arg = "value")
  
  # slice down to locations selected by i
  value <- vctrs::vec_slice(value, i)
  
  vctrs::vec_assign(x, i, value)
}

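# with a vector the same size as x
# (assuming `flights` is the data set from the nycflights13 package)
library(dplyr)
library(nycflights13)

mutate(
  flights,
  dep_delay = replace_at(dep_delay, dep_time > 500, -dep_delay)
)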

# with a value
mutate(
  flights,
  dep_delay = replace_at(dep_delay, dep_time > 500, NA)
)

# at integer locations in x
mutate(
  flights,
  dep_delay = replace_at(dep_delay, c(5, 3), NA)
)
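
The data frame case mentioned above is not shown in these examples, so here is a minimal sketch of it. It repeats the replace_at() definition so the block runs on its own, uses made-up data, and assumes dplyr 1.1.0 or later for pick().

```r
library(dplyr)
library(vctrs)

# Same definition as above, repeated so this block is self-contained
replace_at <- function(x, i, value) {
  size <- vec_size(x)
  i <- vec_as_location(i = i, n = size, missing = "remove")
  value <- vec_recycle(value, size, x_arg = "value")
  value <- vec_slice(value, i)
  vec_assign(x, i, value)
}

# Illustrative data
df <- tibble(x = c(1, NA, 3), y = c(NA, 2, 3), z = c("a", "b", "c"))

# one condition for two columns: `x` is a data frame built with pick(),
# `value` is a one-row data frame that gets recycled
res <- mutate(
  df,
  replace_at(pick(x, y), is.na(x) | is.na(y), tibble(x = 0, y = 0))
)
res
```

The unnamed data frame returned by replace_at() is unpacked by mutate(), so x and y are overwritten and z is left alone.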

DavisVaughan avatar Nov 02 '23 21:11 DavisVaughan

How about

mutate(flights, replace_at(dep_time > 500, dep_delay = -dep_delay))

with replace_at() returning a suitable data frame?

krlmlr avatar Nov 03 '23 05:11 krlmlr

That can't be written as a standalone function, IIUC. My hope was that we could figure out something that works outside of dplyr too.

DavisVaughan avatar Nov 03 '23 12:11 DavisVaughan

I'm thinking about something along the following lines:

options(conflicts.policy = list(warn = FALSE))
library(rlang)
library(vctrs)
library(tibble)
library(dplyr)
library(purrr)

replace_at <- function(where, ..., .envir = parent.frame()) {
  replacement <- tibble(...)

  orig_names <- names(replacement)
  orig_values <- as_tibble(map(set_names(orig_names), get0, .envir))

  vec_assign(orig_values, where, replacement)
}

foo <- 1:3
replace_at(2, foo = 5)
#> # A tibble: 3 × 1
#>     foo
#>   <int>
#> 1     1
#> 2     5
#> 3     3

tibble(foo) |>
  mutate(replace_at(2, foo = 5))
#> # A tibble: 3 × 1
#>     foo
#>   <int>
#> 1     1
#> 2     5
#> 3     3

Created on 2023-11-03 with reprex v2.0.2

krlmlr avatar Nov 03 '23 15:11 krlmlr

tidygraph now has a focus() verb that sort of does this

thomasp85 avatar Feb 26 '24 17:02 thomasp85