Draft `case_match()` and `vec_case_match()`

Open DavisVaughan opened this issue 3 years ago • 1 comments

case_match() is a variant of case_when() that takes a primary input, .x, followed by a series of formulas. The LHS of each formula holds values to match against .x rather than a logical vector. The LHSs are turned into logical conditions by vec_in(), and the results are then passed on to vec_case_when().
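The two steps described above can be sketched in base R. This is only an illustration, not dplyr's implementation: sketch_case_match() is a hypothetical helper, base `%in%` stands in for vec_in(), and a first-match-wins loop stands in for vec_case_when().

```r
# Hypothetical helper, for illustration only (not dplyr's implementation).
sketch_case_match <- function(x, haystacks, values, default = NA_character_) {
  # Start from the default, recycled to the length of `x`.
  out <- rep(default, length.out = length(x))
  matched <- rep(FALSE, length(x))

  for (i in seq_along(haystacks)) {
    # Step 1: turn the values to match into a logical condition.
    # Base `%in%` is an assumption standing in for `vec_in()`.
    hit <- x %in% haystacks[[i]]

    # Step 2: the first matching case wins, as in `case_when()`.
    use <- hit & !matched
    out[use] <- values[[i]]
    matched <- matched | hit
  }

  out
}

num_vec <- c(1:4, NA)
res <- sketch_case_match(
  num_vec,
  haystacks = list(c(1, 3), c(2, 4), NA),
  values = list("o", "e", "m")
)
res
#> [1] "o" "e" "o" "e" "m"
```

Values that match no case keep the default, which is NA unless a replacement is supplied.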

It technically closes https://github.com/tidyverse/funs/issues/60

This would function as a direct successor to recode(), which is already in the questioning lifecycle stage and has an awkward interface for anything except character vectors (and even there it can be odd).

library(dplyr)

char_vec <- sample(c("a", "b", "c"), 10, replace = TRUE)

recode(char_vec, a = "Apple", b = "Banana")
#>  [1] "Banana" "Banana" "c"      "Banana" "c"      "Banana"
#>  [7] "Apple"  "Banana" "c"      "Banana"

case_match(
  char_vec,
  "a" ~ "Apple",
  "b" ~ "Banana",
  .default = char_vec
)
#>  [1] "Banana" "Banana" "c"      "Banana" "c"      "Banana"
#>  [7] "Apple"  "Banana" "c"      "Banana"


recode(char_vec, a = "Apple", b = "Banana", .default = NA_character_)
#>  [1] "Banana" "Banana" NA       "Banana" NA       "Banana"
#>  [7] "Apple"  "Banana" NA       "Banana"

case_match(
  char_vec,
  "a" ~ "Apple",
  "b" ~ "Banana"
)
#>  [1] "Banana" "Banana" NA       "Banana" NA       "Banana"
#>  [7] "Apple"  "Banana" NA       "Banana"


# `case_match()` is more general and works elegantly
# with more than just character vectors
num_vec <- c(1:4, NA)

recode(num_vec, `1` = "o", `2` = "e", `3` = "o", `4` = "e", .missing = "m")
#> [1] "o" "e" "o" "e" "m"

case_match(
  num_vec,
  c(1, 3) ~ "o",
  c(2, 4) ~ "e",
  NA ~ "m"
)
#> [1] "o" "e" "o" "e" "m"


# More of a programmatic usage
level_key <- c(a = "apple", b = "banana", c = "carrot")
recode(char_vec, !!!level_key)
#>  [1] "banana" "banana" "carrot" "banana" "carrot" "banana"
#>  [7] "apple"  "banana" "carrot" "banana"

vec_case_match(
  needles = char_vec,
  haystacks = as.list(names(level_key)),
  values = as.list(level_key),
  default = char_vec
)
#>  [1] "banana" "banana" "carrot" "banana" "carrot" "banana"
#>  [7] "apple"  "banana" "carrot" "banana"

I still think a replace_match() would be useful here, like:

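# type stable replacement wrapper around case_match()
# (vec_ptype() and vec_ptype_finalise() come from vctrs)
replace_match <- function(.x, ...) {
  # Lock the output type to the type of `.x`
  ptype <- vctrs::vec_ptype(.x)
  ptype <- vctrs::vec_ptype_finalise(ptype)
  # Unmatched values fall through to `.x` itself
  case_match(.x = .x, ..., .default = .x, .ptype = ptype)
}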

# very close to compactness of recode()
replace_match(
  char_vec,
  "a" ~ "Apple",
  "b" ~ "Banana"
)

# instead of
case_match(
  char_vec,
  "a" ~ "Apple",
  "b" ~ "Banana",
  .default = char_vec
)

replace_match() could also be used instead of a match-like version of na_if():

x <- c("a", "NA", "NaN", "no")
replace_match(x, c("NA", "NaN", "no") ~ NA)

In forcats, we could have fct_case_match() as a successor to recode_factor(), but its interface would probably be the other way around, like:

fct_case_match(
  .x,
  odd = c(1, 3),
  even = c(2, 4),
  ordered = FALSE
)

fct_case_when(
  odd = .x %in% c(1, 3),
  even = .x %in% c(2, 4),
  ordered = FALSE
)

DavisVaughan • Jul 12 '22 19:07

A better name for this might be case_switch(), i.e. it is a vectorized switch statement.

It just has the nice property of being able to collapse cases that share a right-hand side into one line:

case_switch(
  num_vec,
  1 ~ "o",
  3 ~ "o",
  2 ~ "e",
  4 ~ "e",
  NA ~ "m"
)

case_switch(
  num_vec,
  c(1, 3) ~ "o",
  c(2, 4) ~ "e",
  NA ~ "m"
)

The whole point of case_switch() is to mimic the SQL "simple" CASE statement. Our case_when() handles the "searched" CASE statement. data.table has also been considering something like this: https://github.com/Rdatatable/data.table/issues/4820
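For comparison, the two CASE styles can be written side by side in R. A small sketch, assuming dplyr 1.1.0 or later, where this function was released under the name case_match() rather than case_switch():

```r
library(dplyr)

num_vec <- c(1:4, NA)

# "Searched" CASE: every case is a full logical condition.
searched <- case_when(
  num_vec %in% c(1, 3) ~ "o",
  num_vec %in% c(2, 4) ~ "e",
  is.na(num_vec) ~ "m"
)

# "Simple" CASE: one input, and each case lists values to match against it.
simple <- case_match(
  num_vec,
  c(1, 3) ~ "o",
  c(2, 4) ~ "e",
  NA ~ "m"
)

# Both forms should give the same result.
identical(searched, simple)
```

The simple form names the input once and states NA as an ordinary value to match, while the searched form repeats the input in every condition and needs is.na() for the missing case.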

DavisVaughan • Jul 16 '22 20:07

Do we want to supersede recode() in this PR or a separate one?

I'll leave that for another PR; I want to get this one in.

DavisVaughan • Aug 18 '22 17:08