forcats

Add function that creates factor in order of case_when matches

Open dchiu911 opened this issue 3 years ago • 5 comments

A common workflow of mine is to map one vector to another using some (possibly complex) conditions, then coerce the result to a factor whose level order matches the order of the conditions in dplyr::case_when(). It would be helpful to have a wrapper that creates the factor without having to specify the levels manually. Currently, I do something like this:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

set.seed(2022)
x <- sample(
  c("low", "intermediate", "high"),
  prob = c(0.5, 0.2, 0.3),
  size = 100,
  replace = TRUE
)
z <- rbinom(
  n = 100,
  size = 100,
  prob = 0.3
)
y <- case_when(
  x == "intermediate" | (x == "low" & z < 30) ~ "B",
  x == "low" ~ "A",
  x == "high" ~ "C",
  TRUE ~ NA_character_
) %>%
  factor(levels = c("B", "A", "C"))
str(y)
#>  Factor w/ 3 levels "B","A","C": 1 3 2 3 2 3 2 1 1 3 ...

Created on 2022-02-01 by the reprex package (v2.0.1)

Can we add a function that makes y into a factor with the level order the same as specified in the case_when()? For example,

y <- fct_case(
  x == "intermediate" | (x == "low" & z < 30) ~ "B",
  x == "low" ~ "A",
  x == "high" ~ "C",
  TRUE ~ NA_character_
)
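For illustration, a minimal sketch of how such a wrapper could be built on top of case_when(). The name fct_case() is the hypothetical function proposed above, not an existing forcats or dplyr function, and the sketch assumes every right-hand side is a literal string (or NA); it does not attempt to handle right-hand sides that depend on data values.

```r
library(dplyr)

# Hypothetical helper: not part of forcats or dplyr.
# Assumes each formula's RHS is a literal string or an NA constant.
fct_case <- function(...) {
  cases <- list(...)                    # formulas keep their calling environment
  rhs <- lapply(cases, rlang::f_rhs)    # unevaluated right-hand sides
  # Keep only literal, non-missing strings as candidate levels
  is_level <- vapply(
    rhs,
    function(r) is.character(r) && length(r) == 1 && !is.na(r),
    logical(1)
  )
  lvls <- unique(unlist(rhs[is_level])) # level order = order of appearance
  factor(case_when(!!!cases), levels = lvls)
}

x <- c("low", "intermediate", "high", "low", NA)
z <- c(10, 50, 50, 50, 50)
y <- fct_case(
  x == "intermediate" | (x == "low" & z < 30) ~ "B",
  x == "low" ~ "A",
  x == "high" ~ "C",
  TRUE ~ NA_character_
)
levels(y)
```

Because the levels are read from the formulas rather than from the result, a level is kept even when no element matches its condition.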

dchiu911 avatar Feb 01 '22 19:02 dchiu911

I think we'd need to make the syntax more restrictive than case_when(), because the RHS of a case_when() formula can itself use data values, and reasoning through how those values should interact across conditions seems hard.

Since we'd want to restrict each expression to a single character level, we could put that level on the LHS of =, something like:

something(
  "B" = x == "intermediate" | (x == "low" & z < 30),
  "A" = x == "low",
  "C" = x == "high",
)

But I don't know if any existing tidyverse function uses similar syntax.
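To make the named-condition idea concrete, here is a rough base-R sketch. The name fct_when() is purely hypothetical, and the sketch assumes that all conditions are logical vectors of the same length, that the first matching condition wins (as in case_when()), and that unmatched or NA conditions yield NA.

```r
# Hypothetical helper illustrating the `"level" = condition` syntax.
# Assumes equal-length logical conditions; first match wins.
fct_when <- function(...) {
  conds <- list(...)
  lvls <- names(conds)
  stopifnot(
    !is.null(lvls), all(nzchar(lvls)), !anyDuplicated(lvls),
    length(unique(lengths(conds))) == 1
  )
  out <- rep(NA_integer_, length(conds[[1]]))
  for (i in seq_along(conds)) {
    # Fill only positions that are still unmatched; NA conditions count as FALSE
    hit <- is.na(out) & conds[[i]] %in% TRUE
    out[hit] <- i
  }
  # Level order follows the order in which the conditions were supplied
  factor(lvls[out], levels = lvls)
}

x <- c("low", "intermediate", "high", "low", NA)
z <- c(10, 50, 50, 50, 50)
y <- fct_when(
  "B" = x == "intermediate" | (x == "low" & z < 30),
  "A" = x == "low",
  "C" = x == "high"
)
levels(y)
```

Since each level is an argument name, it can only ever be a single string, which sidesteps the question of data-dependent right-hand sides.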

hadley avatar May 20 '22 00:05 hadley

I do think removing the use of ~ would make it more consistent, since the case_when() syntax is quite unusual.

dchiu911 avatar May 20 '22 04:05 dchiu911

> But I don't know if any existing tidyverse function uses similar syntax.

FWIW this is basically how fct_recode() works (the name represents the new level, the value is the old level), so it wouldn't be unheard of to let the name represent the new level and the value be the logical condition.
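For reference, a small example of the existing fct_recode() convention, where each argument has the form new_level = "old_level". The data here are made up for illustration.

```r
library(forcats)

f <- factor(c("apple", "bear", "apple"))
# Argument name is the new level, argument value is the old level
g <- fct_recode(f, fruit = "apple", animal = "bear")
levels(g)
```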

DavisVaughan avatar Aug 17 '22 16:08 DavisVaughan

Will wait until lower level functions are exposed by vctrs.

hadley avatar Jan 09 '23 21:01 hadley

I think it would be convenient to solve this within case_when() itself, with something like this:

set.seed(2022)
x <- sample(
  c("low", "intermediate", "high"),
  prob = c(0.5, 0.2, 0.3),
  size = 100,
  replace = TRUE
)
z <- rbinom(
  n = 100,
  size = 100,
  prob = 0.3
)
y <- case_when(
  x == "intermediate" | (x == "low" & z < 30) ~ "B",
  x == "low" ~ "A",
  x == "high" ~ "C",
  TRUE ~ NA_character_,
  .ptype = "factor"
)

brianmsm avatar Jan 05 '24 19:01 brianmsm