Feature request: Pipe-supportive pass-through of unrecoded elements using `case_match()`?

Open jmobrien opened this issue 1 year ago • 0 comments

case_match() is great to have as a more pipe-able alternative to case_when(), in line with modern tidyverse approaches. However, it lacks an equivalent to one nice feature of its predecessor recode(): letting non-recoded elements pass through into the output. The help examples provide a workaround for this missing feature, but I'm realizing it may not generalize well across use cases.

An example similar to what I encountered:

# Some filenames:
filenames <- 
  c("juniors_2022_OU.csv", "roster_2023_juniors_notre_dame.csv",  "2022_SOU_juniors_roster.csv")

# Begin building tibble to (eventually) import & organize the content from these files:
tibble::tibble(

  # Import filenames (in practice, would be a call to list.files() ):
  file_names = filenames,
  
  # Get key info like year and school out of filenames: 
  
  # Year is straightforward:
  id_year = file_names |> stringr::str_extract("202[23]"),
  
  # School, however, needs initialisms expanded to minimize ambiguity:
  
  # recode() can do this inside a single set of piped instructions. It
  # fills in the original data when items are not recoded (& same 
  # type as recoded output):

  id_school_recode = 
    file_names |> 
    # Remove non-school-name content:
      stringr::str_remove("\\.csv$") |> 
      stringr::str_remove("_?202[23]_?") |> 
      stringr::str_remove("_?juniors_?") |> 
      stringr::str_remove("_?roster_?") |>
    # Recode initialisms:
      dplyr::recode(
        "OU" = "oklahoma",
        "SOU" = "southern_oregon"
      ),
  
  # Using case_match(), though, non-recoded elements become NA.
  id_school_casematch = 
    file_names |> 
      stringr::str_remove("\\.csv$") |> 
      stringr::str_remove("_?202[23]_?") |> 
      stringr::str_remove("_?juniors_?") |> 
      stringr::str_remove("_?roster_?") |> 
      dplyr::case_match(
        "OU" ~ "oklahoma",
        "SOU" ~ "southern_oregon"
      )
)
#> # A tibble: 3 × 4
#>   file_names                        id_year id_school_recode id_school_casematch
#>   <chr>                             <chr>   <chr>            <chr>              
#> 1 juniors_2022_OU.csv               2022    oklahoma         oklahoma           
#> 2 roster_2023_juniors_notre_dame.c… 2023    notre_dame       <NA>               
#> 3 2022_SOU_juniors_roster.csv       2022    southern_oregon  southern_oregon
 

The help docs for case_match() have an example that uses the argument .default = <<varname>> (e.g. species) to fill the original data back in. This would work in mutate(), but not here in tibble(): the approach requires two arguments naming the same column/variable, which tibble() forbids:

tibble::tibble(

  # Locate and remove non-name content:
  id_school_casematch = 
    filenames |> 
    stringr::str_remove("\\.csv$") |> 
    stringr::str_remove("_?202[23]_?") |> 
    stringr::str_remove("_?juniors_?") |> 
    stringr::str_remove("_?roster_?"),

  # Recode names:
  id_school_casematch = 
    id_school_casematch |> 
    dplyr::case_match(
      "OU" ~ "oklahoma",
      "SOU" ~ "southern_oregon",
      .default = id_school_casematch
    )
)
#> Error in `tibble::tibble()`:
#> ! Column name `id_school_casematch` must not be duplicated.
#> Use `.name_repair` to specify repair.
#> Caused by error in `repaired_names()`:
#> ! Names must be unique.
#> ✖ These names are duplicated:
#>   * "id_school_casematch" at locations 1 and 2.
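
For comparison, here is a minimal sketch of the documented `.default` workaround in a setting where it does work: `mutate()` on a column that already exists. The tibble `schools` and its contents are made up for illustration; only `case_match()` and its `.default` argument come from dplyr.

```r
library(dplyr)

# Illustrative data: school names already extracted from the filenames
schools <- tibble::tibble(
  id_school = c("OU", "notre_dame", "SOU")
)

# mutate() allows overwriting an existing column, so the column can be
# named both as the input and as the .default fallback:
recoded <- schools |>
  mutate(
    id_school = case_match(
      id_school,
      "OU" ~ "oklahoma",
      "SOU" ~ "southern_oregon",
      .default = id_school
    )
  )
```

Note that the column has to be named three times here (target, input, and `.default`), and it has to exist before the recoding step.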

(Also, even when using mutate() or standard variable creation [<-], specifying any new column/variable requires 2 separate arguments/calls. This is admittedly my personal opinion, but that does seem less concise/readable to me, and it doesn't fully capitalize on the pipe-ability that seems to be case_match()'s key offering.)

Could case_match() perhaps have an option/default added, e.g. case_match(.x, ..., .default = .x), that would mirror recode()'s capabilities? I recognize you'd still need equivalents to recode()'s checks ensuring that new and pass-through content have the same type, but wouldn't that be manageable given the vctrs underpinnings of this function?
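
To make the request concrete, the behaviour could be approximated today with a small user-side wrapper. This is only a sketch: `case_match_keep()` is a hypothetical name, not a dplyr function, and it simply forwards the input as `.default`.

```r
# Hypothetical helper (not part of dplyr): recode matched values and
# let everything else pass through unchanged.
case_match_keep <- function(.x, ...) {
  dplyr::case_match(.x, ..., .default = .x)
}

school_names <- c("OU", "notre_dame", "SOU")

# Works in a single pipe, without naming the input a second time:
school_recoded <- school_names |>
  case_match_keep(
    "OU" ~ "oklahoma",
    "SOU" ~ "southern_oregon"
  )
```

Because the input is passed as `.default`, case_match() still has to find a common type between the replacement values and the pass-through values, so recoding a character vector to, say, numbers would be expected to error rather than silently coerce.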

jmobrien · Nov 09 '23 16:11