dplyr
Feature request: Pipe-supportive pass-through of unrecoded elements using `case_match()`?
`case_match()` is great to have as a more pipe-able alternative to `case_when()` that follows modern tidyverse approaches. However, it lacks an equivalent to one nice feature of its predecessor `recode()`: namely, that non-recoded elements could pass through into the output unchanged. While the help examples provide a workaround for this missing feature, I'm realizing it may not generalize well across use cases.
An example similar to what I encountered:
```r
# Some filenames:
filenames <-
  c("juniors_2022_OU.csv", "roster_2023_juniors_notre_dame.csv", "2022_SOU_juniors_roster.csv")

# Begin building tibble to (eventually) import & organize the content from these files:
tibble::tibble(
  # Import filenames (in practice, would be a call to list.files()):
  file_names = filenames,

  # Get key info like year and school out of filenames:
  # Year is straightforward:
  id_year = file_names |> stringr::str_extract("202[23]"),

  # School, however, needs initialisms expanded to minimize ambiguity:
  # recode() can do this inside a single set of piped instructions. It
  # fills in the original data when items are not recoded (& same
  # type as recoded output):
  id_school_recode =
    file_names |>
    # Remove non-school-name content (the dot is escaped so it matches literally):
    stringr::str_remove("\\.csv$") |>
    stringr::str_remove("_?202[23]_?") |>
    stringr::str_remove("_?juniors_?") |>
    stringr::str_remove("_?roster_?") |>
    # Recode initialisms:
    dplyr::recode(
      "OU" = "oklahoma",
      "SOU" = "southern_oregon"
    ),

  # Using case_match(), though, non-recoded elements become NA.
  id_school_casematch =
    file_names |>
    stringr::str_remove("\\.csv$") |>
    stringr::str_remove("_?202[23]_?") |>
    stringr::str_remove("_?juniors_?") |>
    stringr::str_remove("_?roster_?") |>
    dplyr::case_match(
      "OU" ~ "oklahoma",
      "SOU" ~ "southern_oregon"
    )
)
#> # A tibble: 3 × 4
#>   file_names                        id_year id_school_recode id_school_casematch
#>   <chr>                             <chr>   <chr>            <chr>
#> 1 juniors_2022_OU.csv               2022    oklahoma         oklahoma
#> 2 roster_2023_juniors_notre_dame.c… 2023    notre_dame       <NA>
#> 3 2022_SOU_juniors_roster.csv       2022    southern_oregon  southern_oregon
```
The help docs for `case_match()` have an example that uses the argument `.default = <varname>` (e.g. `species`) to fill the original data back in. This would work in `mutate()`, but not here in `tibble()`: the approach requires two arguments that specify the same column/variable, which `tibble()` forbids:
```r
tibble::tibble(
  # Locate and remove non-name content:
  id_school_casematch =
    filenames |>
    stringr::str_remove("\\.csv$") |>
    stringr::str_remove("_?202[23]_?") |>
    stringr::str_remove("_?juniors_?") |>
    stringr::str_remove("_?roster_?"),

  # Recode names:
  id_school_casematch =
    id_school_casematch |>
    dplyr::case_match(
      "OU" ~ "oklahoma",
      "SOU" ~ "southern_oregon",
      .default = id_school_casematch
    )
)
#> Error in `tibble::tibble()`:
#> ! Column name `id_school_casematch` must not be duplicated.
#> Use `.name_repair` to specify repair.
#> Caused by error in `repaired_names()`:
#> ! Names must be unique.
#> ✖ These names are duplicated:
#>   * "id_school_casematch" at locations 1 and 2.
```
(Also, even when using `mutate()` or standard variable creation with `<-`, specifying any new column/variable requires two separate arguments/calls. Admittedly this is my personal opinion, but that seems less concise and readable to me, and it doesn't fully capitalize on the pipe-ability that seems to be `case_match()`'s key offering.)
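For what it's worth, the closest I can get to a single-argument version today is to end the pipe with an anonymous function, so the piped value has a name that can be reused as `.default`. This is only a sketch, assuming R >= 4.1 (for `|>` and the `\(x)` shorthand) and dplyr >= 1.1.0; the object name `schools` and the column name `id_school` are just placeholders for this illustration:

```r
filenames <-
  c("juniors_2022_OU.csv", "roster_2023_juniors_notre_dame.csv", "2022_SOU_juniors_roster.csv")

schools <- tibble::tibble(
  file_names = filenames,
  id_school =
    file_names |>
    # Remove non-school-name content:
    stringr::str_remove("\\.csv$") |>
    stringr::str_remove("_?202[23]_?") |>
    stringr::str_remove("_?juniors_?") |>
    stringr::str_remove("_?roster_?") |>
    # Anonymous function: gives the piped value a name (x) so it can
    # serve as both the input and the .default of case_match():
    (\(x) dplyr::case_match(
      x,
      "OU" ~ "oklahoma",
      "SOU" ~ "southern_oregon",
      .default = x
    ))()
)

schools$id_school
```

It works, but the extra `(\(x) ...)()` wrapping is arguably harder to read than what `recode()` allowed.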
Could `case_match()` perhaps have an option/default added, e.g. `case_match(.x, ..., .default = .x)`, that would mirror `recode()`'s capabilities? I recognize you'd still need equivalents to `recode()`'s checks ensuring that new and pass-through content have the same type, but wouldn't that be manageable, given the vctrs underpinnings of this function?
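To make the request concrete, here is a rough user-level sketch of the behaviour I have in mind. `case_match_keep()` is a hypothetical helper name, not an existing dplyr function; it simply forwards to `case_match()` with the input reused as `.default`, so type checking is left to whatever `case_match()` already does:

```r
# Hypothetical wrapper (not part of dplyr): unmatched elements of .x
# pass through unchanged because .x is also supplied as .default.
case_match_keep <- function(.x, ...) {
  dplyr::case_match(.x, ..., .default = .x)
}

recoded <-
  c("OU", "notre_dame", "SOU") |>
  case_match_keep(
    "OU" ~ "oklahoma",
    "SOU" ~ "southern_oregon"
  )

recoded
#> [1] "oklahoma"        "notre_dame"      "southern_oregon"
```

As far as I can tell, a mismatch between the recoded values and the pass-through values (e.g. numeric replacements for a character input) already fails with a type error here, which suggests the same-type check would come largely for free.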