recipes
Allow finer control over variable name prefixes
For steps that create derived variables, such as from a date, the new variables are prefixed with the date column name:
library(tidymodels)

d <- data.frame(y = rnorm(10),
                date = Sys.Date() + 1:10)

rec <- recipe(y ~ ., data = d) %>%
  step_date(date, features = c("dow", "decimal")) %>%
  step_holiday(date, holidays = c("LaborDay", "ChristmasDay"))

rec %>%
  prep() %>%
  bake(new_data = NULL)
Often you will want to do further processing on these variables, but selecting specific sets of them using tidyselect is challenging because they all get the same prefix, "date_". For example, we might want to use step_dummy() and step_bin2factor() on the day of week and holiday features, but selecting those while excluding numeric ones like decimal becomes pretty tedious, particularly when there are lots of date features.
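To make the difficulty concrete, here is a minimal sketch that lists the derived columns and shows that a prefix-based selector cannot tell them apart. The object names baked and derived are only illustrative.

```r
library(tidymodels)

d <- data.frame(y = rnorm(10), date = Sys.Date() + 1:10)

baked <- recipe(y ~ ., data = d) %>%
  step_date(date, features = c("dow", "decimal")) %>%
  step_holiday(date, holidays = c("LaborDay", "ChristmasDay")) %>%
  prep() %>%
  bake(new_data = NULL)

# Columns created by the two steps (everything except the original inputs)
derived <- setdiff(names(baked), c("y", "date"))
derived

# Every derived column shares the "date_" prefix, so starts_with("date_")
# picks up the numeric date_decimal along with the factor and holiday columns
all(startsWith(derived, "date_"))
baked %>% select(starts_with("date_")) %>% names()
```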
Hello @joranE 👋
You are right that the columns coming out of step_holiday() can be a little hard to match, as they don't share a common infix. They do, however, contain the names of the holidays you passed to holidays, so you can use contains() to match them.
As for the "day of the week", that column gets a _dow suffix (and its dummy columns a date_dow_ prefix), so you can match it with starts_with("date_dow").
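library(tidymodels)

d <- data.frame(y = rnorm(10),
                date = Sys.Date() + 1:10)
my_holidays <- c("LaborDay", "ChristmasDay")

rec <- recipe(y ~ ., data = d) %>%
  step_date(date, features = c("dow", "decimal")) %>%
  step_holiday(date, holidays = my_holidays) %>%
  # step_date() names the factor column date_dow; the trailing underscore
  # only appears on the dummy columns that step_dummy() creates from it
  step_dummy(starts_with("date_dow")) %>%
  # Holiday indicators are named <variable>_<holiday>, so match on the holiday names
  step_bin2factor(contains(my_holidays))

rec %>%
  prep() %>%
  bake(new_data = NULL)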
#> # A tibble: 10 × 11
#> date y date_decimal date_LaborDay date_ChristmasDay date_dow_Mon
#> <date> <dbl> <dbl> <fct> <fct> <dbl>
#> 1 2022-07-02 0.438 2022. no no 0
#> 2 2022-07-03 -0.653 2023. no no 0
#> 3 2022-07-04 -0.648 2023. no no 1
#> 4 2022-07-05 -0.706 2023. no no 0
#> 5 2022-07-06 0.174 2023. no no 0
#> 6 2022-07-07 0.636 2023. no no 0
#> 7 2022-07-08 -1.88 2023. no no 0
#> 8 2022-07-09 0.936 2023. no no 0
#> 9 2022-07-10 1.54 2023. no no 0
#> 10 2022-07-11 0.928 2023. no no 1
#> # … with 5 more variables: date_dow_Tue <dbl>, date_dow_Wed <dbl>,
#> # date_dow_Thu <dbl>, date_dow_Fri <dbl>, date_dow_Sat <dbl>
Also, because this reprex doesn't show it: the prefix here isn't date because we are working with dates, it is date because that was the name of the variable we applied the steps to. step_date() and step_holiday() use the variable name as the prefix.
library(tidymodels)
d <- data.frame(y = rnorm(10),
new_crazy_name = Sys.Date() + 1:10)
recipe(y ~ ., data = d) %>%
step_date(new_crazy_name, features = c("dow", "decimal")) %>%
prep() %>%
bake(new_data = NULL)
#> # A tibble: 10 × 4
#> new_crazy_name y new_crazy_name_dow new_crazy_name_decimal
#> <date> <dbl> <fct> <dbl>
#> 1 2022-07-02 1.02 Sat 2022.
#> 2 2022-07-03 -0.0539 Sun 2023.
#> 3 2022-07-04 -0.0120 Mon 2023.
#> 4 2022-07-05 -2.43 Tue 2023.
#> 5 2022-07-06 1.55 Wed 2023.
#> 6 2022-07-07 -0.0889 Thu 2023.
#> 7 2022-07-08 -0.735 Fri 2023.
#> 8 2022-07-09 0.208 Sat 2023.
#> 9 2022-07-10 -0.295 Sun 2023.
#> 10 2022-07-11 -0.131 Mon 2023.
Thanks @EmilHvitfeldt for the quick response. I'm aware of all the behavior you point out, but I still maintain that allowing finer control over the prefixes would be a significant improvement. I'm specifically trying to avoid building a regex using the (potentially very many) names of all the holidays. Simply being able to force the prefix to be "holiday_" would make things much simpler.
Do you find this problem to be specific to step_dummy() and step_holiday() or is it a more general problem?
I encounter it mostly with step_holiday(), but I made my suggestion a little more general because I wasn't sure whether there would be a preference for adding something like this to just one step or for doing something more general.
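In the meantime, one possible workaround that avoids a regex over the holiday names is to build the exact column names up front and then rename them to a "holiday_" prefix. This is only a sketch, not a recipes feature: it assumes step_holiday() names its outputs <variable>_<holiday>, as in the output shown earlier in the thread, and the helper vector holiday_cols is hypothetical.

```r
library(tidymodels)

d <- data.frame(y = rnorm(10), date = Sys.Date() + 1:10)
my_holidays <- c("LaborDay", "ChristmasDay")

# Assumption: step_holiday() output columns are named "<variable>_<holiday>"
holiday_cols <- paste0("date_", my_holidays)

rec <- recipe(y ~ ., data = d) %>%
  step_date(date, features = c("dow", "decimal")) %>%
  step_holiday(date, holidays = my_holidays) %>%
  # Select the holiday indicators by exact name instead of by pattern
  step_bin2factor(all_of(holiday_cols)) %>%
  # Swap the variable-name prefix for a dedicated "holiday_" prefix
  step_rename_at(all_of(holiday_cols),
                 fn = function(x) sub("^date_", "holiday_", x))

baked <- rec %>%
  prep() %>%
  bake(new_data = NULL)

names(baked)
```

After the rename, later steps can pick up every holiday column with starts_with("holiday_"), however many holidays are in my_holidays.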