Allow finer control over variable name prefixes

Open joranE opened this issue 2 years ago • 4 comments

For steps that create derived variables, such as from a date, the new variables are prefixed with the date column name:

library(tidymodels)

d <- data.frame(y = rnorm(10),
                date = Sys.Date() + 1:10)

rec <- recipe(y ~ ., data = d) %>%
  step_date(date, features = c("dow", "decimal")) %>%
  step_holiday(date, holidays = c("LaborDay", "ChristmasDay"))

rec %>%
  prep() %>%
  bake(new_data = NULL)

Often you will want to do further processing on these variables, but selecting specific sets of them using tidyselect is challenging because they all get the same prefix, "date_". For example, we might want to use step_dummy() and step_bin2factor() on the day of week and holiday features. Selecting those while excluding numeric ones like decimal becomes pretty tedious, particularly when there are lots of date features.

joranE avatar Jul 01 '22 18:07 joranE

Hello @joranE 👋

You are right that the columns coming out of step_holiday() can be a little hard to match, as their names don't include any holiday-specific infix. They do, however, contain the names of the holidays you passed to holidays, so you can use contains() to match them.

As for the day of the week, that column is named date_dow, so you can match it with starts_with("date_dow"). The indicator columns that step_dummy() creates from it then carry the date_dow_ prefix.

library(tidymodels)

d <- data.frame(y = rnorm(10),
                date = Sys.Date() + 1:10)

my_holidays <- c("LaborDay", "ChristmasDay")

rec <- recipe(y ~ ., data = d) %>%
  step_date(date, features = c("dow", "decimal")) %>%
  step_holiday(date, holidays = my_holidays) %>%
  # the factor column is named date_dow; the date_dow_ prefix only exists after step_dummy()
  step_dummy(starts_with("date_dow")) %>%
  step_bin2factor(contains(my_holidays))

rec %>%
  prep() %>%
  bake(new_data = NULL)
#> # A tibble: 10 × 11
#>    date            y date_decimal date_LaborDay date_ChristmasDay date_dow_Mon
#>    <date>      <dbl>        <dbl> <fct>         <fct>                    <dbl>
#>  1 2022-07-02  0.438        2022. no            no                           0
#>  2 2022-07-03 -0.653        2023. no            no                           0
#>  3 2022-07-04 -0.648        2023. no            no                           1
#>  4 2022-07-05 -0.706        2023. no            no                           0
#>  5 2022-07-06  0.174        2023. no            no                           0
#>  6 2022-07-07  0.636        2023. no            no                           0
#>  7 2022-07-08 -1.88         2023. no            no                           0
#>  8 2022-07-09  0.936        2023. no            no                           0
#>  9 2022-07-10  1.54         2023. no            no                           0
#> 10 2022-07-11  0.928        2023. no            no                           1
#> # … with 5 more variables: date_dow_Tue <dbl>, date_dow_Wed <dbl>,
#> #   date_dow_Thu <dbl>, date_dow_Fri <dbl>, date_dow_Sat <dbl>
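
If contains() feels too loose (it would also match any unrelated column whose name happens to include a holiday name), a possible variation is to build the exact column names and pass them to all_of(), and to pick up the day-of-week factor by type rather than by name. This is only a sketch under the naming behavior shown above; holiday_cols is a helper name made up for this example.

```r
library(tidymodels)

d <- data.frame(y = rnorm(10),
                date = Sys.Date() + 1:10)

my_holidays <- c("LaborDay", "ChristmasDay")

# Hypothetical helper: exact names of the columns step_holiday() will create,
# assuming the "<variable name>_<holiday>" pattern shown in the output above
holiday_cols <- paste0("date_", my_holidays)

rec <- recipe(y ~ ., data = d) %>%
  step_date(date, features = c("dow", "decimal")) %>%
  step_holiday(date, holidays = my_holidays) %>%
  # date_dow is the only factor predictor at this point, so select it by type
  step_dummy(all_nominal_predictors()) %>%
  # exact-name match instead of a substring match
  step_bin2factor(all_of(holiday_cols))

baked <- rec %>%
  prep() %>%
  bake(new_data = NULL)

names(baked)
```

The order matters here: step_bin2factor() runs after step_dummy(), so the holiday factors it creates are not themselves turned into dummy variables.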

Also, because this reprex doesn't show it: the prefix here isn't date because we are working with dates, it is date because that was the name of the variable we applied the steps to. step_date() and step_holiday() use the variable name as the prefix.

library(tidymodels)

d <- data.frame(y = rnorm(10),
                new_crazy_name = Sys.Date() + 1:10)

recipe(y ~ ., data = d) %>%
  step_date(new_crazy_name, features = c("dow", "decimal")) %>%
  prep() %>%
  bake(new_data = NULL)
#> # A tibble: 10 × 4
#>    new_crazy_name       y new_crazy_name_dow new_crazy_name_decimal
#>    <date>           <dbl> <fct>                               <dbl>
#>  1 2022-07-02      1.02   Sat                                 2022.
#>  2 2022-07-03     -0.0539 Sun                                 2023.
#>  3 2022-07-04     -0.0120 Mon                                 2023.
#>  4 2022-07-05     -2.43   Tue                                 2023.
#>  5 2022-07-06      1.55   Wed                                 2023.
#>  6 2022-07-07     -0.0889 Thu                                 2023.
#>  7 2022-07-08     -0.735  Fri                                 2023.
#>  8 2022-07-09      0.208  Sat                                 2023.
#>  9 2022-07-10     -0.295  Sun                                 2023.
#> 10 2022-07-11     -0.131  Mon                                 2023.

EmilHvitfeldt avatar Jul 01 '22 18:07 EmilHvitfeldt

Thanks @EmilHvitfeldt for the quick response. I'm aware of all the behavior you point out, but I still maintain that allowing finer control over the prefixes would be a significant improvement. I'm specifically trying to avoid building a regex using the (potentially very many) names of all the holidays. Simply being able to force the prefix to be "holiday_" would make things much simpler.
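
Until something like that exists, one possible workaround follows from the fact that the prefix is just the variable name: copy the date column under the name you want as the prefix, and apply step_holiday() to the copy. This is a rough sketch rather than a recommended pattern; the copy name holiday is an arbitrary choice for this example, and the copy has to be removed again afterwards.

```r
library(tidymodels)

d <- data.frame(y = rnorm(10),
                date = Sys.Date() + 1:10)

my_holidays <- c("LaborDay", "ChristmasDay")

rec <- recipe(y ~ ., data = d) %>%
  # copy the date column; the copy's name becomes the prefix
  step_mutate(holiday = date) %>%
  step_holiday(holiday, holidays = my_holidays) %>%
  # every holiday indicator now starts with "holiday_", whatever its name
  step_bin2factor(starts_with("holiday_")) %>%
  # drop the temporary copy (it is named "holiday", so the selector above skips it)
  step_rm(holiday)

baked <- rec %>%
  prep() %>%
  bake(new_data = NULL)

names(baked)
```

This keeps the selector independent of the holiday list, at the cost of two extra steps per prefix.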

joranE avatar Jul 01 '22 19:07 joranE

Do you find this problem to be specific to step_dummy() and step_holiday() or is it a more general problem?

EmilHvitfeldt avatar Jul 01 '22 21:07 EmilHvitfeldt

I encounter it mostly with step_holiday(), but I made my suggestion a little more general because I wasn't sure if there would be a preference for adding something like this for just one step, or for doing something more general.

joranE avatar Jul 02 '22 23:07 joranE