recipes
allow `role` argument to specify multiple roles at once
Is it possible for derived variables to have multiple roles?
This doesn't work:
recipe(HHV ~ ., data = biomass) %>%
step_mutate(carbon_sqr = carbon ^ 2, role = "new") %>%
add_role(has_role("new"), new_role = "new2") %>%
prep() %>%
summary()
You also can't do this:
recipe(HHV ~ ., data = biomass) %>%
step_mutate(carbon_sqr = carbon ^ 2, role = c("new", "new2")) %>%
prep() %>%
summary()
But this works just fine:
recipe(HHV ~ ., data = biomass) %>%
step_mutate(carbon_sqr = carbon ^ 2, role = "new") %>%
step_interact(terms = ~ has_role("new"):carbon) %>%
prep() %>%
summary()
How can I add multiple roles to derived variables?
Adding to the bug. Here's another example.
Assume you want to convert one or more integer variables into dummies because you want to test for differences between the individual levels. This is easily achieved:
data(mtcars)
library(recipes)
library(dplyr)
rec <- recipe(mtcars, hp ~ cyl + mpg) %>%
step_integer(cyl) %>%
step_num2factor(cyl, levels = c('4', '6', '8')) %>%
step_dummy(cyl)
Now if we have more predictors, we may want to standardize the numeric predictors but leave the dummy variables untouched so that they keep their interpretation. The dummy variables are still predictors, so we want to keep that role, add an additional dummy role, and use it to subset the predictors.
rec %>%
add_role(starts_with('cyl'),
new_role = 'dummy') %>%
step_normalize(all_numeric(), -has_role('dummy')) %>%
prep() %>%
juice()
# A tibble: 32 x 4
mpg hp cyl_X6 cyl_X8
<dbl> <dbl> <dbl> <dbl>
1 0.151 110 1.86 -0.868
2 0.151 110 1.86 -0.868
3 0.450 93 -0.521 -0.868
4 0.217 110 1.86 -0.868
5 -0.231 175 -0.521 1.12
6 -0.330 105 1.86 -0.868
7 -0.961 245 -0.521 1.12
8 0.715 62 -0.521 -0.868
9 0.450 95 -0.521 -0.868
10 -0.148 123 1.86 -0.868
# ... with 22 more rows
Now that was unexpected: the dummy variables were standardized despite being deselected with -has_role('dummy').
The problem again is that add_role doesn't catch the new dummy columns. One could avoid this either by adding the role directly in step_dummy or by using starts_with to deselect within step_normalize, for example:
rec %>%
step_normalize(all_numeric(), - starts_with('cyl')) %>%
prep() %>%
juice()
but from a user perspective this is less readable, especially with multiple variables, where you need multiple starts_with calls within a single step_*.
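One further option, not proposed in this thread and offered only as a sketch: reorder the steps so that normalization runs before the dummies are created. It assumes the variable is already a factor by the time step_normalize() runs, so all_numeric() does not select it. Note that all_numeric() also picks up the outcome hp here.

```r
library(recipes)

# Sketch: normalize first, then create dummies, so no role bookkeeping is needed.
rec_ordered <- recipe(hp ~ cyl + mpg, data = mtcars) %>%
  step_integer(cyl) %>%
  step_num2factor(cyl, levels = c("4", "6", "8")) %>%
  # cyl is a factor at this point, so all_numeric() only selects mpg and hp
  step_normalize(all_numeric()) %>%
  step_dummy(cyl)

baked <- bake(prep(rec_ordered), new_data = NULL)
baked
```

This only helps when the ordering of steps is free to change; it does not address the underlying role handling.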
Adding what I found: derived variables don't appear in the summary(recipe) output, so when trying to use add_role() or update_role() there's nothing to work with.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(recipes)
#>
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#>
#> step
rec <-
recipe( ~ ., data = iris) %>%
step_mutate(
dbl_width = Sepal.Width * 2,
half_length = Sepal.Length / 2
)
# the new, derived variables aren't present
rec %>%
summary()
#> # A tibble: 5 x 4
#> variable type role source
#> <chr> <chr> <chr> <chr>
#> 1 Sepal.Length numeric predictor original
#> 2 Sepal.Width numeric predictor original
#> 3 Petal.Length numeric predictor original
#> 4 Petal.Width numeric predictor original
#> 5 Species nominal predictor original
# can't manage the roles for derived variables
rec %>%
add_role(dbl_width, half_length, new_role="special_role")
#> Error: Can't subset columns that don't exist.
#> x Column `dbl_width` doesn't exist.
rec %>%
update_role(dbl_width, half_length, new_role="special_role")
#> Error: Can't subset columns that don't exist.
#> x Column `dbl_width` doesn't exist.
Created on 2021-08-04 by the reprex package (v2.0.0)
Here is a short overview of why the different things don't work:
recipe(HHV ~ ., data = biomass) %>%
step_mutate(carbon_sqr = carbon ^ 2, role = "new") %>%
add_role(has_role("new"), new_role = "new2") %>%
prep() %>%
summary()
This one is hidden right now, but add_role() currently only works on variables in the original data set:
https://github.com/tidymodels/recipes/blob/ab2405a0393bba06d9d7a52b4dbba6659a6dfcbd/R/roles.R#L132
recipe(HHV ~ ., data = biomass) %>%
step_mutate(carbon_sqr = carbon ^ 2, role = c("new", "new2")) %>%
prep() %>%
summary()
Right now this appears to be a bug with a bad error message. add_role() and update_role() both enforce that the new_role argument should be of length 1. The same messaging should be carried over to the role argument in the steps.
I don't see much harm in letting add_role(), update_role() and the role arguments take vectors of any length. It might require a bigger rewrite.
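Until something like that exists, a minimal sketch of giving a variable several roles with the current API: since new_role has to be of length 1, call add_role() once per role. As noted above, this only applies to variables in the original data set. The role names "new" and "new2" are just the placeholders used earlier in this thread.

```r
library(recipes)

# One add_role() call per role; add_role() adds a role without
# overwriting the roles the variable already has.
rec <- recipe(hp ~ cyl + mpg, data = mtcars) %>%
  add_role(mpg, new_role = "new") %>%
  add_role(mpg, new_role = "new2")

# summary() lists one row per variable/role combination
roles <- summary(rec)
roles[roles$variable == "mpg", c("variable", "role")]
```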
@Bijaelo your issue is the same as the first one here. In cases like this I would recommend setting the role of the resulting variables in step_dummy() instead of setting it afterwards with add_role():
library(recipes)
recipe(mtcars, hp ~ cyl + mpg) %>%
step_integer(cyl) %>%
step_num2factor(cyl, levels = c('4', '6', '8')) %>%
step_dummy(cyl, role = "dummy") %>%
step_normalize(all_numeric(), - has_role('dummy')) %>%
prep() %>%
bake(new_data = NULL)
#> # A tibble: 32 × 4
#> mpg hp cyl_X6 cyl_X8
#> <dbl> <dbl> <dbl> <dbl>
#> 1 0.151 -0.535 1 0
#> 2 0.151 -0.535 1 0
#> 3 0.450 -0.783 0 0
#> 4 0.217 -0.535 1 0
#> 5 -0.231 0.413 0 1
#> 6 -0.330 -0.608 1 0
#> 7 -0.961 1.43 0 1
#> 8 0.715 -1.24 0 0
#> 9 0.450 -0.754 0 0
#> 10 -0.148 -0.345 1 0
#> # … with 22 more rows
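To double-check which columns ended up with the "dummy" role, one option is to inspect summary() on the prepped recipe. This is only a sketch that reuses the recipe above without the normalization step.

```r
library(recipes)

rec_prep <- recipe(hp ~ cyl + mpg, data = mtcars) %>%
  step_integer(cyl) %>%
  step_num2factor(cyl, levels = c("4", "6", "8")) %>%
  step_dummy(cyl, role = "dummy") %>%
  prep()

# The derived columns only show up once the recipe has been prepped
roles <- summary(rec_prep)
roles[roles$role == "dummy", c("variable", "role", "source")]
```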
The last point by @samyishak is a little off: summary() on the un-prepped recipe doesn't know yet which variables will be derived. Once the recipe is prepped you can see them:
library(recipes)
rec <-
recipe( ~ ., data = iris) %>%
step_mutate(
dbl_width = Sepal.Width * 2,
half_length = Sepal.Length / 2
)
rec_prep <- prep(rec)
summary(rec)
#> # A tibble: 5 × 4
#> variable type role source
#> <chr> <chr> <chr> <chr>
#> 1 Sepal.Length numeric predictor original
#> 2 Sepal.Width numeric predictor original
#> 3 Petal.Length numeric predictor original
#> 4 Petal.Width numeric predictor original
#> 5 Species nominal predictor original
summary(rec_prep)
#> # A tibble: 7 × 4
#> variable type role source
#> <chr> <chr> <chr> <chr>
#> 1 Sepal.Length numeric predictor original
#> 2 Sepal.Width numeric predictor original
#> 3 Petal.Length numeric predictor original
#> 4 Petal.Width numeric predictor original
#> 5 Species nominal predictor original
#> 6 dbl_width numeric predictor derived
#> 7 half_length numeric predictor derived