
allow `role` argument to specify multiple roles at once

Open gacolitti opened this issue 5 years ago • 3 comments

Is it possible for derived variables to have multiple roles?

This doesn't work:

recipe(HHV ~ ., data = biomass) %>% 
  step_mutate(carbon_sqr = carbon ^ 2, role = "new") %>% 
  add_role(has_role("new"), new_role = "new2") %>% 
  prep() %>% 
  summary()

You also can't do this:

recipe(HHV ~ ., data = biomass) %>% 
  step_mutate(carbon_sqr = carbon ^ 2, role = c("new", "new2")) %>% 
  prep() %>% 
  summary()

But this works just fine:

recipe(HHV ~ ., data = biomass) %>% 
  step_mutate(carbon_sqr = carbon ^ 2, role = "new") %>% 
  step_interact(terms = ~ has_role("new"):carbon) %>% 
  prep() %>% 
  summary()

How can I add multiple roles to derived variables?

gacolitti avatar Dec 26 '19 23:12 gacolitti

Adding to the bug. Here's another example.

Assume you want to convert one or more integer variables into dummies because you hypothesize that the individual levels differ. This is easily achieved.

data(mtcars)
library(recipes)
library(dplyr)
rec <- recipe(mtcars, hp ~ cyl + mpg) %>%
  step_integer(cyl) %>%
  step_num2factor(cyl, levels = c('4', '6', '8')) %>% 
  step_dummy(cyl)

Now, if we have more predictors, we may want to standardize the numeric ones. To keep the interpretation of our dummy variables, we want to avoid standardizing those. The dummy variables are still predictors, so we keep that role, add an additional dummy role, and use it to subset the predictors.

rec %>% 
  add_role(starts_with('cyl'), 
           new_role = 'dummy') %>%
  step_normalize(all_numeric(), -has_role('dummy')) %>% 
  prep() %>%
  juice()

# A tibble: 32 x 4
      mpg    hp cyl_X6 cyl_X8
    <dbl> <dbl>  <dbl>  <dbl>
 1  0.151   110  1.86  -0.868
 2  0.151   110  1.86  -0.868
 3  0.450    93 -0.521 -0.868
 4  0.217   110  1.86  -0.868
 5 -0.231   175 -0.521  1.12 
 6 -0.330   105  1.86  -0.868
 7 -0.961   245 -0.521  1.12 
 8  0.715    62 -0.521 -0.868
 9  0.450    95 -0.521 -0.868
10 -0.148   123  1.86  -0.868
# ... with 22 more rows

Now that was unexpected. The dummy variables were standardized despite being deselected with -has_role('dummy').

The problem, again, is that add_role doesn't catch the new dummy columns. One could avoid this either by adding the role directly in step_dummy or by using starts_with to de-select within step_normalize, for example:

rec %>% 
  step_normalize(all_numeric(), - starts_with('cyl')) %>%
  prep() %>%
  juice()

but from a user perspective this is less readable, especially if you have multiple variables and therefore need multiple starts_with calls within a single step_*.

Bijaelo avatar Sep 20 '20 13:09 Bijaelo

Adding what I found. Derived variables don't appear in the summary(recipe) output, so when trying to use add_role() or update_role() there's nothing to work with:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(recipes)
#> 
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#> 
#>     step

rec <-
  recipe( ~ ., data = iris) %>%
  step_mutate(
    dbl_width = Sepal.Width * 2,
    half_length = Sepal.Length / 2
  )

# the new, derived variables aren't present
rec %>% 
  summary()
#> # A tibble: 5 x 4
#>   variable     type    role      source  
#>   <chr>        <chr>   <chr>     <chr>   
#> 1 Sepal.Length numeric predictor original
#> 2 Sepal.Width  numeric predictor original
#> 3 Petal.Length numeric predictor original
#> 4 Petal.Width  numeric predictor original
#> 5 Species      nominal predictor original

# can't manage the roles for derived variables
rec %>% 
  add_role(dbl_width, half_length, new_role="special_role")
#> Error: Can't subset columns that don't exist.
#> x Column `dbl_width` doesn't exist.
rec %>% 
  update_role(dbl_width, half_length, new_role="special_role")
#> Error: Can't subset columns that don't exist.
#> x Column `dbl_width` doesn't exist.

Created on 2021-08-04 by the reprex package (v2.0.0)

samyishak avatar Aug 04 '21 23:08 samyishak

Here is a short overview of why the different things don't work.

recipe(HHV ~ ., data = biomass) %>% 
  step_mutate(carbon_sqr = carbon ^ 2, role = "new") %>% 
  add_role(has_role("new"), new_role = "new2") %>% 
  prep() %>% 
  summary()

This one is hidden right now, but add_role() currently only works on variables in the original data set: https://github.com/tidymodels/recipes/blob/ab2405a0393bba06d9d7a52b4dbba6659a6dfcbd/R/roles.R#L132

recipe(HHV ~ ., data = biomass) %>% 
  step_mutate(carbon_sqr = carbon ^ 2, role = c("new", "new2")) %>% 
  prep() %>% 
  summary()

Right now this appears to be a bug with a bad error message. add_role() and update_role() both enforce that new_role should be of length 1. That same messaging should be carried over to the role argument in the steps.

I don't see much harm in letting add_role(), update_role() and the role arguments take vectors of any length. It might require a bigger rewrite.
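Until vectors are accepted, one way to give a variable several roles is to chain add_role() calls, one role per call. The following is a minimal sketch rather than a fix for this issue: it uses iris instead of biomass, the role names "new" and "new2" are placeholders, and it only applies to columns that are in the original data, so it does not help with derived variables.

```r
library(recipes)

# Each add_role() call appends exactly one role (new_role must be length 1),
# so two extra roles need two calls.
rec <- recipe(~ ., data = iris) %>%
  add_role(Sepal.Width, new_role = "new") %>%
  add_role(Sepal.Width, new_role = "new2")

# summary() lists one row per variable/role pair
roles <- summary(rec)
roles[roles$variable == "Sepal.Width", c("variable", "role")]
```

Sepal.Width keeps its original predictor role and gains the two added ones, so selectors such as has_role("new2") can pick it up in later steps.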

@Bijaelo, your issue is the same as the first one here. In cases such as these, I would recommend setting the role of the resulting variables in step_dummy() instead of setting them afterwards with add_role().

library(recipes)

recipe(mtcars, hp ~ cyl + mpg) %>%
  step_integer(cyl) %>%
  step_num2factor(cyl, levels = c('4', '6', '8')) %>% 
  step_dummy(cyl, role = "dummy") %>% 
  step_normalize(all_numeric(), - has_role('dummy')) %>% 
  prep() %>%
  bake(new_data = NULL)
#> # A tibble: 32 × 4
#>       mpg     hp cyl_X6 cyl_X8
#>     <dbl>  <dbl>  <dbl>  <dbl>
#>  1  0.151 -0.535      1      0
#>  2  0.151 -0.535      1      0
#>  3  0.450 -0.783      0      0
#>  4  0.217 -0.535      1      0
#>  5 -0.231  0.413      0      1
#>  6 -0.330 -0.608      1      0
#>  7 -0.961  1.43       0      1
#>  8  0.715 -1.24       0      0
#>  9  0.450 -0.754      0      0
#> 10 -0.148 -0.345      1      0
#> # … with 22 more rows
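Another option that needs no extra role at all, sketched here on the assumption that the order of the steps is free to change: normalize before the dummy variables are created. Steps are applied in sequence, so all_numeric() cannot match indicator columns that do not exist yet. The names rec_ordered and baked are only illustrative.

```r
library(recipes)

rec_ordered <- recipe(mtcars, hp ~ cyl + mpg) %>%
  step_integer(cyl) %>%
  step_num2factor(cyl, levels = c("4", "6", "8")) %>%
  # cyl is a factor at this point, so all_numeric() only matches mpg and hp
  step_normalize(all_numeric()) %>%
  # the 0/1 indicator columns are created after normalization
  step_dummy(cyl)

baked <- bake(prep(rec_ordered), new_data = NULL)
```

This avoids both the has_role() and the starts_with() de-selection, at the cost of making the result depend on step order.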

The last point by @samyishak is a little off: summary() on the un-prepped recipe doesn't know yet which variables will be derived. Once the recipe is prepped, you can see them.

library(recipes)

rec <-
  recipe( ~ ., data = iris) %>%
  step_mutate(
    dbl_width = Sepal.Width * 2,
    half_length = Sepal.Length / 2
  ) 

rec_prep <- prep(rec)

summary(rec)
#> # A tibble: 5 × 4
#>   variable     type    role      source  
#>   <chr>        <chr>   <chr>     <chr>   
#> 1 Sepal.Length numeric predictor original
#> 2 Sepal.Width  numeric predictor original
#> 3 Petal.Length numeric predictor original
#> 4 Petal.Width  numeric predictor original
#> 5 Species      nominal predictor original
summary(rec_prep)
#> # A tibble: 7 × 4
#>   variable     type    role      source  
#>   <chr>        <chr>   <chr>     <chr>   
#> 1 Sepal.Length numeric predictor original
#> 2 Sepal.Width  numeric predictor original
#> 3 Petal.Length numeric predictor original
#> 4 Petal.Width  numeric predictor original
#> 5 Species      nominal predictor original
#> 6 dbl_width    numeric predictor derived 
#> 7 half_length  numeric predictor derived

EmilHvitfeldt avatar Apr 14 '22 19:04 EmilHvitfeldt