
[FR] Method to provide column to column mappings for steps that add/remove columns

Open mattwarkentin opened this issue 3 years ago • 3 comments

Hi,

Many {recipes} steps modify-in-place, so that the original column is modified in some way. A subset of steps will (possibly) add or remove columns (e.g. step_pca(), step_ns(), step_dummy(), etc.). I was wondering if it makes sense to have the tidy() methods (or some new generic) for these types of steps always return the mapping from current to new column names. I think this way one could traverse the recipe steps and determine the mapping from original variables/selectors to the final columns. This could be valuable for a few reasons; see https://github.com/tidymodels/parsnip/issues/595 as one example.

The tidy() method for step_dummy() is close to providing this information:

library(recipes)

recipe(Sepal.Length ~ Species, iris) %>% 
  step_dummy(Species) %>% 
  prep() %>% 
  tidy(number = 1)
#> # A tibble: 2 × 3
#>   terms   columns    id         
#>   <chr>   <chr>      <chr>      
#> 1 Species versicolor dummy_gGfxp
#> 2 Species virginica  dummy_gGfxp

As an aside, I might expect the columns values to be Species_versicolor and Species_virginica, as these are the actual new column names.
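To illustrate, here is a minimal sketch of how the new column names could be rebuilt from the tidy() output shown above. It assumes step_dummy()'s default naming (variable name and level joined by an underscore) and that the columns field holds bare level names, as in the output above; the new_column field is a hypothetical addition, not something tidy() returns.

```r
library(recipes)

rec <- recipe(Sepal.Length ~ Species, iris) %>%
  step_dummy(Species) %>%
  prep()

# tidy() gives the source variable (terms) and the level (columns)
map <- tidy(rec, number = 1)

# Assumption: default naming joins variable and level with "_".
# `new_column` is a hypothetical field added here for illustration.
map$new_column <- paste(map$terms, map$columns, sep = "_")

# Check the reconstructed names against the columns actually produced
all(map$new_column %in% names(bake(rec, new_data = NULL)))
```

This would break for a custom naming function or for levels that are not syntactically valid names, which is part of why having the step report the final names itself would be more robust.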

Skimming the available steps, it seems like the following steps can possibly affect columns by either addition/removal:

  • step_bs() / step_ns() / step_poly()
  • step_harmonic()
  • step_mutate() / step_mutate_at()
  • step_count()
  • step_date()
  • step_dummy() / step_dummy_extract() / step_dummy_multichoice()
  • step_holiday()
  • step_indicate_na()
  • step_regex()
  • step_interact()
  • step_select() / step_rm()
  • step_zv() / step_nzv()
  • step_corr() / step_lincomb()
  • step_profile()
  • step_intercept()
  • step_filter_missing()
  • step_ratio()
  • step_pca() / step_pls() / step_nnmf_*() / step_kpca_*() / step_isomap() / step_ica()
  • step_geodist()
  • step_depth() / step_classdist()

Okay, this is actually more than I thought, but I still think there's value in being able to traverse the steps to map original to final columns. Since tidy methods are already spoken for, maybe a new generic + methods would be valuable in this situation.

What do you think? I look forward to your thoughts.

mattwarkentin avatar Feb 23 '22 16:02 mattwarkentin

We've had some discussion on this at various times, such as in #604, #611, and others. So far, we have chosen not to keep around and expose a complete mapping from original column name to new column name for recipe transformations, because of efficiency and other practical reasons. (Instead, the selection is just executed.)

You mention sparse group lasso, which is helpful as a specific example of a use case. Have you run into wanting this kind of info in other contexts?

juliasilge avatar Mar 09 '22 16:03 juliasilge

Hmm, I don't have another motivating example off the top of my head. Thinking about how to implement SGL was the first time I realized we would need a column-to-column mapping to implement this model effectively. Maybe this is not motivating enough to build out this functionality in recipes.

mattwarkentin avatar Mar 09 '22 17:03 mattwarkentin

We've had this come up in the past and thought about it a lot. The problem is that it is really difficult to make that translation.

Here's an example: suppose you were to use some sparse PCA step to extract features. Some predictors might not be used at all so saying that input1 - input10 maps to pca1 - pca3 wouldn't be really correct. I'm not sure if that level of specificity is needed in all cases though.

That said, we should think about keeping track of inputs and outputs per step (although this isn't completely easy either). prep(object, log_changes = TRUE) does this but doesn't save the data:

library(tidymodels)
tidymodels_prefer()
theme_set(theme_bw())

data(ames, package = "modeldata")

library(dplyr)

ames <- mutate(ames, Sale_Price = log10(Sale_Price))

ames_rec <-
  recipe(
    Sale_Price ~ Longitude + Latitude + Neighborhood + Year_Built + Central_Air,
    data = ames
  ) %>%
  step_other(Neighborhood, threshold = 0.05) %>%
  step_dummy(all_nominal()) %>%
  step_interact(~ starts_with("Central_Air"):Year_Built) %>%
  step_ns(Longitude, Latitude, deg_free = 5)

res <- prep(ames_rec, log_changes = TRUE)
#> step_other (other_Xilol): same number of columns
#> 
#> step_dummy (dummy_qSvlS): 
#>  new (9): Neighborhood_College_Creek, Neighborhood_Old_Town, ...
#>  removed (2): Neighborhood, Central_Air
#> 
#> step_interact (interact_gwrkB): 
#>  new (1): Central_Air_Y_x_Year_Built
#> 
#> step_ns (ns_vlGB4): 
#>  new (10): Longitude_ns_1, Longitude_ns_2, Longitude_ns_3, ...
#>  removed (2): Longitude, Latitude

Created on 2022-03-10 by the reprex package (v2.0.1)

It doesn't go as far as saying that Neighborhood generated Neighborhood_College_Creek, Neighborhood_Old_Town, Neighborhood_Edwards, and so on. And, in the case of the interaction, it doesn't easily know which columns went into producing Central_Air_Y_x_Year_Built.
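For what it's worth, the per-step additions and removals can be captured as data with a rough workaround like the one below. This is only a sketch: column_changes() is a hypothetical helper, not part of recipes, and it works by truncating the recipe's internal steps list and re-prepping, which relies on internals and is slow for long recipes. Like the log output, it only records which columns appear or disappear at each step, not which inputs produced which outputs.

```r
library(recipes)

# Hypothetical helper (not part of recipes): re-prep the recipe with only its
# first i steps and diff the column names before and after step i.
column_changes <- function(rec) {
  before <- summary(rec)$variable
  out <- vector("list", length(rec$steps))
  for (i in seq_along(rec$steps)) {
    partial <- rec
    # Assumption: subsetting the internal `steps` list leaves a valid recipe
    partial$steps <- rec$steps[seq_len(i)]
    after <- names(bake(prep(partial), new_data = NULL))
    out[[i]] <- list(
      id = rec$steps[[i]]$id,
      added = setdiff(after, before),
      removed = setdiff(before, after)
    )
    before <- after
  }
  out
}

rec <- recipe(Sepal.Length ~ ., data = iris) %>%
  step_dummy(Species) %>%
  step_pca(all_numeric_predictors(), num_comp = 2)

changes <- column_changes(rec)
str(changes)
```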

One final detail is that we'd love to be able to do this since, when we are working on feature selection, we'd like to be able to aggregate the feature-level importance to a column-level importance. Perhaps there is a fancy way of doing it, but I suspect the least incorrect method for determining the mapping would require changes to each step's prep() code.

topepo avatar Mar 10 '22 14:03 topepo

Closing in favor of https://github.com/tidymodels/recipes/issues/1158

EmilHvitfeldt avatar May 26 '24 03:05 EmilHvitfeldt

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex https://reprex.tidyverse.org) and link to this issue.

github-actions[bot] avatar Jun 10 '24 00:06 github-actions[bot]