Variable roles in tidymodels recipe and workflow... are they respected by rSAFE?

Open jacekkotowski opened this issue 2 years ago • 3 comments

Example (I am playing with bicycle demand data from Kaggle):

bike_recipe <- recipe(count ~ ., data = bike_training) %>%
  step_date(datetime, features = c("doy", "dow", "month", "year"), abbr = TRUE) %>%
  update_role("datetime", new_role = "id_variable") %>%
  step_rm("atemp")

This creates time features from the datetime index, after which datetime takes no part in modelling. I also removed the "atemp" variable altogether (temp and atemp were strongly correlated), so it takes no part in modelling either.
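One way to check which columns the recipe treats as predictors is to prep it on a small data frame and inspect the roles. The sketch below is only an illustration: the `toy` data frame and its values are made up (they are not the Kaggle data), and it assumes the recipes package is installed.

```r
library(recipes)

# Hypothetical stand-in for the bike data (all values are made up)
set.seed(42)
n <- 48
toy <- data.frame(
  datetime = as.POSIXct("2011-01-01 00:00:00", tz = "UTC") + 3600 * (0:(n - 1)),
  temp     = runif(n, 5, 30),
  count    = rpois(n, 100)
)
toy$atemp <- toy$temp + rnorm(n, sd = 0.5)

# Same steps as in the recipe above, applied to the toy data
toy_recipe <- recipe(count ~ ., data = toy) %>%
  step_date(datetime, features = c("doy", "dow", "month", "year"), abbr = TRUE) %>%
  update_role(datetime, new_role = "id_variable") %>%
  step_rm(atemp)

prepped <- prep(toy_recipe, training = toy)

# Variable roles after all steps have been estimated
roles <- summary(prepped)
print(roles[, c("variable", "role")])

# Columns that remain after baking the training data
baked <- bake(prepped, new_data = NULL)
print(names(baked))
```

Columns whose role is not `predictor` (here `datetime`) stay in the baked data but are not passed to the model when the recipe is used inside a workflow.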

Next I run the explainer:

explainer <- explain_tidymodels(bike_final_fit, data = bike_all %>% select(-count), y = bike_all$count)
safe_extractor <- safe_extraction(explainer)

The SAFE extractor seems to ignore the absence of datetime and atemp from the modelling process and proposes:

 Variable 'datetime' - selected intervals:
	(-Inf, 2011-02-16 23:00:00]
 	(2011-02-16 23:00:00, 2011-06-17 23:00:00]
 	(2011-06-17 23:00:00, 2012-04-15 23:00:00]
 	(2012-04-15 23:00:00, 2012-07-08 23:00:00]
 	(2012-07-08 23:00:00, Inf)
Variable 'season' - selected intervals:
	(-Inf, 3]
 	(3, Inf)
Variable 'holiday' - no transformation suggested.
Variable 'workingday' - no transformation suggested.
Variable 'weather' - selected intervals:
	(-Inf, 1]
 	(1, Inf)
Variable 'temp' - selected intervals:
	(-Inf, 12.3]
 	(12.3, 22.96]
 	(22.96, Inf)
Variable 'atemp' - selected intervals:
	(-Inf, 24.24]
 	(24.24, Inf)
Variable 'humidity' - selected intervals:
	(-Inf, 30]
 	(30, 48]
 	(48, 67]
 	(67, 84]
 	(84, Inf)
Variable 'windspeed' - selected intervals:
	(-Inf, 7.0015]
 	(7.0015, Inf)

How do I tell rSAFE that these two variables (one is a time index, the other was removed in the bake) take no part in modelling? I am attaching my quick and dirty workflow:

timeseries_modelling_xgboost_short.zip @agosiewska

jacekkotowski, Jun 22 '22 10:06

I believe it is a matter of how DALEX treats the datasets in the explainer. Could you please prepare a reproducible example and share your session info?

agosiewska, Jun 22 '22 14:06

I attached a rendered HTML file and an Rmd file with my analysis, with session info at the bottom. timeseries_modelling_xgboost_short _2922_06_23a.zip

Is it OK to simply ignore, in the output, the variables that did not take part in modelling, and apply the data transformations to the remaining variables as they are? Or do these excluded variables affect the break points of the other variables?

My session info:

R version 4.1.3 (2022-03-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)

Matrix products: default

locale:
[1] LC_COLLATE=Polish_Poland.1250  LC_CTYPE=Polish_Poland.1250   
[3] LC_MONETARY=Polish_Poland.1250 LC_NUMERIC=C                  
[5] LC_TIME=C                     
system code page: 65001

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] shiny_1.7.1

loaded via a namespace (and not attached):
  [1] colorspace_2.0-3   ellipsis_0.3.2     class_7.3-20       timetk_2.8.0      
  [5] base64enc_0.1-3    fs_1.5.2           rstudioapi_0.13    listenv_0.8.0     
  [9] furrr_0.3.0        farver_2.1.0       dials_0.1.1        DT_0.23           
 [13] prodlim_2019.11.13 fansi_1.0.3        lubridate_1.8.0    codetools_0.2-18  
 [17] splines_4.1.3      R.methodsS3_1.8.1  doParallel_1.0.17  cachem_1.0.6      
 [21] knitr_1.39         polyclip_1.10-0    jsonlite_1.8.0     workflows_0.2.6   
 [25] pROC_1.18.0        R.oo_1.24.0        yardstick_0.0.9    ggforce_0.3.3     
 [29] tune_0.2.0         clipr_0.8.0        compiler_4.1.3     assertthat_0.2.1  
 [33] Matrix_1.4-1       fastmap_1.1.0      cli_3.3.0          later_1.3.0       
 [37] tweenr_1.0.2       htmltools_0.5.2    tools_4.1.3        gtable_0.3.0      
 [41] glue_1.6.2         dplyr_1.0.9        Rcpp_1.0.8.3       jquerylib_0.1.4   
 [45] styler_1.7.0       DiceDesign_1.9     vctrs_0.4.1        iterators_1.0.14  
 [49] parsnip_0.2.1      timeDate_3043.102  gower_1.0.0        xfun_0.31         
 [53] globals_0.15.0     mime_0.12          miniUI_0.1.1.1     lifecycle_1.0.1   
 [57] pacman_0.5.1       future_1.26.1      MASS_7.3-57        zoo_1.8-10        
 [61] scales_1.2.0       ipred_0.9-12       promises_1.2.0.1   parallel_4.1.3    
 [65] yaml_2.3.5         ggplot2_3.3.6      sass_0.4.1         rpart_4.1.16      
 [69] corrplot_0.92      foreach_1.5.2      lhs_1.1.5          hardhat_0.2.0     
 [73] lava_1.6.10        repr_1.1.4         rlang_1.0.2        pkgconfig_2.0.3   
 [77] rsample_0.1.1      evaluate_0.15      lattice_0.20-45    purrr_0.3.4       
 [81] recipes_0.2.0      htmlwidgets_1.5.4  tidyselect_1.1.2   parallelly_1.31.1 
 [85] plyr_1.8.7         magrittr_2.0.3     R6_2.5.1           generics_0.1.2    
 [89] DBI_1.1.2          pillar_1.7.0       withr_2.5.0        xts_0.12.1        
 [93] survival_3.3-1     DALEX_2.4.2        nnet_7.3-17        tibble_3.1.7      
 [97] future.apply_1.9.0 crayon_1.5.1       xgboost_1.6.0.1    utf8_1.2.2        
[101] rmarkdown_2.14     grid_4.1.3         data.table_1.14.2  reprex_2.0.1      
[105] digest_0.6.29      xtable_1.8-4       R.cache_0.15.0     tidyr_1.2.0       
[109] httpuv_1.6.5       R.utils_2.11.0     GPfit_1.0-8        munsell_0.5.0     
[113] finetune_0.2.0     skimr_2.1.4        bslib_0.3.1  

jacekkotowski, Jun 23 '22 08:06

Thank you. By reproducible example I meant a toy example that is simple and fast to run. This .Rmd takes a long time to compute, and when I decreased the number of trees in xgboost to speed the script up, I got an error:

> bike_rf_rs <-
+   bike_rf_wkfl %>%
+     finetune::tune_sim_anneal(
+     resamples = bike_folds,
+    param_info = xgboost_set,
+       metrics = bike_metrics,
+          iter = 30,
+       initial = 10)

>  Generating a set of 10 initial parameter results
√ Initialization complete

Error in UseMethod("mutate") : 
  no applicable method for 'mutate' applied to an object of class "NULL"

Anyway, if you pass the data frame with all columns (bike_all) to the DALEX explainer, SAFE will compute transformations for all of them. However, as long as you don't use interactions in SAFE (I saw in the script that you don't), you can ignore the transformations for columns not used by the model, because they are calculated for each variable independently.

Variable filtering should perhaps be a feature in a future version of SAFE. For now, I would suggest filtering out unused variables before feeding the data into the explainer.
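A minimal sketch of that filtering, on made-up data and with a plain `lm` instead of the xgboost workflow from the issue. The names `toy`, `id`, `x_dup`, `used_vars` and `explainer_data` are hypothetical. Note that with a tidymodels workflow whose recipe still reads a column at prediction time (for example `datetime` in `step_date()`), dropping that column from the explainer data would likely break prediction, so this toy case avoids the problem by using a model fitted directly on the kept columns.

```r
# Toy data: `id` and `x_dup` stand in for columns the model never uses
set.seed(1)
toy <- data.frame(
  id    = 1:200,
  x     = runif(200),
  x_dup = runif(200),
  z     = runif(200)
)
toy$y <- 2 * toy$x + toy$z^2 + rnorm(200, sd = 0.1)

model <- lm(y ~ x + z, data = toy)

# Keep only the predictors that appear in the model formula
used_vars <- all.vars(delete.response(terms(model)))
explainer_data <- toy[, used_vars, drop = FALSE]

# The DALEX / rSAFE calls run only if both packages are installed
if (requireNamespace("DALEX", quietly = TRUE) &&
    requireNamespace("rSAFE", quietly = TRUE)) {
  explainer <- DALEX::explain(model, data = explainer_data, y = toy$y,
                              verbose = FALSE)
  safe_extractor <- rSAFE::safe_extraction(explainer)
  print(safe_extractor)  # transformations are proposed for x and z only
}
```

Because the explainer only stores `x` and `z`, SAFE has no `id` or `x_dup` column to propose break points for.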

agosiewska, Jun 23 '22 16:06