rSAFE
Variable roles in tidymodels recipe and workflow... are they respected by rSAFE?
Example (I am playing with the bicycle demand data from Kaggle):
bike_recipe <- recipe(count ~ . , data = bike_training) %>%
step_date(datetime, features = c("doy", "dow", "month", "year"), abbr = TRUE) %>%
update_role("datetime", new_role = "id_variable") %>%
step_rm("atemp")
This creates time features from the datetime index, after which datetime does not take part in modelling. I also removed the "atemp" variable altogether (temp and atemp were strongly correlated), so it does not take part in modelling either.
Next I run the explainer:
explainer <- explain_tidymodels(bike_final_fit, data = bike_all %>% select(-count), y = bike_all$count)
safe_extractor <- safe_extraction(explainer)
The safe extractor seems to ignore that datetime and atemp are absent from the modelling process and proposes:
Variable 'datetime' - selected intervals:
(-Inf, 2011-02-16 23:00:00]
(2011-02-16 23:00:00, 2011-06-17 23:00:00]
(2011-06-17 23:00:00, 2012-04-15 23:00:00]
(2012-04-15 23:00:00, 2012-07-08 23:00:00]
(2012-07-08 23:00:00, Inf)
Variable 'season' - selected intervals:
(-Inf, 3]
(3, Inf)
Variable 'holiday' - no transformation suggested.
Variable 'workingday' - no transformation suggested.
Variable 'weather' - selected intervals:
(-Inf, 1]
(1, Inf)
Variable 'temp' - selected intervals:
(-Inf, 12.3]
(12.3, 22.96]
(22.96, Inf)
Variable 'atemp' - selected intervals:
(-Inf, 24.24]
(24.24, Inf)
Variable 'humidity' - selected intervals:
(-Inf, 30]
(30, 48]
(48, 67]
(67, 84]
(84, Inf)
Variable 'windspeed' - selected intervals:
(-Inf, 7.0015]
(7.0015, Inf)
How can I tell rSAFE that these two variables (one is the time index, the other is removed when the recipe is baked) do not take part in modelling? I am attaching my quick and dirty workflow:
timeseries_modelling_xgboost_short.zip @agosiewska
I believe it is a matter of how DALEX treats the datasets in the explainer. Could you please prepare a reproducible example and share your session info?
I attached a rendered HTML file and the Rmd file with my analysis, with the session info at the bottom. timeseries_modelling_xgboost_short _2922_06_23a.zip
Is it OK to simply ignore, in the output, the variables that did not take part in modelling, and do the data transformation with the remaining variables as they are? Or do these excluded variables affect the break points of all the other variables?
My session info:
R version 4.1.3 (2022-03-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)
Matrix products: default
locale:
[1] LC_COLLATE=Polish_Poland.1250 LC_CTYPE=Polish_Poland.1250
[3] LC_MONETARY=Polish_Poland.1250 LC_NUMERIC=C
[5] LC_TIME=C
system code page: 65001
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] shiny_1.7.1
loaded via a namespace (and not attached):
[1] colorspace_2.0-3 ellipsis_0.3.2 class_7.3-20 timetk_2.8.0
[5] base64enc_0.1-3 fs_1.5.2 rstudioapi_0.13 listenv_0.8.0
[9] furrr_0.3.0 farver_2.1.0 dials_0.1.1 DT_0.23
[13] prodlim_2019.11.13 fansi_1.0.3 lubridate_1.8.0 codetools_0.2-18
[17] splines_4.1.3 R.methodsS3_1.8.1 doParallel_1.0.17 cachem_1.0.6
[21] knitr_1.39 polyclip_1.10-0 jsonlite_1.8.0 workflows_0.2.6
[25] pROC_1.18.0 R.oo_1.24.0 yardstick_0.0.9 ggforce_0.3.3
[29] tune_0.2.0 clipr_0.8.0 compiler_4.1.3 assertthat_0.2.1
[33] Matrix_1.4-1 fastmap_1.1.0 cli_3.3.0 later_1.3.0
[37] tweenr_1.0.2 htmltools_0.5.2 tools_4.1.3 gtable_0.3.0
[41] glue_1.6.2 dplyr_1.0.9 Rcpp_1.0.8.3 jquerylib_0.1.4
[45] styler_1.7.0 DiceDesign_1.9 vctrs_0.4.1 iterators_1.0.14
[49] parsnip_0.2.1 timeDate_3043.102 gower_1.0.0 xfun_0.31
[53] globals_0.15.0 mime_0.12 miniUI_0.1.1.1 lifecycle_1.0.1
[57] pacman_0.5.1 future_1.26.1 MASS_7.3-57 zoo_1.8-10
[61] scales_1.2.0 ipred_0.9-12 promises_1.2.0.1 parallel_4.1.3
[65] yaml_2.3.5 ggplot2_3.3.6 sass_0.4.1 rpart_4.1.16
[69] corrplot_0.92 foreach_1.5.2 lhs_1.1.5 hardhat_0.2.0
[73] lava_1.6.10 repr_1.1.4 rlang_1.0.2 pkgconfig_2.0.3
[77] rsample_0.1.1 evaluate_0.15 lattice_0.20-45 purrr_0.3.4
[81] recipes_0.2.0 htmlwidgets_1.5.4 tidyselect_1.1.2 parallelly_1.31.1
[85] plyr_1.8.7 magrittr_2.0.3 R6_2.5.1 generics_0.1.2
[89] DBI_1.1.2 pillar_1.7.0 withr_2.5.0 xts_0.12.1
[93] survival_3.3-1 DALEX_2.4.2 nnet_7.3-17 tibble_3.1.7
[97] future.apply_1.9.0 crayon_1.5.1 xgboost_1.6.0.1 utf8_1.2.2
[101] rmarkdown_2.14 grid_4.1.3 data.table_1.14.2 reprex_2.0.1
[105] digest_0.6.29 xtable_1.8-4 R.cache_0.15.0 tidyr_1.2.0
[109] httpuv_1.6.5 R.utils_2.11.0 GPfit_1.0-8 munsell_0.5.0
[113] finetune_0.2.0 skimr_2.1.4 bslib_0.3.1
Thank you. By reproducible example, I meant a toy example that is simple and fast to run. This .Rmd takes a long time to compute, and when I decreased the number of trees in xgboost to speed the script up, I got an error:
> bike_rf_rs <-
+ bike_rf_wkfl %>%
+ finetune::tune_sim_anneal(
+ resamples = bike_folds,
+ param_info = xgboost_set,
+ metrics = bike_metrics,
+ iter = 30,
+ initial = 10)
> Generating a set of 10 initial parameter results
√ Initialization complete
Error in UseMethod("mutate") :
no applicable method for 'mutate' applied to an object of class "NULL"
Anyway, if you pass the data frame with all columns (bike_all) to the DALEX explainer, SAFE will compute transformations for all of them.
However, as long as you don't use interactions in SAFE (I saw in the script that you don't), then you can ignore the transformations for columns not used by the model. They are calculated for each variable independently.
Variable filtering should perhaps be a feature in a future version of SAFE. At this point, I would suggest filtering out the unused variables before feeding the data into the explainer.
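A minimal sketch of that filtering step, in base R only. The helper name drop_unused_columns and the toy data frame bike_toy (made-up values standing in for bike_all) are assumptions for illustration and are not part of rSAFE or DALEX. It only shows how to build a reduced data frame; whether the fitted object can still predict from it depends on the model. A workflow whose recipe still refers to datetime and atemp may need those columns at prediction time.

```r
# Hypothetical helper (not part of rSAFE or DALEX): drop the columns that
# the model does not use before the data is handed to the explainer.
drop_unused_columns <- function(data, unused) {
  missing_cols <- setdiff(unused, names(data))
  if (length(missing_cols) > 0) {
    warning("Columns not found in data: ", paste(missing_cols, collapse = ", "))
  }
  data[, setdiff(names(data), unused), drop = FALSE]
}

# Toy stand-in for bike_all (made-up values, for illustration only).
bike_toy <- data.frame(
  datetime = as.POSIXct("2011-01-01 00:00:00", tz = "UTC") + 3600 * 0:3,
  temp     = c(9.8, 9.0, 9.0, 9.8),
  atemp    = c(14.4, 13.6, 13.6, 14.4),
  humidity = c(81, 80, 80, 75),
  count    = c(16, 40, 32, 13)
)

# Remove the target, the time index, and the variable removed by the recipe.
explainer_data <- drop_unused_columns(bike_toy, c("count", "datetime", "atemp"))
names(explainer_data)
```

The reduced data frame would then take the place of bike_all %>% select(-count) in the explain_tidymodels() call, so that safe_extraction() only sees the columns the model uses.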