Use racing methods to tune xgboost models and predict home runs | Julia Silge
Models like xgboost have many tuning hyperparameters, but racing methods can help identify parameter combinations that are not performing well.
I thought my computer was fast but tune_race_anova() showed me otherwise.
Hi Julia,
When I run the tune_race_anova() function I get the following error:
Creating pre-processing data to finalize unknown parameter: mtry
Racing will minimize the mn_log_loss metric.
Resamples are analyzed in a random order.
Error: There were no valid metrics for the ANOVA model.
Run `rlang::last_error()` to see where the error occurred.
In addition: Warning message:
All models failed. See the `.notes` column.
What am I doing wrong? I've followed the tutorial step by step so far, so I suspect there is an issue with dependencies here.
@JunaidMB Hmmmmm, there are two things that come to mind: I know I was using the development version of dials from GitHub and there was a very recent version of finetune released to CRAN. I'd check to make sure you have both of those installed. I really have got to start adding session info to my blog posts. 😬
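For example, a quick way to check what you have installed and update from CRAN (a sketch; the exact versions needed aren't pinned here):
packageVersion("dials")
packageVersion("finetune")
install.packages(c("dials", "finetune"))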
Hi Julia, it's a very useful tutorial. However, I wanted to point out that you've missed a "scales::" in the second code chunk, just before "percent" in the fourth line. :)
Hi @julia and @JunaidMB,
I also experienced the exact same error in my workflow set tuning, and I don't understand why.
wflwset_setup <- workflow_set(
  preproc = list(
    normalized = recipe_normal,
    rm_corr = recipe_corr,
    rm_unbalan = recipe_remove,
    impute_mean = recipe_impute_mean,
    impute_knn = recipe_impute_knn
  ),
  models = list(
    lm = lm_model.wf,
    glm = glm_model.wf,
    spline = spline_model.wf,
    knn = knn_model.wf,
    svm = svm_model.wf,
    RF = rf_model.wf,
    XGB = xgb_model.wf,
    CatB = catboost_model.wf
  ),
  cross = TRUE
)
set.seed(579)
if (exists("wflwset_tune_results_cv")) rm("wflwset_tune_results_cv")
# Initializing parallel processing
doParallel::registerDoParallel()
# Workflowset tuning
wflwset_tune_results_cv <- wflwset_setup %>%
  workflowsets::workflow_map(
    fn = "tune_race_anova",
    resamples = cv.fold.wf,
    grid = 15,
    metrics = multi.metric.wf,
    verbose = TRUE
  )
# Terminating parallel session
doParallel::stopImplicitCluster()
i No tuning parameters. `fit_resamples()` will be attempted
i 1 of 35 resampling: normalized_lm
Warning: All models failed. See the `.notes` column.
x 1 of 35 resampling: normalized_lm failed with preprocessor 1/1, model 1/1: Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...): 0 (non-NA) cases
i 2 of 35 tuning: normalized_glm
Warning: All models failed. See the `.notes` column.
x 2 of 35 tuning: normalized_glm failed with: There were no valid metrics for the ANOVA model.
i No tuning parameters. `fit_resamples()` will be attempted
i 3 of 35 resampling: normalized_knn
Warning: All models failed. See the `.notes` column.
x 3 of 35 resampling: normalized_knn failed with preprocessor 1/1, model 1/1: Error in best[1, 2]: subscript out of bounds
i No tuning parameters. `fit_resamples()` will be attempted
i 4 of 35 resampling: normalized_svm
Warning: All models failed. See the `.notes` column.
x 4 of 35 resampling: normalized_svm failed with preprocessor 1/1, model 1/1: Error in if (any(co)) {: missing value where TRUE/FALSE needed
i 5 of 35 tuning: normalized_RF
i Creating pre-processing data to finalize unknown parameter: mtry
@kamaulindhardt It looks like your models are failing to fit in the first place (which is why you can't then do an ANOVA model on the results). I would try fitting some of those workflows individually outside of the workflow set, to debug which one is the problem and why.
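For example, something like this (a sketch using the names from your code; your_training_data is a placeholder for your training set):
# Pull a single workflow out of the set and fit it on its own to
# surface the underlying error:
library(workflowsets)
wf_lm <- extract_workflow(wflwset_setup, id = "normalized_lm")
fit(wf_lm, data = your_training_data)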
Thank you @juliasilge,
I am trying to fit the individual models separately, but I find it difficult to interpret the errors. For example, with my knn model:
"Error: Problem with `mutate()` column `.row`. ℹ `.row = orig_rows`. ℹ `.row` must be size 37 or 1, not 40."
What does that mean? I cannot find information online.
From the recipe:
base_recipe <-
  recipe(formula = logRR ~ ., data = af.train.wf) %>%
  update_role(Latitude,
              Longitude,
              new_role = "sample ID") %>%
  step_zv(all_predictors(), skip = TRUE) %>%             # remove any columns with a single unique value
  step_normalize(all_numeric_predictors(), skip = TRUE)  # normalize numeric data: mean of zero, standard deviation of one

filter_recipe <-
  base_recipe %>%
  step_corr(all_numeric_predictors(), threshold = 0.8, skip = TRUE)
Model spec
knn_spec <-
  nearest_neighbor(neighbors = tune(),
                   weight_func = tune()) %>%
  set_engine("kknn") %>%
  set_mode("regression")
Model tuning with tune_grid()
knn_fit <- tune_grid(knn_spec,
                     preprocessor = filter_recipe,
                     resamples = cv.fold.wf,
                     metrics = multi.metric.wf)
knn_fit
Error(s):
Warning: This tuning result has notes. Example notes on model fitting include:
preprocessor 1/1, model 5/10 (predictions): Error: Problem with `mutate()` column `.row`.
ℹ `.row = orig_rows`.
ℹ `.row` must be size 37 or 1, not 40.
preprocessor 1/1, model 1/10 (predictions): Error: Problem with `mutate()` column `.row`.
ℹ `.row = orig_rows`.
ℹ `.row` must be size 37 or 1, not 40.
preprocessor 1/1, model 2/10 (predictions): Error: Problem with `mutate()` column `.row`.
ℹ `.row = orig_rows`.
ℹ `.row` must be size 39 or 1, not 40.
# Tuning results
# 10-fold cross-validation
It's hard to say without a reprex, but I am guessing your problem is using skip = TRUE here, where you are not applying some steps to new data. You can check out this discussion of what skipping steps for new data means.
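As a minimal sketch of that behavior (using mtcars as a stand-in dataset): a step with skip = TRUE is applied when the recipe is prepped on the training data, but not when baking new data:
library(recipes)
rec <- recipe(mpg ~ ., data = mtcars) %>%
  step_normalize(all_numeric_predictors(), skip = TRUE) %>%
  prep()

bake(rec, new_data = NULL)    # training data: the skipped step WAS applied
bake(rec, new_data = mtcars)  # "new" data: the skipped step is NOT applied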
I now added an imputation step, step_impute_mean(all_predictors()), in the recipe, and that seems to work:
base_recipe <-
recipe(formula = logRR ~ ., data = af.train.wf) %>%
step_impute_mean(all_predictors())
update_role(Latitude,
Longitude,
new_role = "sample ID") %>%
step_zv(all_predictors(), skip = TRUE) %>% # remove any columns with a single unique value
step_normalize(all_numeric_predictors(), skip = TRUE) # normalize numeric data: standard deviation of one and a mean of zero.
filter_recipe <-
base_recipe %>%
step_corr(all_numeric_predictors(), threshold = 0.8, skip = TRUE)
How come the random forest and kNN models cannot cope with missing values? I thought at least RF was designed for dealing with missing values. On the other hand, my XGBoost models don't seem to be bothered(?)
Thank you!
@kamaulindhardt Again, it's hard to say without a reprex, but now it looks to me like you aren't using anything past step_impute_mean() because you don't have a %>% at the end of that line. This model is probably succeeding because you are no longer trying to use the skip = TRUE steps; using skip = TRUE for steps like step_normalize() is a pretty bad idea. I suggest reading through the sections I linked above to understand what skipping steps for new data means.
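For illustration, here is a sketch of that recipe with the missing %>% restored so every step is actually chained, and without skip = TRUE (update_role() moved first so Latitude/Longitude are no longer predictors when imputing):
base_recipe <-
  recipe(formula = logRR ~ ., data = af.train.wf) %>%
  update_role(Latitude, Longitude, new_role = "sample ID") %>%
  step_impute_mean(all_predictors()) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_numeric_predictors())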
I also recommend creating a small, self-contained reproducible example to ask for help. Truly, people are just guessing if you don't do this. I know that creating a reprex can feel like a lot of work, but we have found that it is really the only way for someone who needs help online to reliably get the right answer. If you ask a question online without a reprex, think of yourself as just blindly flailing in the dark; when you ask a question with a reprex that demonstrates your problem, think of yourself as having given people the tools to help you.
Hi Julia, I would like to know how to unfold the folds created with vfold_cv(), to better inspect which samples are in each fold. Thanks
@data-datum You might find it helpful to use the tidy() method, or to check out this article on handling rset objects for examples on how to call analysis(). Or you can manually get the indices out; they are in in_id:
library(tidyverse)
library(rsample)
car_folds <- vfold_cv(mtcars, v = 3)
map(car_folds$splits, "in_id")
#> [[1]]
#> [1] 1 2 3 5 9 11 12 14 15 16 17 18 21 22 23 24 25 26 27 31 32
#>
#> [[2]]
#> [1] 1 2 4 6 7 8 9 10 11 12 13 14 17 19 20 22 23 28 29 30 32
#>
#> [[3]]
#> [1] 3 4 5 6 7 8 10 13 15 16 18 19 20 21 24 25 26 27 28 29 30 31
Created on 2021-10-28 by the reprex package (v2.0.1)
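Continuing that sketch: tidy() gives a long data frame of fold membership, and analysis()/assessment() return the actual rows of a single split:
tidy(car_folds)                    # which row is in which fold
analysis(car_folds$splits[[1]])    # training (analysis) rows for fold 1
assessment(car_folds$splits[[1]])  # held-out (assessment) rows for fold 1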
I have the same issue when using racing to tune a few models:
race_ctrl <-
  control_race(
    save_pred = TRUE,
    parallel_over = "everything",
    save_workflow = TRUE,
    verbose = TRUE,
    pkgs = c("stringr")
  )

race_results_time <-
  system.time(
    race_results <-
      all_workflows %>%
      workflow_map(
        "tune_race_anova",
        seed = 1503,
        resamples = vfolds,
        grid = 25,
        verbose = TRUE,
        control = race_ctrl
      )
  )
i 1 of 8 tuning: pca_norm_recipe_RF
i Creating pre-processing data to finalize unknown parameter: mtry
*** recursive gc invocation
Warning: stack imbalance in 'lapply', 154 then 152
x 1 of 8 tuning: pca_norm_recipe_RF failed with: There were no valid metrics for the ANOVA model.
i 2 of 8 tuning: pca_norm_recipe_boosting
It is only successful when I switch from racing to the standard tune_grid():
grid_ctrl <-
  control_grid(
    save_pred = TRUE,
    parallel_over = "everything",
    save_workflow = TRUE,
    pkgs = c("stringr")
  )

full_results_time <-
  system.time(
    grid_results <-
      all_workflows %>%
      workflow_map(
        seed = 1503,
        resamples = vfolds,
        grid = 25,
        control = grid_ctrl,
        verbose = TRUE
      )
  )
i 1 of 8 tuning: pca_norm_recipe_RF
i Creating pre-processing data to finalize unknown parameter: mtry
v 1 of 8 tuning: pca_norm_recipe_RF (21m 29.6s)
i 2 of 8 tuning: pca_norm_recipe_boosting
Wow @tsengj I have not seen a garbage collection error from these functions. Can you create a reprex (a minimal reproducible example) for this and post it on the finetune repo? The goal of a reprex is to make it easier for us to recreate your problem so that we can understand it and/or fix it.
If you've never heard of a reprex before, you may want to start with the tidyverse.org help page. You may already have reprex installed (it comes with the tidyverse package), but if not you can install it with:
install.packages("reprex")
Thanks! 🙌
@juliasilge It turns out that removing the line pkgs = c('stringr') from control_race() fixed the error above. The stringr package was used in a simple step_mutate() recipe step, postcode = as.numeric(str_sub(suburb, -4, -1)). Excluding that from the recipe resolved the issue. I haven't had the opportunity to raise a reprex in the finetune repo. It doesn't appear as though finetune supports loading packages yet. I use parallel processing (doParallel).
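For reference, an untested sketch of the same transformation in base R, which would avoid needing to ship stringr to the parallel workers at all (the suburb value here is hypothetical):
suburb <- "Carlton 3053"                                      # hypothetical value
as.numeric(substr(suburb, nchar(suburb) - 3, nchar(suburb)))  # 3053, same as str_sub(suburb, -4, -1)
# i.e. inside the recipe:
# step_mutate(postcode = as.numeric(substr(suburb, nchar(suburb) - 3, nchar(suburb))))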
Hi Julia,
Thank you for your valued contributions!
When I run the tune_race_anova() function on a workflow containing an XGBoost model, I also get the following error: min_preproc_xgboost failed with: There were no valid metrics for the ANOVA model. All the other models are OK. I've been able to run XGBoost on the same machine using the approach below, and it worked fine then.
I have a hard time debugging this one, do you have any ideas at what might cause this error?
I've made a reprex using the diamonds dataset and session info (hope it's done correctly as this is my first reprex).
Any help is much appreciated.
library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#> method from
#> required_pkgs.model_spec parsnip
library(tidyverse)
library(here)
#> here() starts at /private/var/folders/pw/540tsbnx2r3gtmk605nm1fsc0000gn/T/RtmpTkNsqu/reprex-381939e26d6f-sand-viper
library(baguette)
library(rules)
#>
#> Attaching package: 'rules'
#> The following object is masked from 'package:dials':
#>
#> max_rules
library(finetune)
library(dials)
options(tidymodels.dark = TRUE)
doParallel::registerDoParallel()
carat <- diamonds %>%
select(price, cut, carat, clarity)
## Build models
set.seed(123)
carat_split <- initial_split(carat, strata = price)
carat_train <- training(carat_split)
carat_test <- testing(carat_split)
set.seed(234)
carat_folds <- vfold_cv(carat_train, strata = price)
carat_folds
#> # 10-fold cross-validation using stratification
#> # A tibble: 10 × 2
#> splits id
#> <list> <chr>
#> 1 <split [36405/4048]> Fold01
#> 2 <split [36406/4047]> Fold02
#> 3 <split [36407/4046]> Fold03
#> 4 <split [36408/4045]> Fold04
#> 5 <split [36408/4045]> Fold05
#> 6 <split [36408/4045]> Fold06
#> 7 <split [36408/4045]> Fold07
#> 8 <split [36409/4044]> Fold08
#> 9 <split [36409/4044]> Fold09
#> 10 <split [36409/4044]> Fold10
ranger_spec <-
  rand_forest(trees = 1e3, min_n = tune(), mtry = tune()) %>%
  set_engine("ranger") %>%
  set_mode("regression")

xgb_spec <-
  boost_tree(tree_depth = tune(), learn_rate = tune(), loss_reduction = tune(),
             min_n = tune(), sample_size = tune(), trees = tune()) %>%
  set_engine("xgboost") %>%
  set_mode("regression")

cubist_spec <-
  cubist_rules(committees = tune(), neighbors = tune()) %>%
  set_engine("Cubist") %>%
  set_mode("regression")

base_rec <-
  recipe(formula = price ~ carat + cut + clarity,
         data = carat_train) %>%
  step_string2factor(cut, clarity)

min_pre_proc <-
  workflow_set(
    preproc = list(min_preproc = base_rec),
    models = list(RF = ranger_spec, xgboost = xgb_spec, Cubist = cubist_spec)
  )
## Evaluate models
race_ctrl <-
  control_race(
    save_pred = TRUE,
    parallel_over = "everything",
    save_workflow = TRUE
  )

race_results_carat <-
  min_pre_proc %>%
  workflow_map("tune_race_anova",
               seed = 1503,
               resamples = carat_folds,
               grid = 25,
               control = race_ctrl,
               verbose = TRUE)
#> i 1 of 3 tuning: min_preproc_RF
#> i Creating pre-processing data to finalize unknown parameter: mtry
#> ✓ 1 of 3 tuning: min_preproc_RF (4m 25.9s)
#> i 2 of 3 tuning: min_preproc_xgboost
#> Warning: All models failed. See the `.notes` column.
#> x 2 of 3 tuning: min_preproc_xgboost failed with: There were no valid metrics for the ANOVA model.
#> i 3 of 3 tuning: min_preproc_Cubist
#> ✓ 3 of 3 tuning: min_preproc_Cubist (3m 46.4s)
Created on 2022-01-31 by the reprex package (v2.0.1)
Session info
sessionInfo()
#> R version 4.1.2 (2021-11-01)
#> Platform: aarch64-apple-darwin20 (64-bit)
#> Running under: macOS Monterey 12.1
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/lib/libRlapack.dylib
#>
#> locale:
#> [1] nl_BE.UTF-8/nl_BE.UTF-8/nl_BE.UTF-8/C/nl_BE.UTF-8/nl_BE.UTF-8
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] Cubist_0.3.0 lattice_0.20-44 xgboost_1.5.0.2 ranger_0.13.1
#> [5] vctrs_0.3.8 rlang_0.4.12 finetune_0.1.0 rules_0.1.2
#> [9] baguette_0.1.1 here_1.0.1 forcats_0.5.1 stringr_1.4.0
#> [13] readr_2.1.1 tidyverse_1.3.1 yardstick_0.0.9 workflowsets_0.1.0
#> [17] workflows_0.2.4 tune_0.1.6 tidyr_1.1.4 tibble_3.1.6
#> [21] rsample_0.1.1 recipes_0.1.17 purrr_0.3.4 parsnip_0.1.7
#> [25] modeldata_0.1.1 infer_1.0.0 ggplot2_3.3.5 dplyr_1.0.7
#> [29] dials_0.0.10 scales_1.1.1 broom_0.7.11 tidymodels_0.1.4
#>
#> loaded via a namespace (and not attached):
#> [1] minqa_1.2.4 colorspace_2.0-2 ellipsis_0.3.2 class_7.3-19
#> [5] rprojroot_2.0.2 fs_1.5.2 rstudioapi_0.13 listenv_0.8.0
#> [9] furrr_0.2.3 earth_5.3.1 mvtnorm_1.1-3 prodlim_2019.11.13
#> [13] fansi_1.0.2 lubridate_1.8.0 xml2_1.3.3 codetools_0.2-18
#> [17] splines_4.1.2 doParallel_1.0.16 libcoin_1.0-9 knitr_1.37
#> [21] Formula_1.2-4 jsonlite_1.7.3 nloptr_1.2.2.3 pROC_1.18.0
#> [25] dbplyr_2.1.1 compiler_4.1.2 httr_1.4.2 backports_1.4.1
#> [29] assertthat_0.2.1 Matrix_1.3-4 fastmap_1.1.0 cli_3.1.1
#> [33] prettyunits_1.1.1 htmltools_0.5.2 tools_4.1.2 partykit_1.2-15
#> [37] gtable_0.3.0 glue_1.6.0 reshape2_1.4.4 Rcpp_1.0.8
#> [41] cellranger_1.1.0 DiceDesign_1.9 nlme_3.1-152 iterators_1.0.13
#> [45] inum_1.0-4 timeDate_3043.102 gower_0.2.2 xfun_0.29
#> [49] globals_0.14.0 lme4_1.1-27.1 rvest_1.0.2 lifecycle_1.0.1
#> [53] future_1.23.0 MASS_7.3-54 ipred_0.9-12 hms_1.1.1
#> [57] parallel_4.1.2 yaml_2.2.1 C50_0.1.5 TeachingDemos_2.12
#> [61] rpart_4.1-15 stringi_1.7.6 highr_0.9 plotrix_3.8-2
#> [65] foreach_1.5.1 lhs_1.1.3 boot_1.3-28 hardhat_0.1.6
#> [69] lava_1.6.10 pkgconfig_2.0.3 evaluate_0.14 tidyselect_1.1.1
#> [73] parallelly_1.30.0 plyr_1.8.6 magrittr_2.0.1 R6_2.5.1
#> [77] generics_0.1.1 DBI_1.1.2 pillar_1.6.4 haven_2.4.3
#> [81] withr_2.4.3 survival_3.2-13 nnet_7.3-16 future.apply_1.8.1
#> [85] modelr_0.1.8 crayon_1.4.2 utf8_1.2.2 tzdb_0.2.0
#> [89] rmarkdown_2.11 grid_4.1.2 readxl_1.3.1 data.table_1.14.2
#> [93] plotmo_3.6.1 reprex_2.0.1 digest_0.6.29 GPfit_1.0-8
#> [97] munsell_0.5.0
@wdkeyzer xgboost models require all-numeric predictors; they can't handle factor predictors like diamonds$clarity or diamonds$cut. You can check out this appendix for more info on the preprocessing needed for different models.
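For example, a minimal sketch of one fix is to add a step that makes dummy variables out of the nominal predictors, so xgboost only sees numeric columns:
base_rec <-
  recipe(price ~ carat + cut + clarity, data = carat_train) %>%
  step_dummy(all_nominal_predictors())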
Also, if you ever run into trouble with a workflow set like this, I recommend trying to just plain fit the workflow on your training data, or to use tune_grid(). You will likely get a better understanding of where the problems are.
Thank you @juliasilge for your help! I've come across the appendix before but didn't think about that. Regarding plain fit and tune_grid(), that's a pro tip that should improve my problem solving in the future. Thank you for pointing this out.
Hi Julia, in the section where you describe "Let’s use last_fit() to fit one final time to the training data and evaluate one final time on the testing data": what in the code demonstrates that the model is being used on the test set? For example:
collect_predictions(xgb_last) %>% mn_log_loss(is_home_run, .pred_HR)
@pspangler1 It's this code, where we use last_fit():
xgb_last <- xgb_wf %>%
  finalize_workflow(select_best(xgb_rs, "mn_log_loss")) %>%
  last_fit(bb_split)
If you look at the number of predictions coming out of collect_predictions(xgb_last), you'll notice it is the number of observations in the test set.
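A quick way to see this (a sketch using the objects from the post):
nrow(collect_predictions(xgb_last))  # predictions from last_fit()
nrow(testing(bb_split))              # observations in the test set: same number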
Is there a way to also get predictions for the training set?
@cseibold47 We recommend against repredicting the training set for most typical use cases, but you can use predict() with a fitted model on any data, which could include the training set.
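For example (a sketch, assuming a recent tune version where extract_workflow() works on last_fit() results, and bb_train being training(bb_split) as in the post):
fitted_wf <- extract_workflow(xgb_last)  # fitted workflow from last_fit()
predict(fitted_wf, new_data = bb_train)  # repredict the training set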
Hi Julia, would you be able to tell me, in the tune_race_anova() step: I know you say it's doing ANOVA to determine which parameter combinations aren't likely to be winners, but is it comparing the models using roc_auc or mn_log_loss?
@jtag04 You can read more about this in the docs, but the default is to use the first entry in the default metrics() for your model. You can instead specify a different metric to use via the metrics argument.
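For example, a sketch (xgb_wf and bb_folds assumed from the post); racing uses the first metric in the set to decide which configurations to drop:
library(finetune)
set.seed(123)
xgb_rs <- tune_race_anova(
  xgb_wf,
  resamples = bb_folds,
  grid = 15,
  metrics = metric_set(mn_log_loss, roc_auc)  # mn_log_loss drives the racing
)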
Thanks Julia, that's a big help
Hi Julia, I think I've run into a bug using finetune::tune_sim_anneal(): https://github.com/tidymodels/dials/issues/258. Is this something you've encountered before?
@jtag04 Hmmm, I haven't seen that before. Opening an issue was the right call, and it would be definitely helpful if you could create a reprex (a minimal reproducible example) for that issue. The goal of a reprex is to make it easier for people to recreate your problem so that they can understand it and/or fix it. If you've never heard of a reprex before, you may want to start with the tidyverse.org help page.
Yeah, totally; creating a reprex is going to take a little bit of doing, as the model/workflow contains sensitive data. I'll totally give it a shot if I don't hear from Max Kuhn in the coming days. Was hoping I might get lucky and someone would recognise what was going on. Has got me miffed. Cheers
Hi Julia, I've added a reprex to that dials package issue I logged. Hopefully that's some help. Cheers, Julian
Hey @juliasilge, I do recognise that we're in "open-source world"... but is there any special way of getting some attention to that Dials issue I've raised?