
parsnip doesn't patch `params` argument with `xgboost` engine in `boost_tree`

Open john-b-edwards opened this issue 3 years ago • 2 comments

The problem

I'm having trouble passing parameters to set_engine() for xgboost and getting tune_bayes() to work. I hadn't encountered this kind of issue before despite implementing similar logic in other code, so my supposition is that it's down to either a new version of xgboost or some change in parsnip.

Reproducible example

This code works:

library(tidymodels)
rec <- recipe(vs ~ ., data = mtcars |>
                mutate(vs = as.factor(vs)))

folds <- vfold_cv(mtcars |>
                    mutate(vs = as.factor(vs)), v = 2)

xgb_spec <- boost_tree(
  trees = 25,
  tree_depth = tune(),
  min_n = tune(),
  loss_reduction = tune(),
  sample_size = tune(),
  mtry = tune(),
  learn_rate = tune()
)

metrics <- metric_set(mn_log_loss)

std_xgb_spec <- xgb_spec |>
  set_engine("xgboost") |>
  set_mode("classification")

wf <- workflow() |>
  add_model(std_xgb_spec) |>
  add_recipe(rec)

params <- hardhat::extract_parameter_set_dials(wf) |>
  finalize(mtcars)

set.seed(123)
res <- wf |>
  tune_bayes(
    resamples = folds,
    param_info = params,
    initial = 3,
    iter = 10,
    metrics = metrics,
    control = control_bayes(
      no_improve = 10,
      verbose = F,
      time_limit = 240
    )
)
#> ! There are 6 tuning parameters and 3 grid points were requested.
#> • There are more tuning parameters than there are initial points. This is likely to cause numerical issues in the first few search iterations.
#> ! Fold1: preprocessor 1/1, model 2/3: 38 samples were requested but there were 16 rows in the data. 16 will be...
#> 
#> ! Fold1: preprocessor 1/1, model 3/3: 22 samples were requested but there were 16 rows in the data. 16 will be...
#> 
#> ! Fold2: preprocessor 1/1, model 2/3: 38 samples were requested but there were 16 rows in the data. 16 will be...
#> 
#> ! Fold2: preprocessor 1/1, model 3/3: 22 samples were requested but there were 16 rows in the data. 16 will be...
#> 
#> ! The Gaussian process model is being fit using 6 features but only has 3
#>   data points to do so. This may cause errors or a poor model fit.
#> 
#> ! The Gaussian process model is being fit using 6 features but only has 4
#>   data points to do so. This may cause errors or a poor model fit.
#> 
#> ! Fold1: preprocessor 1/1, model 1/1: 34 samples were requested but there were 16 rows in the data. 16 will be...
#> 
#> ! Fold2: preprocessor 1/1, model 1/1: 34 samples were requested but there were 16 rows in the data. 16 will be...
#> 
#> ! The Gaussian process model is being fit using 6 features but only has 5
#>   data points to do so. This may cause errors or a poor model fit.
#> 
#> ! The Gaussian process model is being fit using 6 features but only has 6
#>   data points to do so. This may cause errors or a poor model fit.
#> 
#> ! The Gaussian process model is being fit using 6 features but only has 7
#>   data points to do so. This may cause errors or a poor model fit.

But when the parameters are set explicitly, the Gaussian process fails:

library(tidymodels)
rec <- recipe(vs ~ ., data = mtcars |>
                mutate(vs = as.factor(vs)))

folds <- vfold_cv(mtcars |>
                    mutate(vs = as.factor(vs)), v = 2)

xgb_spec <- boost_tree(
  trees = 25,
  tree_depth = tune(),
  min_n = tune(),
  loss_reduction = tune(),
  sample_size = tune(),
  mtry = tune(),
  learn_rate = tune()
)

metrics <- metric_set(mn_log_loss)

std_xgb_spec <- xgb_spec |>
  set_engine("xgboost",
             params = list(
               tree_method = "hist",
               eval_metric = "logloss",
               objective = "binary:logistic")) |>
  set_mode("classification")

wf <- workflow() |>
  add_model(std_xgb_spec) |>
  add_recipe(rec)

params <- hardhat::extract_parameter_set_dials(wf) |>
  finalize(mtcars)

set.seed(123)
res <- wf |>
  tune_bayes(
    resamples = folds,
    param_info = params,
    initial = 3,
    iter = 10,
    metrics = metrics,
    control = control_bayes(
      no_improve = 10,
      verbose = F,
      time_limit = 240
    )
)
#> ! There are 6 tuning parameters and 3 grid points were requested.
#> • There are more tuning parameters than there are initial points. This is likely to cause numerical issues in the first few search iterations.
#> ! Fold1: preprocessor 1/1, model 2/3: 38 samples were requested but there were 16 rows in the data. 16 will be...
#> 
#> ! Fold1: preprocessor 1/1, model 3/3: 22 samples were requested but there were 16 rows in the data. 16 will be...
#> 
#> ! Fold2: preprocessor 1/1, model 2/3: 38 samples were requested but there were 16 rows in the data. 16 will be...
#> 
#> ! Fold2: preprocessor 1/1, model 3/3: 22 samples were requested but there were 16 rows in the data. 16 will be...
#> 
#> ! All of the mn_log_loss values were identical. The Gaussian process model
#>   cannot be fit to the data. Try expanding the range of the tuning
#>   parameters.
#> 
#> ! The Gaussian process model is being fit using 6 features but only has 3
#>   data points to do so. This may cause errors or a poor model fit.
#> 
#> x Gaussian process model:
#>   Error in GP_deviance(beta = row, X = X, Y = Y, nug_thres = nug_thres, ...
#>               unable to find optimum parameters
#> Error in `check_gp_failure()`:
#> ! Gaussian process model was not fit.
#> ✖ Optimization stopped prematurely; returning current results.

This is my session info:

> sessionInfo()
R version 4.2.1 (2022-06-23 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 22000)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.utf8  LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] xgboost_1.6.0.1       yardstick_1.0.0       workflowsets_1.0.0    workflows_1.0.0      
 [5] tune_1.0.0            tidyr_1.2.0           tibble_3.1.7          rsample_1.0.0        
 [9] recipes_1.0.1         purrr_0.3.4           parsnip_1.0.0         modeldata_1.0.0      
[13] infer_1.0.2           ggplot2_3.3.6         dplyr_1.0.9           dials_1.0.0          
[17] scales_1.2.0          broom_1.0.0           tidymodels_1.0.0.9000

loaded via a namespace (and not attached):
 [1] fs_1.5.2           usethis_2.1.6      lubridate_1.8.0    devtools_2.4.3    
 [5] DiceDesign_1.9     rprojroot_2.0.3    tools_4.2.1        backports_1.4.1   
 [9] utf8_1.2.2         R6_2.5.1           rpart_4.1.16       DBI_1.1.3         
[13] colorspace_2.0-3   nnet_7.3-17        withr_2.5.0        tidyselect_1.1.2  
[17] prettyunits_1.1.1  processx_3.7.0     curl_4.3.2         compiler_4.2.1    
[21] cli_3.3.0          callr_3.7.1        digest_0.6.29      rmarkdown_2.14    
[25] htmltools_0.5.3    pkgconfig_2.0.3    parallelly_1.32.1  sessioninfo_1.2.2 
[29] lhs_1.1.5          highr_0.9          fastmap_1.1.0      rlang_1.0.4       
[33] rstudioapi_0.13    generics_0.1.3     jsonlite_1.8.0     magrittr_2.0.3    
[37] Matrix_1.4-1       Rcpp_1.0.9         munsell_0.5.0      fansi_1.0.3       
[41] GPfit_1.0-8        clipr_0.8.0        lifecycle_1.0.1    furrr_0.3.0       
[45] yaml_2.3.5         MASS_7.3-57        pkgbuild_1.3.1     grid_4.2.1        
[49] parallel_4.2.1     listenv_0.8.0      crayon_1.5.1       lattice_0.20-45   
[53] splines_4.2.1      knitr_1.39         ps_1.7.1           pillar_1.8.0      
[57] future.apply_1.9.0 codetools_0.2-18   pkgload_1.3.0      reprex_2.0.1      
[61] glue_1.6.2         evaluate_0.15      data.table_1.14.2  remotes_2.4.2     
[65] vctrs_0.4.1        foreach_1.5.2      gtable_0.3.0       future_1.26.1     
[69] assertthat_0.2.1   cachem_1.0.6       xfun_0.31          gower_1.0.0       
[73] prodlim_2019.11.13 class_7.3-20       survival_3.3-1     timeDate_4021.104 
[77] iterators_1.0.14   memoise_2.0.1      hardhat_1.2.0      lava_1.6.10       
[81] globals_0.15.1     ellipsis_0.3.2     ipred_0.9-13   

If it's at all relevant, I am using the GPU-enabled Windows version of xgboost 1.6.0 from here.

john-b-edwards avatar Jul 22 '22 04:07 john-b-edwards

Thanks for the issue and reprex! This was a hard one to track down, but your inclination that this is specifically an xgboost + parsnip issue was correct. A slightly more minimal reprex:

library(parsnip)
library(modeldata)
  
class_data <- sim_classification(num_samples = 1e4)
  
# basic boost_tree spec with a non-default min_n
xgb_spec_2 <- boost_tree(min_n = 2) |> set_mode("classification")

# leave params alone
xgb_spec_2_def <- xgb_spec_2 |> set_engine("xgboost")

# set objective to its default explicitly
xgb_spec_2_eng <- xgb_spec_2 |> set_engine("xgboost", params = list(objective = "binary:logistic"))

# with default params:
set.seed(1)
fit(xgb_spec_2_def, class ~ ., data = class_data)
#> parsnip model object
#> 
#> ##### xgb.Booster
#> raw: 67 Kb 
#> call:
#>   xgboost::xgb.train(params = list(eta = 0.3, max_depth = 6, gamma = 0, 
#>     colsample_bytree = 1, colsample_bynode = 1, min_child_weight = 2, 
#>     subsample = 1, objective = "binary:logistic"), data = x$data, 
#>     nrounds = 15, watchlist = x$watchlist, verbose = 0, nthread = 1)
#> params (as set within xgb.train):
#>   eta = "0.3", max_depth = "6", gamma = "0", colsample_bytree = "1", colsample_bynode = "1", min_child_weight = "2", subsample = "1", objective = "binary:logistic", nthread = "1", validate_parameters = "TRUE"
#> xgb.attributes:
#>   niter
#> callbacks:
#>   cb.evaluation.log()
#> # of features: 15 
#> niter: 15
#> nfeatures : 15 
#> evaluation_log:
#>     iter training_logloss
#>        1        0.5715742
#>        2        0.4955393
#> ---                      
#>       14        0.2122282
#>       15        0.2033215

# now, with params set explicitly,
# note that the remaining defaults aren't "patched" in
set.seed(1)
fit(xgb_spec_2_eng, class ~ ., data = class_data)
#> parsnip model object
#> 
#> ##### xgb.Booster
#> raw: 71.8 Kb 
#> call:
#>   xgboost::xgb.train(params = list(objective = "binary:logistic"), 
#>     data = x$data, nrounds = 15, watchlist = x$watchlist, verbose = 0, 
#>     nthread = 1)
#> params (as set within xgb.train):
#>   objective = "binary:logistic", nthread = "1", validate_parameters = "TRUE"
#> xgb.attributes:
#>   niter
#> callbacks:
#>   cb.evaluation.log()
#> # of features: 15 
#> niter: 15
#> nfeatures : 15 
#> evaluation_log:
#>     iter training_logloss
#>        1        0.5716125
#>        2        0.4953513
#> ---                      
#>       14        0.2068473
#>       15        0.1963899

Created on 2022-07-22 by the reprex package (v2.0.1)

bonsai has some machinery to "patch" the params argument when non-main arguments are passed to params; we can emulate that here.
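
To make that concrete, here is a minimal sketch of what such patching could look like (illustrative only, not bonsai's or parsnip's actual code; patch_params is a hypothetical helper, and the default values simply mirror those printed in the first fit() above):

# hypothetical helper: merge user-supplied params into the engine defaults
# rather than replacing them wholesale
patch_params <- function(user_params, defaults) {
  # modifyList() keeps every default and only overwrites or adds the
  # entries the user supplied explicitly
  utils::modifyList(defaults, user_params)
}

# defaults as constructed by parsnip in the first fit() above
default_params <- list(
  eta = 0.3, max_depth = 6, gamma = 0,
  colsample_bytree = 1, colsample_bynode = 1,
  min_child_weight = 2, subsample = 1,
  objective = "binary:logistic"
)

# the user's params from the original reprex are layered on top of the
# defaults instead of discarding them
patch_params(
  list(tree_method = "hist", eval_metric = "logloss", objective = "binary:logistic"),
  default_params
)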

simonpcouch avatar Jul 22 '22 21:07 simonpcouch

There's some related documentation in ?details_boost_tree_xgboost, but it leaves some to be said. At the moment, you can pass a different loss as objective to set_engine(), but this doesn't allow interfacing with the other elements of params.

https://github.com/tidymodels/parsnip/blob/f8505bdf6f039076c1eca5aa37a103a6e072887d/man/rmd/boost_tree_xgboost.md#L142
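
For illustration, a sketch of that workaround based on the linked docs (only the loss is swapped; xgb_spec_obj is just an illustrative name, and other xgb.train parameters such as tree_method or eval_metric aren't covered by this route):

library(parsnip)

# workaround sketch: pass the loss directly as `objective` to set_engine()
# instead of wrapping it in a `params` list, so parsnip's other defaults
# stay intact
xgb_spec_obj <- boost_tree(trees = 25) |>
  set_engine("xgboost", objective = "binary:logistic") |>
  set_mode("classification")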

simonpcouch avatar Jul 22 '22 21:07 simonpcouch

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

github-actions[bot] avatar Sep 01 '22 01:09 github-actions[bot]