Issue with tune_bayes and random forest: This argument contains unknowns: `mtry`

Open MatthieuStigler opened this issue 8 months ago • 5 comments

Running tune_bayes with a random forest with mtry = tune(), I get the error message:

#> Error in `dials::grid_space_filling()`:
#> ✖ This argument contains unknowns: `mtry`.

The same issue does not occur with tune_grid, which seems to add an extra step:

i Creating pre-processing data to finalize unknown parameter: mtry

Am I missing something, or is a call to finalize() perhaps missing somewhere in tune_bayes? Also, can I trigger the finalize() step myself?

This issue seems linked to https://github.com/tidymodels/finetune/issues/39?

library(tidymodels)

# 1. Simulate a synthetic dataset
set.seed(123)
data <- data.frame(
  Yield = rnorm(100, mean = 50, sd = 10),   # Target variable
  Soil_pH = runif(100, 5.5, 7.5),           # Predictor
  Nitrogen = rnorm(100, mean = 100, sd = 20),
  Rainfall = rnorm(100, mean = 200, sd = 50)
)

# 2. Create train-test split
data_split <- initial_split(data, prop = 0.8)
training <- training(data_split)

basic_rec <- recipe(Yield  ~ ., data = training)

# 3. Recipes, workflows, etc.
rf_spec <- 
  rand_forest(mtry = tune(), min_n = tune(), trees = 1000) %>% 
  set_engine("ranger") %>% 
  set_mode("regression")

workflow_simple <- 
  workflow_set(
    preproc = list(simple = basic_rec), 
    models = list(RF = rf_spec)
  )



## grid
race_results_tune_grid <-
  workflow_simple %>%
  workflow_map(
    fn="tune_grid",
    resamples = vfold_cv(training, repeats = 5))
#> i Creating pre-processing data to finalize unknown parameter: mtry
race_results_tune_grid$result
#> [[1]]
#> # Tuning results
#> # 10-fold cross-validation repeated 5 times 
#> # A tibble: 50 × 5
#>    splits         id      id2    .metrics          .notes          
#>    <list>         <chr>   <chr>  <list>            <list>          
#>  1 <split [72/8]> Repeat1 Fold01 <tibble [20 × 6]> <tibble [0 × 3]>
#>  2 <split [72/8]> Repeat1 Fold02 <tibble [20 × 6]> <tibble [0 × 3]>
#>  3 <split [72/8]> Repeat1 Fold03 <tibble [20 × 6]> <tibble [0 × 3]>
#>  4 <split [72/8]> Repeat1 Fold04 <tibble [20 × 6]> <tibble [0 × 3]>
#>  5 <split [72/8]> Repeat1 Fold05 <tibble [20 × 6]> <tibble [0 × 3]>
#>  6 <split [72/8]> Repeat1 Fold06 <tibble [20 × 6]> <tibble [0 × 3]>
#>  7 <split [72/8]> Repeat1 Fold07 <tibble [20 × 6]> <tibble [0 × 3]>
#>  8 <split [72/8]> Repeat1 Fold08 <tibble [20 × 6]> <tibble [0 × 3]>
#>  9 <split [72/8]> Repeat1 Fold09 <tibble [20 × 6]> <tibble [0 × 3]>
#> 10 <split [72/8]> Repeat1 Fold10 <tibble [20 × 6]> <tibble [0 × 3]>
#> # ℹ 40 more rows

## bayes
race_results_tune_bayes <-
  workflow_simple %>%
  workflow_map(
    fn="tune_bayes",
    resamples = vfold_cv(training, repeats = 5))
race_results_tune_bayes$result
#> [[1]]
#> [1] "Error in dials::grid_space_filling(param, size = n) : \n  ✖ This argument contains unknowns: `mtry`.\nℹ See the `dials::finalize()` function.\n"
#> attr(,"class")
#> [1] "try-error"
#> attr(,"condition")
#> <error/rlang_error>
#> Error in `dials::grid_space_filling()`:
#> ✖ This argument contains unknowns: `mtry`.
#> ℹ See the `dials::finalize()` function.
#> ---
#> Backtrace:
#>      ▆
#>   1. ├─workflow_simple %>% ...
#>   2. ├─workflowsets::workflow_map(...)
#>   3. │ ├─base::system.time(...)
#>   4. │ ├─withr::with_seed(...)
#>   5. │ │ └─withr::with_preserve_seed(...)
#>   6. │ ├─base::try(rlang::eval_tidy(cl), silent = TRUE)
#>   7. │ │ └─base::tryCatch(...)
#>   8. │ │   └─base (local) tryCatchList(expr, classes, parentenv, handlers)
#>   9. │ │     └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
#>  10. │ │       └─base (local) doTryCatch(return(expr), name, parentenv, handler)
#>  11. │ └─rlang::eval_tidy(cl)
#>  12. ├─tune::tune_bayes(object = `<workflow>`, resamples = `<vfold[,3]>`)
#>  13. └─tune:::tune_bayes.workflow(object = `<workflow>`, resamples = `<vfold[,3]>`)
#>  14.   └─tune:::tune_bayes_workflow(...)
#>  15.     └─tune::check_initial(...)
#>  16.       └─tune:::create_initial_set(pset, n = x, checks = checks)
#>  17.         ├─dials::grid_space_filling(param, size = n)
#>  18.         └─dials:::grid_space_filling.parameters(param, size = n)

Created on 2025-03-16 with reprex v2.1.1

MatthieuStigler avatar Mar 16 '25 10:03 MatthieuStigler

A simpler reprex:

library(tidymodels)

data(cells, package = "modeldata")
cells <- cells %>% select(-case)
folds <- bootstraps(cells, times = 5)

xgb_spec <-
  boost_tree(mtry = tune(), trees = 500) %>%
  set_engine("xgboost") %>%
  set_mode("classification")

xgb_wf <- workflow(class ~ ., xgb_spec)

tune_bayes(xgb_wf, resamples = folds, iter = 3)
#> Error in `dials::grid_space_filling()`:
#> ✖ This argument contains unknowns: `mtry`.
#> ℹ See the `dials::finalize()` function.

Created on 2025-03-16 with reprex v2.1.1

MatthieuStigler avatar Mar 16 '25 10:03 MatthieuStigler

If you can, I would use update() to set the mtry range (if you know it). For example:


basic_rec <- recipe(Yield  ~ ., data = training)

rf_spec <- 
  rand_forest(mtry = tune(), min_n = tune(), trees = 1000) %>% 
  set_engine("ranger") %>% 
  set_mode("regression")

# Changes here: give mtry a known range (the data have 3 predictors)
rf_param <- 
  rf_spec %>% 
  extract_parameter_set_dials() %>% 
  update(mtry = mtry(c(1, 3)))

workflow_simple <- 
  workflow_set(
    preproc = list(simple = basic_rec), 
    models = list(RF = rf_spec)
  ) %>% 
  # Changes here: 
  option_add(param_info = rf_param)

When you use a recipe, you could be altering the number of predictors via steps like step_pca() or step_dummy(). If you were tuning steps in the recipe, this error would be expected.

But you are not, and the formula should be able to resolve the range of mtry.

So it's a bug. I'll take a look to figure out the issue. In the meantime, set the range of mtry if you can.
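
On the question of triggering the finalize() step by hand: the following is a minimal sketch, assuming the predictors reach the model unchanged (no recipe steps that add or drop columns). It uses `mtcars` as stand-in data, and the object names (`predictors`, `mtry_final`) are illustrative rather than part of the reprex above.

```r
library(tidymodels)

# Stand-in predictors; for the reprex above this would be the training
# data without the outcome column.
predictors <- mtcars %>% select(-mpg)

# By default the upper end of the mtry range is unknown().
has_unknowns(mtry())

# finalize() fills in the upper bound from the number of predictor columns.
mtry_final <- finalize(mtry(), predictors)
range_get(mtry_final)$upper == ncol(predictors)

# The same call can be applied to a whole parameter set.
rf_spec <-
  rand_forest(mtry = tune(), min_n = tune(), trees = 1000) %>%
  set_engine("ranger") %>%
  set_mode("regression")

rf_param <-
  rf_spec %>%
  extract_parameter_set_dials() %>%
  finalize(predictors)

rf_param
```

A parameter set finalized this way can be supplied through param_info in the same manner as one built with update().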

topepo avatar Mar 18 '25 21:03 topepo

Thanks a lot @topepo for looking into this! I look forward to the fix, and thanks for providing a workaround.

Regarding the workaround: note that I am actually doing both tune_grid() and then tune_bayes(). Would you have any advice on how I could use the results from tune_grid to inform the choice for the tune_bayes parameter mtry?

  • One approach would be to use select_best() to get the best mtry, but maybe this defeats the purpose of the Bayesian search, which is to explore a larger parameter space?
  • Or maybe I could reuse the range of mtry values that tune_grid() derived (through an implicit call to update()?) at the step "Creating pre-processing data to finalize unknown parameter: mtry"?

A related question: once the function is fixed, will the lengthy step "Creating pre-processing data to finalize unknown parameter: mtry" be triggered for each tuning strategy, or will there be a way to reuse it from one strategy to the next?

Thanks a lot!

MatthieuStigler avatar Mar 19 '25 08:03 MatthieuStigler

Passing a parameter set using the param_info argument is the best option if you know an upper limit for mtry. Everything ~will~ should just work in that case because the unknown in the mtry parameter range is what triggers the error.

I've added an issue to make the parameter set carry over (when possible) so you don't have to pass param_info repeatedly, but the primary issue is getting the unknown out of the mtry parameter object.
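
For a single workflow (rather than a workflow set), the param_info route might look like the sketch below. It is a hedged illustration, not a tested fix for the affected versions: it uses `mtcars` and a small forest so that it runs quickly, and the resampling and iteration settings are arbitrary choices made for the example.

```r
library(tidymodels)

set.seed(1)
folds <- vfold_cv(mtcars, v = 3)

rf_spec <-
  rand_forest(mtry = tune(), trees = 50) %>%
  set_engine("ranger") %>%
  set_mode("regression")

rf_wf <- workflow(mpg ~ ., rf_spec)

# Replace the unknown upper bound of mtry with a known one
# (mtcars has 10 predictors once mpg is the outcome).
rf_param <-
  rf_wf %>%
  extract_parameter_set_dials() %>%
  update(mtry = mtry(c(1, 10)))

# Supply the finalized parameter set directly to tune_bayes().
bayes_res <-
  tune_bayes(
    rf_wf,
    resamples = folds,
    param_info = rf_param,
    initial = 3,
    iter = 2
  )

show_best(bayes_res, metric = "rmse")
```

With a workflow set, the equivalent is the option_add(param_info = ...) call shown earlier in the thread.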

topepo avatar Mar 19 '25 12:03 topepo

Great, thanks a lot for your answer! To be sure I understood you correctly: if I first run tune_grid(), could I then use its maximum mtry value to manually set the range for tune_bayes() for now?

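max_mtry_grid <- max(race_results_tune_grid$result[[1]]$.metrics[[1]]$mtry)
# update() needs the parameter set as its first argument
rf_param <- extract_parameter_set_dials(rf_spec) %>%
  update(mtry = mtry(c(1, max_mtry_grid)))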

thanks again!

MatthieuStigler avatar Mar 19 '25 13:03 MatthieuStigler

Sorry for the belated response.

I believe that will work if you use a space-filling design (or one where we definitely have a design point on the extremes).
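
The caveat above can be checked directly. The sketch below, again on `mtcars` with illustrative names, compares the largest mtry value that a grid search happened to try with the number of predictors. It assumes a formula preprocessor with numeric predictors only, so the predictor count is simply the number of non-outcome columns.

```r
library(tidymodels)

set.seed(2)
folds <- vfold_cv(mtcars, v = 3)

rf_spec <-
  rand_forest(mtry = tune(), trees = 50) %>%
  set_engine("ranger") %>%
  set_mode("regression")

rf_wf <- workflow(mpg ~ ., rf_spec)

# tune_grid() finalizes mtry internally before building the grid.
grid_res <- tune_grid(rf_wf, resamples = folds, grid = 5)

# Largest mtry value the grid actually evaluated.
max_mtry_grid <- max(collect_metrics(grid_res)$mtry)

# Upper limit implied by the data: one per predictor column.
n_predictors <- ncol(mtcars) - 1

# The grid maximum may fall short of the true upper limit if no design
# point landed on the extreme, so the predictor count is the safer bound.
rf_param <-
  extract_parameter_set_dials(rf_wf) %>%
  update(mtry = mtry(c(1, n_predictors)))
```

If the two numbers differ, using `max_mtry_grid` would silently narrow the search space for tune_bayes().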

topepo avatar Apr 15 '25 17:04 topepo

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

github-actions[bot] avatar Apr 30 '25 00:04 github-actions[bot]