tune
cache duplicated preprocessor fits
Closes #955. A proof of concept for cutting out duplicated preprocessor fits for iterative tuning approaches. A bit of a contrived example, but on main:
library(tidymodels)
library(embed)
hotel_rates$arrival_date <- NULL
bench::mark(
  tune_bayes = tune_bayes(
    workflow(
      recipe(avg_price_per_room ~ ., hotel_rates) %>%
        step_lencode_glm(all_nominal_predictors(), outcome = vars(avg_price_per_room)),
      linear_reg(engine = "glmnet", penalty = tune())
    ),
    vfold_cv(hotel_rates)
  )
)
#> ! No improvement for 10 iterations; returning current results.
#> ! No improvement for 10 iterations; returning current results.
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 1 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 tune_bayes    2.16m    2.16m   0.00773      38GB    0.897
Created on 2024-11-01 with reprex v2.1.1
After this PR:
library(tidymodels)
library(embed)
hotel_rates$arrival_date <- NULL
bench::mark(
  tune_bayes = tune_bayes(
    workflow(
      recipe(avg_price_per_room ~ ., hotel_rates) %>%
        step_lencode_glm(all_nominal_predictors(), outcome = vars(avg_price_per_room)),
      linear_reg(engine = "glmnet", penalty = tune())
    ),
    vfold_cv(hotel_rates)
  )
)
#> ! No improvement for 10 iterations; returning current results.
#> ! No improvement for 10 iterations; returning current results.
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 1 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 tune_bayes      20s      20s    0.0500    5.16GB     2.05
Created on 2024-11-01 with reprex v2.1.1
- I'm not fully convinced it's actually this simple.
  1) Are the `split_id` and `iter_msg_preprocessor` enough to uniquely identify a preprocessor fit when tuning iteratively?
  2) What are the implications with different types of parallelism? We haven't used the `progress_env` for parallelized computations so far, but the fact that `has_cached_result()` would only return `TRUE` for iterative searches might mean that the needed information is always available in the parent process.
- This same machinery could be used to e.g. cache model fits if only tuning the postprocessor.
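
To make the caching idea concrete, here is a minimal base-R sketch of memoizing a fit on a key built from a resample id and a preprocessor label. It is not the code in this PR: `fit_cache`, `cache_key()`, `cached_fit()`, and `expensive_fit()` are hypothetical names, and the label `"preprocessor 1/1"` is only an assumed stand-in for whatever `iter_msg_preprocessor` holds.

```r
# Hypothetical cache for illustration; not tune's internal implementation.
fit_cache <- new.env(parent = emptyenv())

# Build a cache key from the resample id and a label for the preprocessor.
cache_key <- function(split_id, preprocessor_label) {
  paste(split_id, preprocessor_label, sep = "::")
}

# Return the cached fit if one exists; otherwise run `fit_fn()` once and store it.
cached_fit <- function(split_id, preprocessor_label, fit_fn) {
  key <- cache_key(split_id, preprocessor_label)
  if (!exists(key, envir = fit_cache, inherits = FALSE)) {
    assign(key, fit_fn(), envir = fit_cache)
  }
  get(key, envir = fit_cache, inherits = FALSE)
}

# Stand-in for an expensive preprocessor fit; counts how often it actually runs.
n_fits <- 0
expensive_fit <- function() {
  n_fits <<- n_fits + 1
  list(fit_number = n_fits)
}

# Three search iterations over two folds. The preprocessor is not tuned,
# so its label stays the same across iterations (an assumed label).
for (iter in 1:3) {
  for (fold in c("Fold01", "Fold02")) {
    cached_fit(fold, "preprocessor 1/1", expensive_fit)
  }
}

n_fits
```

In this sketch the fit runs once per fold rather than once per fold per iteration. Whether the key is unique enough is exactly question 1) above: if two different preprocessors ever shared a label within a split, the sketch would return the wrong fit. An environment like this also lives in a single R process, so worker processes would not see each other's entries.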