cache duplicated preprocessor fits

Open simonpcouch opened this issue 1 year ago • 0 comments

Closes #955. A proof of concept for cutting out duplicated preprocessor fits for iterative tuning approaches. A bit of a contrived example, but on main:

library(tidymodels)
library(embed)

hotel_rates$arrival_date <- NULL
  
bench::mark(
  tune_bayes = tune_bayes(
    workflow(
      recipe(avg_price_per_room ~ ., hotel_rates) %>% 
        step_lencode_glm(all_nominal_predictors(), outcome = vars(avg_price_per_room)),
      linear_reg(engine = "glmnet", penalty = tune())
    ),
    vfold_cv(hotel_rates)
  )
)
#> ! No improvement for 10 iterations; returning current results.
#> ! No improvement for 10 iterations; returning current results.
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 1 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 tune_bayes    2.16m    2.16m   0.00773      38GB    0.897

Created on 2024-11-01 with reprex v2.1.1

After this PR:

library(tidymodels)
library(embed)

hotel_rates$arrival_date <- NULL
  
bench::mark(
  tune_bayes = tune_bayes(
    workflow(
      recipe(avg_price_per_room ~ ., hotel_rates) %>% 
        step_lencode_glm(all_nominal_predictors(), outcome = vars(avg_price_per_room)),
      linear_reg(engine = "glmnet", penalty = tune())
    ),
    vfold_cv(hotel_rates)
  )
)
#> ! No improvement for 10 iterations; returning current results.
#> ! No improvement for 10 iterations; returning current results.
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 1 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 tune_bayes      20s      20s    0.0500    5.16GB     2.05

Created on 2024-11-01 with reprex v2.1.1

  • I'm not fully convinced it's actually this simple. 1) Are the split_id and iter_msg_preprocessor together enough to uniquely identify a preprocessor fit when tuning iteratively? 2) What are the implications for different types of parallelism? We haven't used the progress_env for parallelized computations so far, but the fact that has_cached_result() would only return TRUE for iterative searches might mean that the needed information is always available in the parent process.
  • This same machinery could be used to e.g. cache model fits if only tuning the postprocessor.
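
To make the first question concrete, here is a minimal sketch of the general idea, not the code in this PR: a cache held in an environment and keyed by the resample id plus a preprocessor label. The names `fit_cache`, `cache_key()`, `has_cached_fit()`, and `fit_preprocessor_cached()` are hypothetical and are not tune internals, and the "fit" is a stand-in closure rather than a real recipe `prep()`.

```r
# Hypothetical cache: an environment that outlives individual tuning iterations.
fit_cache <- new.env(parent = emptyenv())

# Assumption: the split id and a preprocessor label together identify a fit.
cache_key <- function(split_id, preprocessor_label) {
  paste(split_id, preprocessor_label, sep = "::")
}

has_cached_fit <- function(key) {
  exists(key, envir = fit_cache, inherits = FALSE)
}

# Fit only when the key is new; otherwise return the stored result.
fit_preprocessor_cached <- function(split_id, preprocessor_label, fit_fn) {
  key <- cache_key(split_id, preprocessor_label)
  if (!has_cached_fit(key)) {
    assign(key, fit_fn(), envir = fit_cache)
  }
  get(key, envir = fit_cache, inherits = FALSE)
}

# Stand-in for an expensive preprocessor fit; counts how often it really runs.
counter <- new.env()
counter$n <- 0
expensive_fit <- function() {
  counter$n <- counter$n + 1
  list(fitted = TRUE)
}

# Three search iterations over two folds with an untuned preprocessor.
for (iter in 1:3) {
  for (split_id in c("Fold01", "Fold02")) {
    fit_preprocessor_cached(split_id, "preprocessor 1/1", expensive_fit)
  }
}

counter$n
```

In this toy setup the stand-in fit runs once per fold instead of once per fold per iteration. The sketch also shows why both questions matter: if two distinct preprocessors ever shared a key, the second would silently reuse the first one's fit, and an environment like this lives in a single R process, so worker processes would not see entries written elsewhere unless the cache is only read and written in the parent.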

simonpcouch, Nov 01 '24 19:11