Restructuring grid search processing

We're making a substantial change to the workhorse of the package for two reasons:

  • to complete the conversion to the future package, and
  • to enable efficient tuning of postprocessing parameters.

Currently, the computing path is:

tune_grid.* -> tune_grid_workflow
└─ tune_grid_workflow -> tune_grid_loop
    └─ tune_grid_loop -> tune_grid_loop_tune (aka fn_tune_grid_loop)
         └─ tune_grid_loop_tune -> tune_grid_loop_iter (aka fn_tune_grid_loop_iter)

where

  • tune_grid_loop() is the top-level call to compute. After the computations, the individual results (e.g., metrics, predictions, and extracts) are partitioned out of the results object.

  • tune_grid_loop_tune() (re)sets parallel_over based on the number of resamples and then calls the iterator.

  • tune_grid_loop_iter() is the script that goes through the conditional execution process from preprocessing to model prediction.

Plan for the new approach

We create a schedule of computations (#974 and #978) for a grid of tuning parameters. This defines the conditional execution that loops over preprocessors, models, and now postprocessors. It also accounts for speed-ups achieved via submodel parameters.

Given the grid and a specific resample rsplit object (that defines the analysis and assessment data), we can run a function to create the schedule and execute it (the current pet name is "loopy()"). This will return all of our default results and any optional results (e.g., predictions and extracts).
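
As a rough, hypothetical stub (make_schedule() and the exact return structure are placeholders for illustration, not the package's actual API):

# Hypothetical sketch of the shape of loopy()
loopy <- function(rsplit, grid) {
  schedule <- make_schedule(grid)  # assumed helper; see #974 and #978
  # ... conditionally loop over preprocessors, models, and postprocessors,
  # fitting on the analysis set and predicting on the assessment set ...
  list(
    metrics     = NULL,  # always returned
    predictions = NULL,  # optional, depending on the control settings
    extracts    = NULL   # optional
  )
}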

Let's say we have B resamples and S grid points. We can call loopy() in a few different ways. Currently, tune defaults to a loop of B iterations, each processing the S grid points. However, one option (controlled by parallel_over, see this section of TMwR) "flattens" the loop so that all B*S tasks can be run in parallel.

We can choose which path to take using this pseudocode:

# `splits` is a list of B rsplit objects
# `grid` is the data frame of S candidates (in rows)
# `grid_rows` is `grid` decomposed into a list of S 1-point grid subsets

if (parallel_over == "resamples") {
  # The default: loop over the B splits, processing the whole grid of S
  # candidates within each split
  res <- purrr::map(splits, ~ loopy(.x, grid))
} else {
  # Do all B * S tasks at once, either because preprocessing is cheap or
  # because a validation set is being used

  # Make all combinations of split and candidate indices, then chop the
  # result into a list of one-row data frames
  indices <- tidyr::crossing(s = seq_along(grid_rows), b = seq_along(splits))
  indices <- vctrs::vec_chop(indices)

  res <- purrr::map(indices, ~ loopy(splits[[.x$b]], grid_rows[[.x$s]]))
}

We'll probably map using future.apply::future_lapply().
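
For instance, the flattened branch might look something like this (a sketch assuming the loopy(), splits, and grid_rows objects above):

library(future.apply)  # also attaches the future package

plan(multisession)  # or any other future backend

# `future.seed = TRUE` gives reproducible, parallel-safe RNG streams in
# case any preprocessing or fitting step uses random numbers
res <- future_lapply(
  indices,
  function(ind) loopy(splits[[ind$b]], grid_rows[[ind$s]]),
  future.seed = TRUE
)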

We ~~hope~~ think that the new code path is:

tune_grid.* -> tune_grid_workflow
└─ tune_grid_workflow -> tune_grid_loop

tune_grid_loop() will set the data and other options, execute the pseudocode above, and parse the results into their different components.

Special cases and notes

  • We will keep Simon's logging method to catalog messages, warnings, and errors methodically.

  • We must branch inside loopy() to handle h2o processing via the agua package.

  • We will not have a dependency on foreach anymore. 😿

  • The pattern used in .config is currently "Preprocessor{X}Model{X}". We'll change this to "pre{X}_mod{X}_post{X}", where "{X}" is zero-padded, or is just zero when there are no pre- or postprocessing tuning parameters (see the label sketch after this list).

  • In the pseudocode above, grid_rows is a little more complex than it appears. Instead of S 1-point grids, its elements can have multiple rows when a submodel parameter is being tuned. To do this, we emulate min_grid and group the grid candidates into unique combinations of all non-submodel parameters. For example, suppose a regular grid is made with 3 levels for each of 3 parameters (2 non-submodel parameters and a single submodel parameter). The regular grid will have 27 rows, but grid_rows will be a list of 9 grids, each having three rows (see the grid sketch after this list). Keeping the submodels together in each sub-grid allows the schedule and loopy() to realize their submodel speed-ups.
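
To illustrate the new labeling, a hypothetical helper (the function name and padding rule are assumptions for illustration only):

# Hypothetical: build a ".config" label such as "pre1_mod003_post0", with
# the zero-padding width assumed to come from the number of candidates
new_config <- function(pre, mod, post, n_models = 1) {
  mod_lab <- formatC(mod, width = nchar(n_models), flag = "0")
  paste0("pre", pre, "_mod", mod_lab, "_post", post)
}

new_config(pre = 1, mod = 3, post = 0, n_models = 100)
#> [1] "pre1_mod003_post0"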
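
And a sketch of the grid_rows decomposition for the 27-row example above, grouping on the non-submodel parameters (the parameter names are made up; the real code would emulate min_grid):

library(dplyr)
library(tidyr)

# A regular grid: 2 non-submodel parameters (`mtry`, `min_n`) and one
# submodel parameter (`trees`), with 3 levels each -> 27 rows
grid <- crossing(
  mtry  = c(2, 4, 6),
  min_n = c(5, 10, 20),
  trees = c(500, 1000, 1500)
)

# Group on the non-submodel parameters so each sub-grid keeps its three
# submodel values together; loopy() can then fit one model per sub-grid
# and predict for all of its submodel candidates
grid_rows <- grid %>% group_by(mtry, min_n) %>% group_split()

length(grid_rows)     # 9 sub-grids
nrow(grid_rows[[1]])  # each with 3 rows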

topepo, Feb 15 '25