# Restructuring grid search processing
We're making a substantial change to the workhorse of the package for two reasons:
- complete conversion to the future package.
- enable efficient tuning of postprocessing parameters.
Currently, the computing path is:
```
tune_grid.* -> tune_grid_workflow
└─ tune_grid_workflow -> tune_grid_loop
   └─ tune_grid_loop -> tune_grid_loop_tune (aka fn_tune_grid_loop)
      └─ tune_grid_loop_tune -> tune_grid_loop_iter (aka fn_tune_grid_loop_iter)
```
where

- `tune_grid_loop()` is the top-level call to compute. After computations, the individual results (e.g., metrics, predictions, extracts, etc.) are partitioned out of the results object.
- `tune_grid_loop_tune()` (re)sets `parallel_over` based on the number of resamples, then calls the iterator.
- `tune_grid_loop_iter()` is the script that goes through the conditional execution process from preprocessing to model prediction.
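Schematically, the conditional execution in `tune_grid_loop_iter()` amounts to a nested loop in which each preprocessor is prepped once and reused across the model candidates that share it. A minimal sketch with toy stand-ins (the "preprocessors", "models", and result values here are invented for illustration, not the package internals):

```r
# Toy stand-ins: a "preprocessor" is just a function, a "model" a number
preprocessors <- list(rec1 = toupper, rec2 = tolower)
models <- c(1, 2)

results <- list()
for (pre_name in names(preprocessors)) {
  # Prep the preprocessor once per recipe, not once per model candidate
  prepped <- preprocessors[[pre_name]]("Data")
  for (mod in models) {
    # Fit/predict each model candidate on the already-prepped data
    results[[paste(pre_name, mod, sep = "_")]] <- paste(prepped, mod)
  }
}
length(results)  # 4: 2 preprocessors x 2 model candidates
```

The point of the nesting is that expensive preprocessing work is shared across all model candidates that use the same preprocessor.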
## Plan for the new approach
We create a schedule of computations (#974 and #978) for a grid of tuning parameters. This defines the conditional execution that loops over preprocessors, models, and now postprocessors. It also accounts for speed-ups achieved via submodel parameters.
Given the grid and a specific resample `rsplit` object (which defines the analysis and assessment data), we can run a function to create the schedule and execute it (current pet name is `loopy()`). This will return all of our default results and any optional results (e.g., predictions and extracts).
Let's say we have B resamples and S grid points. We can call `loopy()` in a few different ways. Currently, tune defaults to a loop of B iterations, each processing the S grid points. However, one option (controlled by `parallel_over`; see this section of TMwR) "flattens" the loop so that all B * S tasks can be run in parallel.
We can choose which path to take using this pseudocode:
```r
# `splits` is a list of B rsplit objects
# `grid` is the data frame of S candidates (in rows)
# `grid_rows` is `grid` decomposed into a list of S 1-point grid subsets
if (parallel_over == "resamples") {
  # The default: loop over the B splits, processing the whole grid of S
  # candidates within each
  res <- map(splits, ~ loopy(.x, grid))
} else {
  # Do all B * S tasks at once, either because preprocessing is cheap or a
  # validation set is being used. Make a list of all combinations of indices
  # for splits and candidates.
  indices <- crossing(s = seq_along(grid_rows), b = seq_along(splits))
  indices <- vec_split(indices, seq_len(nrow(indices)))$val
  res <- map(indices, ~ loopy(splits[[.x$b]], grid_rows[[.x$s]]))
}
```
We'll probably map using `future.apply::future_lapply()`.
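As a sketch of what the flattened branch might look like with `future.apply::future_lapply()`, here is a self-contained toy version; everything besides `future_lapply()` and `plan()` (the character "splits", the stand-in `loopy()`, the one-column grids) is a placeholder, not the real API:

```r
library(future.apply)
plan(sequential)  # swap for plan(multisession) to actually run in parallel

splits <- list("fold1", "fold2", "fold3")                # B = 3 resamples
grid_rows <- list(data.frame(p = 1), data.frame(p = 2))  # S = 2 candidates
loopy <- function(split, grid) paste(split, grid$p[1], sep = "-")

# Flatten the B * S combinations and farm them out with future_lapply()
indices <- expand.grid(b = seq_along(splits), s = seq_along(grid_rows))
res <- future_lapply(
  seq_len(nrow(indices)),
  function(i) loopy(splits[[indices$b[i]]], grid_rows[[indices$s[i]]])
)
length(res)  # 6 results, one per split/candidate pair
```

Because each of the B * S tasks is independent, changing `plan()` is all it takes to move between sequential and parallel execution.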
We ~~hope~~ think that the new code path is:
```
tune_grid.* -> tune_grid_workflow
└─ tune_grid_workflow -> tune_grid_loop
```
`tune_grid_loop()` will set the data options and other options, execute the pseudocode above, and parse the results into different components.
## Special cases and notes
- We will keep Simon's logging method to catalog messages, warnings, and errors methodically.
- We must branch inside `loopy()` to handle h2o processing via the agua package.
- We will no longer have a dependency on foreach. 😿
- The pattern used in `.config` is currently `"Preprocessor{X}Model{X}"`. We'll change this to `"pre{X}_mod{X}_post{X}"`, where `{X}` is padded with zeros, or is just zero when there are no pre- or postprocessing tuning parameters.
- In the pseudocode above, `grid_rows` is a little more complex than it appears. Instead of S 1-point grids, it can have multiple rows when a submodel parameter is being tuned. To do this, we emulate `min_grid()` and group the grid candidates into unique combinations of all non-submodel parameters. For example, if a regular grid is made with 3 levels for each of 3 parameters (2 non-submodel and a single submodel parameter), the regular `grid` will have 27 rows, but `grid_rows` will be a list of 9 grids, each having three rows. Keeping the submodels together in each sub-grid allows the schedule and `loopy()` to gain their submodel speed-ups.
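The 27-row example can be sketched in base R; the parameter names (`mtry`, `min_n`, and `trees`, with `trees` playing the submodel role) are just for illustration:

```r
# Regular grid: 3 levels each for two non-submodel parameters (mtry, min_n)
# and one submodel parameter (trees)
grid <- expand.grid(mtry = 1:3, min_n = c(2, 5, 10), trees = c(100, 500, 1000))
nrow(grid)  # 27

# Group rows by the unique combinations of the non-submodel parameters;
# the submodel values (trees) stay together inside each sub-grid
grid_rows <- split(grid, interaction(grid$mtry, grid$min_n, drop = TRUE))
length(grid_rows)                    # 9 sub-grids
unique(vapply(grid_rows, nrow, 1L))  # each has 3 rows (one per trees value)
```

Each sub-grid can then be handed to the schedule as a unit, so a single model fit can serve all three `trees` values.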
grid_rowsis a little more complex than it appears. Instead of S 1-point grids, it can have multiple rows when a submodel parameter is being tuned. To do this, we emulatemin_gridand group the grid candidates into unique combinations of all non-submodel parameters. For example, if a regular grid is made with 3 levels for each of 3 parameters (2 non-submodel and a single submodel parameter). The regulargridwill have 27 rows, butgrid_rowswill be a list of 9 grids, each having three rows. Containing the submodels in each sub-grid will allow the schedule andloopy()to gain their submodel speed-ups.