Model performance metrics: ROC, Conf Matrix, compare to single models
Hi tidymodels/stacks team,
I love the stacks package but cannot find any documentation on how to obtain model performance metrics, akin to the standard tidymodels workflow pipeline.
I am interested in comparing the model performance of the stacks model to my single model workflows generated using workflowsets.
For this, I would like to plot the ROC curve and obtain sensitivity/specificity and confusion matrix values, not just a single AUC at the end.
This is my current code in {stacks}. Any help would be greatly appreciated. Thanks!
########################################################################
# Define resampling and basic recipe
########################################################################
set.seed(1234)
cores <- parallel::detectCores() - 1 # determine number of cores
d_folds <- vfold_cv(d, v = 10, repeats = 5, strata = status) # creates cross validation
keep_pred <- control_resamples(save_pred = TRUE, save_workflow = TRUE)
ctrl_grid <- control_stack_grid() # use same control settings as in numeric response setting
log_recipe <- recipe(status ~ ., data = d) %>% step_log(all_predictors())
status_wkfl <- workflow() %>% add_recipe(log_recipe)
########################################################################
# Define models to predict status
########################################################################
logistic_reg_spec_tuned <- logistic_reg() %>% # no tuning parameters
  set_engine("glm") %>%
  set_mode("classification")

boost_tree_spec_tuned <- boost_tree() %>%
  set_args(tree_depth = 4, learn_rate = 0.001, loss_reduction = 0.005, mtry = 2, # stop iter
           min_n = 27, sample_size = 0.8125, trees = 10000) %>%
  set_mode("classification") %>%
  set_engine("xgboost")

nearest_neighbor_spec_tuned <- nearest_neighbor() %>%
  set_args(neighbors = 14, dist_power = 1.525, weight_func = "gaussian") %>%
  set_mode("classification") %>%
  set_engine("kknn")

discrim_linear_spec_tuned <- discrim_linear() %>% # no tuning parameters
  set_mode("classification") %>%
  set_engine("MASS")

naive_bayes_spec_tuned <- naive_Bayes(smoothness = 0.5, Laplace = 1.75) %>%
  set_mode("classification") %>%
  set_engine("naivebayes")

discrim_quad_spec_tuned <- discrim_quad() %>% # no tuning parameters
  set_mode("classification") %>%
  set_engine("MASS")

rand_forest_spec_tuned <- rand_forest() %>%
  set_args(trees = 10000, mtry = 2, min_n = 7) %>%
  set_mode("classification") %>%
  set_engine("ranger",
             importance = "permutation",
             num.threads = cores,
             probability = TRUE,
             seed = 1234)

svm_linear_spec_tuned <- svm_linear() %>%
  set_args(cost = 0.002, margin = 0.167) %>%
  set_mode("classification") %>%
  set_engine("kernlab")
########################################################################
# Create {workflowsets} object
########################################################################
antibody_wkfl_tuned <- workflow_set(
  preproc = list(log = log_recipe),
  models = list(
    rand_forest = rand_forest_spec_tuned,
    logistic_reg = logistic_reg_spec_tuned,
    discrim_linear = discrim_linear_spec_tuned,
    discrim_quad = discrim_quad_spec_tuned,
    nearest_neighbor = nearest_neighbor_spec_tuned,
    boost_tree = boost_tree_spec_tuned,
    svm_linear = svm_linear_spec_tuned,
    naive_bayes = naive_bayes_spec_tuned
  )
)
########################################################################
# Run resampling
########################################################################
grid_ctrl <- control_grid(
  save_pred = TRUE,
  parallel_over = "everything",
  save_workflow = TRUE
)

final_models <-
  antibody_wkfl_tuned %>%
  workflow_map(
    seed = 1234,
    resamples = d_folds,
    grid = 25,
    control = grid_ctrl
  )
########################################################################
# Putting my stack together
########################################################################
antibody_wkfl_stack <-
  # initialise the stack
  stacks() %>%
  # add candidate members, i.e. the {workflowsets} object
  add_candidates(final_models)

as_tibble(antibody_wkfl_stack)
########################################################################
# Blend the stack
########################################################################
antibody_wkfl_blend <-
  antibody_wkfl_stack %>%
  blend_predictions()
antibody_wkfl_blend

# view roc_auc for the various penalty values:
antibody_wkfl_blend$metrics %>% filter(.metric == "roc_auc")
########################################################################
# Fit the stack
########################################################################
antibody_wkfl_fit_stack <-
  # initialise the stack
  stacks() %>%
  # add candidate members, i.e. the {workflowsets} object
  add_candidates(final_models) %>%
  # determine how to combine predictions
  blend_predictions() %>%
  # fit the candidates with non-zero stacking coefficients
  fit_members()
antibody_wkfl_fit_stack
########################################################################
# Evaluate stack performance
########################################################################
# Plot: trade-off between minimising the number of members and optimising performance
autoplot(antibody_wkfl_fit_stack)

# Plot: stacking weights
autoplot(antibody_wkfl_fit_stack, type = "weights")

# Collect parameters, e.g. for the random forest candidate
collect_parameters(antibody_wkfl_fit_stack, "log_rand_forest")
Thanks for the issue and the kind words on stacks!
You can use predict() or augment() on fitted model stacks just as you would on regular tidymodels workflows, then calculate metrics with yardstick functions:
library(yardstick)

# Generate predictions with the fitted stack
stack_predictions <- augment(antibody_wkfl_fit_stack, new_data = test_data)

# ROC curve
stack_predictions %>%
  roc_curve(truth = status, <.pred_positive_class_name>) %>%
  autoplot()

# Multiple metrics at once
stack_predictions %>%
  metric_set(accuracy, sens, spec, roc_auc)(
    truth = status,
    estimate = .pred_class,
    <.pred_positive_class_name>
  )
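For the confusion matrix you mentioned, something along these lines should work with the same (hypothetical) stack_predictions object:

# Confusion matrix from the hard class predictions
stack_predictions %>%
  conf_mat(truth = status, estimate = .pred_class)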
If you want those metrics to live in the stack itself / be calculated during resampling, use the metric argument of blend_predictions().
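For instance, a minimal sketch using the objects from your code above (the particular metrics passed to metric_set() are just an illustration):

# blend_predictions() accepts a yardstick::metric_set() via its metric argument,
# so the penalty for the meta-learner is tuned/evaluated with those metrics
antibody_wkfl_blend <-
  antibody_wkfl_stack %>%
  blend_predictions(metric = yardstick::metric_set(roc_auc, sens, spec))

# the resampled metrics for the candidate penalty values then live in:
antibody_wkfl_blend$metrics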
I agree that there might be opportunity here for tighter integration on the end-game, e.g. last_fit() or collect_metrics()/compute_metrics().
Thank you for your quick response!
I was actually hoping to apply resampling/cross-validation on the final stack (before fitting the full model) to estimate sensitivity/specificity thresholds and the ROC curve.
The goal is to identify a probability threshold for classifying the model's predictions, for example the threshold corresponding to 85% specificity (and its associated sensitivity), so that when the fitted model is applied to new data I can use that predefined 85%-specificity threshold.
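For illustration, something like the following is what I have in mind, as a rough sketch. Here resampled_preds and .pred_positive are placeholder names for a data frame of out-of-fold predictions and the probability column of the positive class:

# ROC curve computed over the resampled (out-of-fold) predictions
roc_df <- roc_curve(resampled_preds, truth = status, .pred_positive)

# threshold whose specificity is closest to (at least) 85%,
# together with its associated sensitivity
roc_df %>%
  dplyr::filter(specificity >= 0.85) %>%
  dplyr::slice_min(specificity, n = 1)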
Apologies if I missed your intent there!
Just to make sure I answer the right question:
- Is the "final stack" the output of
add_candidates(),blend_predictions(), orfit_members()? - Is the "full model" the meta-learner or the members fitted on the entire training set?
- Are the predictions mentioned in "to decide how to classify predictions from a model" the probability predictions from the fitted members or the final, combined probabilities estimated by the model stack?
No worries at all - I likely didn't communicate that clearly above (still very new to machine learning!)
- In this case, the final stack refers to the output of blend_predictions().
- The full model refers to the meta-learner that is fitted on the entire training set.
- I am looking for the predictions that would be obtained using collect_predictions() in tidymodels on a resampled model that has not been fit to the data.
Please correct me if my understanding is incorrect! From what I understand, I was hoping I could obtain these performance metrics from the meta-learner and then compare them to the single models.
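For context, this is roughly how I obtain the equivalent resampled predictions and ROC curve for one of the single models in the workflow set, and what I was hoping to mirror for the stack (a sketch; .pred_positive is a placeholder for my positive-class probability column):

# out-of-fold predictions for one of the single models
rf_preds <- final_models %>%
  extract_workflow_set_result("log_rand_forest") %>%
  collect_predictions()

# resampled ROC curve for that single model
rf_preds %>%
  roc_curve(truth = status, .pred_positive) %>%
  autoplot()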
Hi {stacks} developers, I was wondering if there has been any update or progress on whether the above is feasible, or if there is a way to obtain at least an ROC curve of the final stack, given that an AUC is reported? Thanks again and I really love this package!