
[Question] Cross validation and Ensembling

Open FelixNeutatz opened this issue 3 years ago • 5 comments

Dear all,

I am wondering how greedy ensembling is implemented for cross-validation. I couldn't really find it in the code. Can anybody give me a hint?

My idea of how it could be implemented:

weights = np.zeros(len(models))
ensemble_sel = EnsembleSelection(ensemble_size=50,
                                 task_type=MULTICLASS_CLASSIFICATION,
                                 random_state=0,
                                 metric=ba)

# Fit the selection once per fold, then average the resulting weights
for k in range(cv_folds):
    validation_indices = get_validation_ids(k)
    ensemble_sel.fit(model_val_predictions[k][validation_indices], y_test[validation_indices], identifiers=None)
    weights += ensemble_sel.weights_
ensemble_sel.weights_ = weights / cv_folds

Is this roughly how it works?

Best regards, Felix

FelixNeutatz • Sep 07 '22 14:09

Hi @FelixNeutatz,

I spent a while digging through the code. It seems we basically treat the CV predictions as one long, vertically stacked array and compute the metric against the corresponding targets, stacked the same way:

# Not real code but just to use your frame of reference
# Happens in TrainEvaluator
cv_predictions = [[1, 1, 1], [1, 0, 1], [0, 0, 0]]
cv_targets = [[1, 1, 1], [0, 1, 1], [0, 0, 0]] 
stacked_predictions = np.concatenate(cv_predictions)
stacked_targets = np.concatenate(cv_targets)

# Save these to disk, notice how we don't care if it's CV now
save(stacked_predictions, stacked_targets, id="...")

#
# ...
#

# Somewhere else while ensemble building, we can feed these to ensemble selection
predictions, targets = load(id="...")
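
To make the stacking idea concrete, here is a small runnable sketch using the toy fold values from the snippet above. It is not auto-sklearn code, and plain accuracy is only an assumed stand-in for whatever metric is configured. The point is that, once stacked, the CV predictions are scored just like a single holdout set.

```python
import numpy as np

# Toy per-fold validation predictions and targets (one inner list per fold)
cv_predictions = [[1, 1, 1], [1, 0, 1], [0, 0, 0]]
cv_targets = [[1, 1, 1], [0, 1, 1], [0, 0, 0]]

# Stack the folds into one long array each; fold boundaries are no longer visible
stacked_predictions = np.concatenate(cv_predictions)
stacked_targets = np.concatenate(cv_targets)

# Accuracy is an assumption here; 7 of the 9 stacked predictions match their targets
accuracy = float(np.mean(stacked_predictions == stacked_targets))
print(stacked_predictions.shape, accuracy)
```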

I'd like to point you to a single place in the code where this happens, but the logic is spread across several places. Here's my investigation trace, if you're curious:

  • The EnsembleBuilder is responsible for creating the EnsembleSelection and fitting it. After it prunes some memory and selects candidates we consider for ensemble selection, we create an EnsembleSelection and call fit_ensemble here https://github.com/automl/auto-sklearn/blob/013d7eee3c46f0f0c0f66ab7eac9dd1945faf101/autosklearn/ensemble_building/builder.py#L566-L577
  • Here is the fit method you have in your sample code, called inside the fit_ensemble function above. It loads the predictions for the candidate runs but never seems to check whether they come from cross-validation or not, so it must be agnostic to that. https://github.com/automl/auto-sklearn/blob/013d7eee3c46f0f0c0f66ab7eac9dd1945faf101/autosklearn/ensemble_building/builder.py#L932-L942
  • Following the trace of where these predictions get loaded from:
    • https://github.com/automl/auto-sklearn/blob/013d7eee3c46f0f0c0f66ab7eac9dd1945faf101/autosklearn/ensemble_building/run.py#L128-L133
    • https://github.com/automl/auto-sklearn/blob/013d7eee3c46f0f0c0f66ab7eac9dd1945faf101/autosklearn/ensemble_building/run.py#L77-L80
    • Sigh, we've lost the follow-able trace, need to find where predictions are saved
    • It's this file TrainEvaluator
    • Calls this finish_up and file_output at some point which dumps files.
    • Goes over here now save_numrun_to_dir. It basically saves cv_models and regular model predictions to the same place, backing up the previous assumption
  • Back to TrainEvaluator and where it generates this one prediction file per cv evaluation of a pipeline.
  • Won't go into detail here but this is the cv branch of the TrainEvaluator where it generates these predictions
  • At this point we concat it together as mentioned above
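
The trace above suggests the ensemble is fitted once on the stacked predictions rather than once per fold. Below is a minimal sketch of that idea, assuming a simple greedy selection with replacement and plain accuracy as the metric. It is not auto-sklearn's EnsembleSelection; `greedy_ensemble_selection`, the random fold data, and all variable names are hypothetical and only illustrate the data flow.

```python
import numpy as np


def greedy_ensemble_selection(model_probas, targets, ensemble_size):
    """Greedy selection with replacement on stacked validation predictions.

    model_probas: shape (n_models, n_samples, n_classes)
    targets: shape (n_samples,), integer class labels
    Returns weights of shape (n_models,) that sum to 1.
    """
    n_models = model_probas.shape[0]
    counts = np.zeros(n_models, dtype=int)
    running_sum = np.zeros_like(model_probas[0])
    for step in range(1, ensemble_size + 1):
        # Accuracy of the ensemble that would result from adding each candidate
        scores = [
            np.mean(np.argmax((running_sum + proba) / step, axis=1) == targets)
            for proba in model_probas
        ]
        best = int(np.argmax(scores))  # ties go to the lowest model index
        counts[best] += 1
        running_sum += model_probas[best]
    return counts / ensemble_size


rng = np.random.default_rng(0)
n_folds, fold_size, n_classes, n_models = 3, 20, 3, 4

# Hypothetical per-fold targets and out-of-fold class probabilities per model
fold_targets = [rng.integers(0, n_classes, size=fold_size) for _ in range(n_folds)]
fold_probas = [
    [rng.dirichlet(np.ones(n_classes), size=fold_size) for _ in range(n_folds)]
    for _ in range(n_models)
]

# Stack the folds first, then run the selection a single time
stacked_targets = np.concatenate(fold_targets)
stacked_probas = np.stack([np.concatenate(per_fold) for per_fold in fold_probas])

weights = greedy_ensemble_selection(stacked_probas, stacked_targets, ensemble_size=10)
print(weights)
```

Compared with fitting per fold and averaging the weights, this treats the stacked out-of-fold predictions as one validation set, so a single set of weights comes out directly.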

eddiebergman • Sep 12 '22 10:09

Thank you so much! This is really helpful. Maybe this should go into the documentation :)

FelixNeutatz • Sep 12 '22 10:09

I'll keep it open as something to add into the docs then!

eddiebergman • Sep 12 '22 10:09

TrainEvaluator looks like something that belongs in this documentation, or are they different? https://automl.github.io/auto-sklearn/master/examples/40_advanced/example_resampling.html#cross-validation

BradKML • Oct 19 '22 09:10

The TrainEvaluator is not exposed to the user, so personally I don't think it belongs there. I'm not sure how it would fit into that page.

eddiebergman • Oct 20 '22 07:10