auto-sklearn
[Question] Cross validation and Ensembling
Dear all,
I am wondering how greedy ensembling is implemented for cross-validation. I couldn't really find it in the code. Can anybody give me a hint?
My idea of how it could be implemented:
```python
weights = np.zeros(len(models))
ensemble_sel = EnsembleSelection(
    ensemble_size=50,
    task_type=MULTICLASS_CLASSIFICATION,
    random_state=0,
    metric=ba,
)
for k in range(cv_folds):
    validation_indices = get_validation_ids(k)
    ensemble_sel.fit(
        model_val_predictions[k][validation_indices],
        y_test[validation_indices],
        identifiers=None,
    )
    weights += ensemble_sel.weights_
ensemble_sel.weights_ = weights / cv_folds
```
Is this roughly how it works?
Best regards, Felix
Hi @FelixNeutatz,
I spent a while digging through and it seems like we basically consider the cv predictions as one big long array vertically stacked, metric'd with the corresponding targets vertically stacked:
```python
# Not real code, but using your frame of reference.
# Happens in TrainEvaluator:
cv_predictions = [[1, 1, 1], [1, 0, 1], [0, 0, 0]]
cv_targets = [[1, 1, 1], [0, 1, 1], [0, 0, 0]]
stacked_predictions = np.concatenate(cv_predictions)
stacked_targets = np.concatenate(cv_targets)

# Save these to disk; notice how we no longer care that it was CV.
save(id="...", predictions=stacked_predictions, targets=stacked_targets)

# ...

# Somewhere else, during ensemble building, we can feed these to
# ensemble selection:
predictions, targets = load(id="...")
```
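To make the stacking concrete, here's a runnable toy version of the sketch above (pure NumPy, with the `save`/`load` step omitted). It shows that once the per-fold validation predictions are concatenated, a metric can be computed over them with no notion of folds at all:

```python
import numpy as np

# Per-fold validation predictions and targets, one entry per CV fold
cv_predictions = [np.array([1, 1, 1]), np.array([1, 0, 1]), np.array([0, 0, 0])]
cv_targets = [np.array([1, 1, 1]), np.array([0, 1, 1]), np.array([0, 0, 0])]

# Vertically stack into one long array each
stacked_predictions = np.concatenate(cv_predictions)
stacked_targets = np.concatenate(cv_targets)

# A single metric over the stacked arrays (accuracy as a stand-in
# for whatever metric the ensemble builder is configured with)
accuracy = np.mean(stacked_predictions == stacked_targets)
```

The fold boundaries are simply gone at this point, which is why nothing downstream has to special-case cross-validation.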
I'd like to point you to a single place in the code where this happens, but it's spread across several points. Here's my investigation trace, if you're curious:
- The `EnsembleBuilder` is responsible for creating the `EnsembleSelection` and fitting it. After it prunes some memory and selects the candidates we consider for ensemble selection, we create an `EnsembleSelection` and call `fit_ensemble` here: https://github.com/automl/auto-sklearn/blob/013d7eee3c46f0f0c0f66ab7eac9dd1945faf101/autosklearn/ensemble_building/builder.py#L566-L577
- Here is the `fit` method you have in your sample code, called inside the `fit_ensemble` function above. It loads in predictions for the candidate runs but never seems to care whether it's cross-validation or not, so it must be agnostic to that somehow: https://github.com/automl/auto-sklearn/blob/013d7eee3c46f0f0c0f66ab7eac9dd1945faf101/autosklearn/ensemble_building/builder.py#L932-L942
- Following the trace of where these predictions get loaded from:
  - https://github.com/automl/auto-sklearn/blob/013d7eee3c46f0f0c0f66ab7eac9dd1945faf101/autosklearn/ensemble_building/run.py#L128-L133
  - https://github.com/automl/auto-sklearn/blob/013d7eee3c46f0f0c0f66ab7eac9dd1945faf101/autosklearn/ensemble_building/run.py#L77-L80
- Sigh, we've lost the followable trace; we need to find where the predictions are saved.
- It's this file: `TrainEvaluator`.
- It calls `finish_up` and `file_output` at some point, which dump the files.
- Now over to `save_numrun_to_dir`. It basically saves `cv_models` and regular `model` predictions to the same place, backing up the previous assumption.
- Back to `TrainEvaluator` and where it generates this one prediction file per CV evaluation of a pipeline.
- I won't go into detail here, but this is the `cv` branch of the `TrainEvaluator` where it generates these predictions.
- At this point we concatenate everything together, as mentioned above.
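For completeness, here's a hedged, self-contained sketch of what Caruana-style greedy ensemble selection does once it has those stacked predictions. The function name, the error-rate metric, and the toy data are all mine for illustration, not auto-sklearn's actual API:

```python
import numpy as np

def greedy_ensemble_selection(predictions, targets, ensemble_size=50):
    """Greedily pick models (with replacement) to minimize ensemble error.

    predictions: list of arrays, one per model, each of shape
        (n_samples, n_classes), already stacked over all CV folds.
    targets: class labels, shape (n_samples,).
    """
    ensemble_sum = np.zeros_like(predictions[0])
    counts = np.zeros(len(predictions))
    for i in range(ensemble_size):
        # Try adding each model and keep the one that helps most
        losses = []
        for p in predictions:
            avg = (ensemble_sum + p) / (i + 1)
            # error rate as a stand-in for the configured metric
            losses.append(np.mean(avg.argmax(axis=1) != targets))
        best = int(np.argmin(losses))
        ensemble_sum += predictions[best]
        counts[best] += 1
    # Selection counts normalized to weights (the weights_ analogue)
    return counts / ensemble_size

# Toy data: 3 "models", 6 samples, 2 classes
rng = np.random.default_rng(0)
targets = np.array([0, 1, 0, 1, 0, 1])
preds = [rng.random((6, 2)) for _ in range(3)]
weights = greedy_ensemble_selection(preds, targets, ensemble_size=10)
```

Because the inputs are the stacked out-of-fold predictions, a single `fit` over them already accounts for all folds; there is no per-fold fitting and averaging of weights as in the sketch at the top of the thread.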
Thank you so much! This is really helpful. Maybe this should go into the documentation :)
I'll keep it open as something to add into the docs then!
TrainEvaluator looks like something that belongs in this documentation, or are they different? https://automl.github.io/auto-sklearn/master/examples/40_advanced/example_resampling.html#cross-validation
The `TrainEvaluator` is not something that's exposed to the user, so personally I don't think it belongs there; I'm not sure how it would fit in.