
[Suggestion] Purging and embargoing to deal with unintended data leaks in cross validation.

Open cryptocoinserver opened this issue 2 years ago • 3 comments

These approaches are often used in financial ML. Can benefit a wide variety of ML tasks though.

In short: adding a safety gap between the k folds, or between the train, test, and validation splits.

These articles explain it in detail:

https://medium.com/mlearning-ai/why-k-fold-cross-validation-is-failing-in-finance-65c895e83fdf

https://blog.quantinsti.com/cross-validation-embargo-purging-combinatorial/

The Combinatorial Purged Cross-Validation mentioned there (it is explained a little better here: https://towardsai.net/p/l/the-combinatorial-purged-cross-validation-method) helps create more walk-forward paths that are purely out-of-sample, for increased statistical significance. It was proposed by Marcos Lopez de Prado in Advances in Financial Machine Learning.
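For readers landing here, a minimal sketch of what purging/embargoing does to a k-fold split. This is a simplified illustration only, not de Prado's full algorithm (which also purges training samples whose label intervals overlap the test period); the class name and interface are my own, modeled on scikit-learn's splitter convention:

```python
import numpy as np

class PurgedKFold:
    """Contiguous k-fold split that drops training samples within an
    `embargo` gap on either side of the test fold, so information
    cannot leak across the train/test boundary."""

    def __init__(self, n_splits=5, embargo=0):
        self.n_splits = n_splits
        self.embargo = embargo

    def split(self, X, y=None, groups=None):
        n = len(X)
        indices = np.arange(n)
        for test_idx in np.array_split(indices, self.n_splits):
            lo, hi = test_idx[0], test_idx[-1]
            # keep only training samples outside the embargoed gap
            train_mask = (indices < lo - self.embargo) | (indices > hi + self.embargo)
            yield indices[train_mask], test_idx

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.n_splits
```

Because it follows the scikit-learn splitter interface (`split` plus `get_n_splits`), such an object can be passed anywhere a `cv` argument is accepted, e.g. `cross_val_score(model, X, y, cv=PurgedKFold(5, embargo=10))`.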

cryptocoinserver avatar Sep 25 '22 09:09 cryptocoinserver

Hi @cryptocoinserver,

Thanks for the informative blog posts, very interesting indeed. Unfortunately, we do not primarily target time series data, so k-fold cross-validation works well for our typical use cases. However, there have been a few issues about time series before, and you can pass in your own sklearn-style splitting mechanism, as seen in this example:

  • https://automl.github.io/auto-sklearn/master/examples/40_advanced/example_resampling.html#scikit-learn-splitter-objects
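For an embargo-like gap without a custom splitter, scikit-learn's `TimeSeriesSplit` already takes a `gap` parameter (since 0.24) that excludes samples between each training set and its test fold:

```python
from sklearn.model_selection import TimeSeriesSplit

# `gap=10` drops 10 samples between the end of each training set
# and the start of its test fold -- an embargo-style safety gap.
splitter = TimeSeriesSplit(n_splits=5, gap=10)

# Per the example linked above, a splitter object like this can then
# be handed to auto-sklearn (usage assumed from that example):
# automl = autosklearn.classification.AutoSklearnClassifier(
#     resampling_strategy=splitter,
# )
```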

Further there is a previous issue to do with this here:

  • #501

eddiebergman avatar Sep 26 '22 06:09 eddiebergman

Nice. Thank you for the hint. Will take a look at the PredefinedSplit.

cryptocoinserver avatar Sep 26 '22 07:09 cryptocoinserver

Also, as a side note, can K-Fold be used in this example for cross-checking model performance? https://automl.github.io/auto-sklearn/master/examples/20_basic/example_multioutput_regression.html

BradKML avatar Oct 17 '22 08:10 BradKML