auto-sklearn
[Suggestion] Purging and embargoing to deal with unintended data leaks in cross validation.
These approaches are often used in financial ML, but they can benefit a wide variety of ML tasks.
In short: add a safety gap between the k folds, or between the train, validation, and test splits, so that overlapping or adjacent samples cannot leak information across the split boundary (a rough sketch follows below).
These articles explain it in detail:
https://medium.com/mlearning-ai/why-k-fold-cross-validation-is-failing-in-finance-65c895e83fdf
https://blog.quantinsti.com/cross-validation-embargo-purging-combinatorial/
The Combinatorial Purged Cross-Validation mentioned there (it is explained a little better here: https://towardsai.net/p/l/the-combinatorial-purged-cross-validation-method) helps create more walk-forward paths that are purely out-of-sample, for increased statistical significance. It was proposed by Marcos López de Prado in *Advances in Financial Machine Learning*.
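For illustration, here is a rough, untested sklearn-style splitter that sketches the purging/embargo idea; the class name `PurgedKFold` and the `gap` parameter are just illustrative, not taken from the articles:

```python
import numpy as np
from sklearn.model_selection import KFold


class PurgedKFold:
    """K-fold with an embargo gap: training samples that fall within
    `gap` indices of the test block are purged, so overlapping or
    adjacent samples cannot leak information across the fold boundary."""

    def __init__(self, n_splits=5, gap=10):
        self.n_splits = n_splits
        self.gap = gap

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.n_splits

    def split(self, X, y=None, groups=None):
        n = len(X)
        # Without shuffling, KFold yields contiguous test blocks, so the
        # purge reduces to a distance check against the block's endpoints.
        for train_idx, test_idx in KFold(self.n_splits).split(np.arange(n)):
            lo, hi = test_idx.min(), test_idx.max()
            keep = (train_idx < lo - self.gap) | (train_idx > hi + self.gap)
            yield train_idx[keep], test_idx
```

With `gap=0` this reduces to plain k-fold.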
Hi @cryptocoinserver,
Thanks for the informative blog posts, very interesting indeed. Unfortunately, we do not primarily target time series data, so plain k-fold cross-validation works well for the tasks we do target. However, there have been a few issues about time series before, and you can pass in your own sklearn-style splitting mechanism, as seen in this example:
- https://automl.github.io/auto-sklearn/master/examples/40_advanced/example_resampling.html#scikit-learn-splitter-objects
Further, there is a previous issue related to this here:
- #501
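Roughly, passing a splitter looks like this (an untested sketch; `TimeSeriesSplit` with its `gap` argument, available in scikit-learn >= 0.24, is just one splitter that already leaves an embargo between the train and test windows):

```python
import sklearn.model_selection
from autosklearn.regression import AutoSklearnRegressor

# Any sklearn-style splitter object can be passed as the resampling strategy.
# TimeSeriesSplit's `gap` argument leaves `gap` samples between each
# training window and its test window.
splitter = sklearn.model_selection.TimeSeriesSplit(n_splits=5, gap=10)

automl = AutoSklearnRegressor(
    time_left_for_this_task=120,
    resampling_strategy=splitter,
)
# automl.fit(X_train, y_train)
```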
Nice, thank you for the hint. I will take a look at `PredefinedSplit`.
Also, as a side note: can k-fold be used in this example to cross-check model performance? https://automl.github.io/auto-sklearn/master/examples/20_basic/example_multioutput_regression.html