Issue when using Skfolio CombinatorialPurgedCV alongside GridSearchCV object [BUG]
Describe the bug
Hello! The following TypeError occures when calling the .split() method of CombinatorialPurgedCV class, inside of a GridSearchCV class from sklearn.
TypeError: unhashable type: 'numpy.ndarray'
To Reproduce
<
import numpy as np
import pandas as pd
import xgboost as xgb
import skfolio
import sklearn
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_classification
from skfolio.model_selection import CombinatorialPurgedCV
# Generate synthetic dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
# Define XGBoost Classifier
xgb_model = xgb.XGBClassifier(eval_metric='logloss', random_state=72)
# Define parameter grid
param_grid = {
'learning_rate': np.arange(0.05, 1.0, 0.05)
}
# Create CPCV object
N = 6
k = 2
cpcv = CombinatorialPurgedCV(n_folds=N, n_test_folds=2)
cpcv.summary(X)
# Perform GridSearchCV
grid_search = GridSearchCV(estimator=xgb_model, param_grid=param_grid, scoring='accuracy', cv=cpcv.split(X,y), verbose=1, n_jobs=-1)
grid_search.fit(X, y)
# Potential Error source (lines 310, 311 in _combinatorial.py):
fold_index = {fold_id: np.argwhere(fold_idx == fold_id).reshape(-1) for fold_id in range(self.n_folds)}
for i in range(self.n_splits):
train_index_list = np.argwhere(index_train_test[:, i] == 0).reshape(-1)
test_index_list = [fold_index[fold_id] for fold_id, _ in np.argwhere(recombine_paths == i)]
>
Expected behavior
Investigate and potentially update code where relevant so GridSearhCV integration could work properly. Additional context
The issue appears in the .split() method where the train/test indices are NumPy array objects, before passing them to the generator (yield). See code above, under Potential Error source. Would setting the test set index object as follows help?
test_index = np.argwhere(index_train_test[:, i] == 1).reshape(-1) # Analogy to train set index
Happy to hear your thoughts!
Versions
Hello, it's quite interesting that you decided to use CombinatorialPurgedCV for general ML estimators rather than skfolio's portfolio models.
The split method of CombinatorialPurgedCV differs from the scikit-learn cross-validator API because it returns a list of two test indices, as opposed to the single test indices you get from non-combinatorial cross-validators.
The reason we return a list of test indices and not their aggregation is that, in asset allocation, you want to reconstruct the N test paths and analyze the distributions of your N out-of-sample portfolios. Therefore, we implemented our own cross_val_predict to handle this and reconstruct the N test paths.
However, your example demonstrates that it's also valid to use CombinatorialPurgedCV with GridSearchCV for general ML estimators. The fix is relatively simple: we just need to aggregate both test indices in split. That said, we still need to account for the current asset allocation approach, so I'll need to consider the best way to support both use cases within the API.
In the meantime, you can do the following:
class CPCV(CombinatorialPurgedCV):
def split(self, X, y=None, groups=None):
for train_index, test_index_list in super().split(X, y, groups):
yield train_index, np.concatenate(test_index_list)
cpcv = CPCV(n_folds=N, n_test_folds=k)
# Perform GridSearchCV
grid_search = GridSearchCV(estimator=xgb_model, param_grid=param_grid,
scoring='accuracy', cv=cpcv.split(X, y), verbose=1,
n_jobs=-1)
grid_search.fit(X, y)
After some consideration, aggregating test indices would create overlapping test sets, reduce variability in test conditions, and potentially introduce bias.
However, it could still be a valid approach for datasets without sequential dependency.
Nevertheless, the "right" approach is to implement a new GridSearchCV that handles combinatorial CV directly, rather than aggregating indices to fit the current implementation.
Hi Hugo,
Thank you for your detailed reply. I was just starting to play around with skfolio's objects and wanted to have a check how it combines with other objects from scikit-learn. Indeed, I will look into cross_val_score function from skfolio to get a better understanding. Fully agree, a direct implementation would be desireable, where this new GridSearchCV could incorporate CPCV.
An idea of a quick fix make it work with GridSearchCV and compare the results against a custom CPCV implementation was to do the following:
test_index = np.argwhere(index_train_test[:, i] == 1).reshape(-1)
But now I see your comment about using:
np.concatenate(test_index_list)
Should the outcome of the above be identical?
In any case I did some custom CPCV implementations in the past and I usually compare against to check for any differences: will look into it after the holidays and keep you posted. Enjoy the upcoming holidays with your loved ones!
yes the outcome should be the same.
Thanks, have great holidays too!