
Let SequentialFeatureSelector within GridSearchCV evaluate on same test split

Open ptoews opened this issue 4 years ago • 1 comment

Describe the workflow you want to enable

I'm using SFS within GridSearchCV. As far as I understand it, GridSearchCV splits the data into a train and a test split, fits the inner SFS on the train split, and evaluates the fitted estimator on the test split. The inner SFS therefore trains and evaluates on the train split only, since it never sees the (outer) test split. I think it would be better if the same test split were used, especially when the dataset is small and further splitting makes training harder.
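To make the nesting concrete, here is a minimal, self-contained sketch of the setup described above. It uses scikit-learn's own `SequentialFeatureSelector` in place of mlxtend's SFS purely so the snippet runs without mlxtend installed; the nesting behavior inside `GridSearchCV` is the same. All dataset and parameter choices here are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=3)

# The feature selector runs inside the pipeline, so on each of
# GridSearchCV's 5 outer folds it only ever sees the training portion;
# its own cv=3 further splits that training portion internally.
pipe = Pipeline([
    ("sfs", SequentialFeatureSelector(knn, n_features_to_select=2, cv=3)),
    ("clf", knn),
])
grid = GridSearchCV(pipe, {"sfs__n_features_to_select": [1, 2]}, cv=5)
grid.fit(X, y)  # the selector never touches the outer test folds
```

This is exactly the separation the issue is about: the held-out fold that scores each candidate is invisible to the feature-selection step.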

Describe your proposed solution

My very hacky current solution is to use a custom scoring function that ignores the given training data and instead uses the global test dataset, and to pass the indices parameter (which describes the current feature subset) to the scoring function in this call here: https://github.com/rasbt/mlxtend/blob/2945485168744bbd254378aeda73e2d34ee19024/mlxtend/feature_selection/sequential_feature_selector.py#L38

This is very hacky, but so far I haven't found a better approach. The above works and significantly improves generalization performance in my case.
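The core of the workaround can be sketched as follows. This is a hedged illustration of the idea only: a custom scorer that deliberately ignores the fold data `GridSearchCV` hands it and evaluates on a fixed global test split instead. The names (`X_test_global`, `global_test_scorer`, the plain KNN estimator) are assumptions for illustration; actually wiring the selected feature indices through to such a scorer is the part that requires patching SFS, as described above.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test_global, y_train, y_test_global = train_test_split(
    X, y, test_size=0.3, random_state=0
)

def global_test_scorer(estimator, X_fold, y_fold):
    # Deliberately ignore the fold GridSearchCV passes in and score
    # on the global hold-out set instead (the "hacky" part).
    return estimator.score(X_test_global, y_test_global)

grid = GridSearchCV(
    KNeighborsClassifier(),
    {"n_neighbors": [1, 3, 5]},
    scoring=global_test_scorer,
    cv=3,
)
grid.fit(X_train, y_train)
```

Note the trade-off: because every candidate is scored on the same fixed split, that split is no longer an unbiased test set, so a separate evaluation set would be needed for a final performance estimate.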

ptoews avatar Jun 09 '21 15:06 ptoews

Hi there,

I can see how that can be a limitation in grid search. Given that the current SFS is already relatively complicated and has maybe too many bells and whistles for deviating from scikit-learn's expected use, I am wondering if your solution could maybe be an example we add to the documentation rather than another option in the parameter set?

rasbt avatar Jun 13 '21 17:06 rasbt