
SequentialFeatureSelection Early Stopping Criterion

Open aldder opened this issue 3 years ago • 6 comments

Description

According to studies reported in papers (like this one: https://www.researchgate.net/publication/220804144_Overfitting_in_Wrapper-Based_Feature_Subset_Selection_The_Harder_You_Try_the_Worse_it_Gets), wrapper-based feature selection methods suffer from overfitting as the number of explored states increases. One way to reduce this overfitting is an automatic stopping criterion (early stopping, best known from neural network training). In this PR I have implemented an early stopping criterion for the SequentialFeatureSelector class.

One parameter has been added to the constructor:

early_stop_rounds : int (default: 0)
    Enables the early stopping criterion when > 0; this value determines
    the number of iterations after which, if no performance improvement
    has been seen, execution is stopped.
    Used only when `k_features == 'best'` or `k_features == 'parsimonious'`.
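
The stopping rule follows the usual "patience" pattern: track the best score seen so far and count the rounds without improvement. A minimal standalone sketch of that logic (the function and variable names here are illustrative, not the PR's actual code):

# Standalone sketch of the patience-based stopping rule
# (illustrative; not the PR's actual implementation).
def search_with_early_stopping(candidate_scores, early_stop_rounds):
    best_score = float('-inf')
    rounds_without_improvement = 0
    for k, score in enumerate(candidate_scores, start=1):
        if score > best_score:
            best_score = score
            rounds_without_improvement = 0
        else:
            rounds_without_improvement += 1
            # early_stop_rounds == 0 disables the criterion entirely
            if early_stop_rounds and rounds_without_improvement >= early_stop_rounds:
                print(f'Performances not improved for {early_stop_rounds} rounds. Stopping now!')
                break
    return best_score

# Scores plateau after the third subset size, so the search stops at k=5:
search_with_early_stopping([0.70, 0.80, 0.85, 0.84, 0.85], early_stop_rounds=2)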

Code Example:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs

np.random.seed(0)
iris = load_iris()
X_iris = iris.data
y_iris = iris.target

# Add some noise features so that there is something to discard
X_iris_with_noise = np.concatenate(
    (X_iris,
     np.random.randn(X_iris.shape[0], X_iris.shape[1])),
    axis=1)

knn = KNeighborsClassifier()

# Baseline: early stopping disabled (early_stop_rounds=0),
# so all feature subset sizes are explored
sfs = SFS(
    estimator=knn,
    k_features='best',
    forward=True,
    early_stop_rounds=0,
    verbose=0)

sfs.fit(X_iris_with_noise, y_iris)
plot_sfs(sfs.get_metric_dict());

[Figure 1: performance vs. feature subset size; without early stopping, all subset sizes are explored]

# Early stopping enabled: stop after 2 rounds without improvement
sfs = SFS(
    estimator=knn,
    k_features='best',
    forward=True,
    early_stop_rounds=2,
    verbose=0)

sfs.fit(X_iris_with_noise, y_iris)
plot_sfs(sfs.get_metric_dict());

... Performances not improved for 2 rounds. Stopping now!

[Figure 2: performance vs. feature subset size; the search stops early once performance has not improved for 2 rounds]

Pull Request Checklist

  • [ ] Added a note about the modification or contribution to the ./docs/sources/CHANGELOG.md file (if applicable)
  • [x] Added appropriate unit test functions in the ./mlxtend/*/tests directories (if applicable)
  • [ ] Modify documentation in the corresponding Jupyter Notebook under mlxtend/docs/sources/ (if applicable)
  • [ ] Ran PYTHONPATH='.' pytest ./mlxtend -sv and make sure that all unit tests pass (for small modifications, it might be sufficient to only run the specific test file, e.g., PYTHONPATH='.' pytest ./mlxtend/classifier/tests/test_stacking_cv_classifier.py -sv)
  • [x] Checked for style issues by running flake8 ./mlxtend

aldder avatar Feb 02 '22 13:02 aldder

Thanks for the PR! I agree that overfitting can become an issue. Currently, there is the option

  • k_features="parsimonious"

which will select the smallest feature set within 1 standard error of the best feature set, which helps with this.
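
As a side note, the one-standard-error rule behind k_features='parsimonious' can be sketched in a few lines. This sketch assumes the metric dictionary maps each subset size to entries with 'avg_score' and 'std_err' keys, as get_metric_dict() returns; it approximates the rule rather than reproducing mlxtend's exact code:

# Sketch of the one-standard-error rule behind k_features='parsimonious';
# an approximation, not mlxtend's exact implementation.
def parsimonious_k(metric_dict):
    best_k = max(metric_dict, key=lambda k: metric_dict[k]['avg_score'])
    threshold = metric_dict[best_k]['avg_score'] - metric_dict[best_k]['std_err']
    # smallest subset size whose average score is within 1 SE of the best
    return min(k for k in metric_dict if metric_dict[k]['avg_score'] >= threshold)

metric_dict = {
    1: {'avg_score': 0.90, 'std_err': 0.02},
    2: {'avg_score': 0.95, 'std_err': 0.02},  # within 1 SE of the best
    3: {'avg_score': 0.96, 'std_err': 0.02},  # best score
}
print(parsimonious_k(metric_dict))  # -> 2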

I like adding an early_stop option, but I have a few suggestions / concerns regarding the API:

1)

I think that the two parameters early_stop and early_stop_rounds can be consolidated into a single one. E.g.,


if self.early_stop and k != k_to_select:
    if k_score <= best_score:

could be

if self.early_stop_rounds and k != k_to_select:
    if k_score <= best_score:

What I mean is that instead of having

  • early_stop : bool (default: False)
  • early_stop_rounds : int (default: 3)

this could be simplified to

  • early_stop_rounds : int (default: 0)

2)

The second concern I have is that if a user selects e.g., k_features=(1, 3) with early_stop_rounds=3, it is not guaranteed that the final subset will contain 1 to 3 features, which can be confusing.

I wonder if it makes sense to allow early_stop_rounds only for k_features='best' and k_features='parsimonious', which both explore the whole feature subset size space.

E.g.,

  • if k_features='best' is selected with early_stop_rounds=0, it will evaluate and select the best feature subset in the range 1 to m.
  • if k_features='best' is selected with early_stop_rounds=2, it will evaluate and select the best feature subset in the range 1 to m, but it may stop early, which means that forward selection may not explore all feature subset sizes up to m.

If a user selects e.g., k_features=3 and early_stop_rounds=2, we could

  • a) raise an error saying f"Early stopping is not supported with fixed feature subset sizes."
  • b) raise a warning saying f"Early stopping may lead to feature subset sizes that are different from k_features={k_features}." (a sketch of this follows below)
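
A rough sketch of what option (b) could look like (the _validate_early_stop helper is hypothetical, not part of this PR):

import warnings

# Hypothetical validation helper illustrating option (b); not part of the PR.
def _validate_early_stop(k_features, early_stop_rounds):
    if early_stop_rounds and k_features not in ('best', 'parsimonious'):
        warnings.warn(
            f'Early stopping may lead to feature subset sizes that are '
            f'different from k_features={k_features}.'
        )

_validate_early_stop(k_features=3, early_stop_rounds=2)  # emits the warning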

What are your thoughts?

rasbt avatar Feb 02 '22 14:02 rasbt

Thanks for your suggestions, I agree with your points. I will update this PR with the following changes:

  • have only one parameter, early_stop_rounds : int >= 0
  • enable early stopping only if early_stop_rounds > 0 and k_features in ['best', 'parsimonious'] (see the sketch after this list)
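
Put together, the two points amount to a single gating condition; a minimal sketch (illustrative, not the PR's actual code):

# Illustrative gating condition combining both points above;
# not the PR's actual code.
def early_stopping_enabled(k_features, early_stop_rounds):
    if early_stop_rounds < 0:
        raise ValueError('early_stop_rounds must be an int >= 0')
    return early_stop_rounds > 0 and k_features in ('best', 'parsimonious')

assert early_stopping_enabled('best', 2) is True
assert early_stopping_enabled('best', 0) is False     # disabled
assert early_stopping_enabled((1, 3), 2) is False     # fixed-range k_features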

aldder avatar Feb 02 '22 15:02 aldder

Sounds great! Looking forward to it!

rasbt avatar Feb 02 '22 15:02 rasbt

Hello @aldder! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! :beers:

Comment last updated at 2022-02-03 10:37:22 UTC

pep8speaks avatar Feb 02 '22 15:02 pep8speaks

What is the status on this?

I use k_features="parsimonious" for my model, but it keeps adding more and more features even after it is obvious that the model will not improve, and in the end it selects one of the very early models anyway.

I think this PR could cut my runtime from 10 days down to hours ;-)

jimmy927 avatar Oct 02 '23 13:10 jimmy927

Thanks for the ping, and I need to look into this some time -- sorry, haven't had a chance recently due to too many other commitments. Unfortunately, I currently don't have a timeline for this.

rasbt avatar Oct 14 '23 12:10 rasbt