SequentialFeatureSelector Early Stopping Criterion
Description
According to studies reported in the literature (e.g., https://www.researchgate.net/publication/220804144_Overfitting_in_Wrapper-Based_Feature_Subset_Selection_The_Harder_You_Try_the_Worse_it_Gets), wrapper-based feature selection methods suffer from overfitting as the number of explored states increases. One way to reduce this overfitting is an automatic stopping criterion (early stopping, as is well known from neural network training). In this PR I have implemented such an early stopping criterion for the SequentialFeatureSelector class.
One parameter has been added to the constructor:

early_stop_rounds : int (default: 0)
    Enables the early stopping criterion when > 0; this value determines the number of iterations after which, if no performance improvement has been seen, execution is stopped.
    Used only when `k_features == 'best'` or `k_features == 'parsimonious'`.
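To make the semantics concrete, here is a minimal standalone sketch of this kind of patience logic (illustrative only, with made-up scores; not the actual SequentialFeatureSelector internals):

```python
# Patience-style early stopping over per-subset-size CV scores
# (hypothetical standalone sketch, made-up numbers).
early_stop_rounds = 2
cv_scores_per_size = [0.80, 0.91, 0.93, 0.92, 0.93, 0.90]  # assumed scores

best_score = float("-inf")
stale_rounds = 0
for k, k_score in enumerate(cv_scores_per_size, start=1):
    if k_score > best_score:
        best_score = k_score
        stale_rounds = 0  # reset the counter on any improvement
    else:
        stale_rounds += 1
        if stale_rounds >= early_stop_rounds:
            print(f"No improvement for {early_stop_rounds} rounds; stopping at k={k}.")
            break
```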
Code Example:
```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs

np.random.seed(0)

iris = load_iris()
X_iris = iris.data
y_iris = iris.target

# add some noise in order to have features to discard
X_iris_with_noise = np.concatenate(
    (X_iris,
     np.random.randn(X_iris.shape[0], X_iris.shape[1])),
    axis=1)
knn = KNeighborsClassifier()
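
# baseline run: early stopping disabled, all subset sizes are explored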
sfs = SFS(
    estimator=knn,
    k_features='best',
    forward=True,
    early_stop_rounds=0,
    verbose=0)
sfs.fit(X_iris_with_noise, y_iris)
plot_sfs(sfs.get_metric_dict());
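
# same search with early stopping: halt after 2 consecutive subset sizes
# without improvement in the cross-validation score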
sfs = SFS(
    estimator=knn,
    k_features='best',
    forward=True,
    early_stop_rounds=2,
    verbose=0)
sfs.fit(X_iris_with_noise, y_iris)
plot_sfs(sfs.get_metric_dict());
```
... Performances not improved for 2 rounds. Stopping now!
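As a quick sanity check, the selector's `subsets_` attribute holds one entry per evaluated subset size, so the early-stopped run should contain fewer entries than the exhaustive one:

```python
# the early-stopped search visited fewer subset sizes than the full sweep
print(len(sfs.subsets_))
```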
Pull Request Checklist
- [ ] Added a note about the modification or contribution to the `./docs/sources/CHANGELOG.md` file (if applicable)
- [x] Added appropriate unit test functions in the `./mlxtend/*/tests` directories (if applicable)
- [ ] Modify documentation in the corresponding Jupyter Notebook under `mlxtend/docs/sources/` (if applicable)
- [ ] Ran `PYTHONPATH='.' pytest ./mlxtend -sv` and make sure that all unit tests pass (for small modifications, it might be sufficient to only run the specific test file, e.g., `PYTHONPATH='.' pytest ./mlxtend/classifier/tests/test_stacking_cv_classifier.py -sv`)
- [x] Checked for style issues by running `flake8 ./mlxtend`
Thanks for the PR! I agree that overfitting can become an issue. Currently, there is the option `k_features="parsimonious"`, which will select the smallest feature set within 1 standard error of the best feature set, which helps with this.
I like adding an `early_stop` option, but I have a few suggestions / concerns regarding the API:
1) I think that the two parameters `early_stop` and `early_stop_rounds` can be consolidated into a single one. E.g.,

```python
if self.early_stop and k != k_to_select:
    if k_score <= best_score:
```

could be

```python
if self.early_stop_rounds and k != k_to_select:
    if k_score <= best_score:
```
What I mean is instead of having

- `early_stop : bool (default: False)`
- `early_stop_rounds : int (default: 3)`

this could be simplified to

- `early_stop_rounds : int (default: 0)`
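A hypothetical parameter check, just to make the consolidated API concrete (names assumed, not the PR's actual code):

```python
def validate_early_stop_rounds(early_stop_rounds):
    # a single non-negative int doubles as the on/off switch, since 0 is
    # falsy in Python and naturally means "early stopping disabled"
    if not isinstance(early_stop_rounds, int) or early_stop_rounds < 0:
        raise ValueError("early_stop_rounds must be an int >= 0")
    return early_stop_rounds
```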
2) The second concern I have is that if a user selects e.g., `k_features=(1, 3), early_stop_rounds=3`, it's not necessarily guaranteed that there will be 1 to 3 features, which can be confusing. I wonder if it makes sense to allow `early_stop_rounds` only for `k_features='best'` and `k_features='parsimonious'`, which both explore the whole feature subset size space. E.g.,

- if `k_features='best'` is selected with `early_stop_rounds=0`, it will evaluate and select the best feature subset in the range 1 to m.
- if `k_features='best'` is selected with `early_stop_rounds=2`, it will evaluate and select the best feature subset in the range 1 to m, but it may stop early, which means that for forward selection it may not explore all feature subset sizes up to m.

If a user selects e.g., `k_features=3` and `early_stop_rounds=2`, we could

- a) raise an error saying `f"Early stopping is not supported with fixed feature subset sizes."`
- b) raise a warning saying `f"Early stopping may lead to feature subset sizes that are different from k_features={k_features}."` (a hypothetical sketch of such a guard is shown below)
What are your thoughts?
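For illustration, option b) could be implemented with a small guard along these lines (hypothetical names, not the actual implementation):

```python
import warnings

def check_early_stop_compatibility(k_features, early_stop_rounds):
    # warn rather than raise when early stopping is combined with a fixed
    # or bounded feature subset size
    if early_stop_rounds > 0 and k_features not in ('best', 'parsimonious'):
        warnings.warn(
            f"Early stopping may lead to feature subset sizes "
            f"that are different from k_features={k_features}."
        )
```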
Thanks for your suggestions, I agree with your points. I will edit this PR with the following:

- have only one parameter `early_stop_rounds : int >= 0`
- enable early stopping only if `early_stop_rounds > 0` and `k_features in ['best', 'parsimonious']`
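Roughly, the agreed-upon gating condition would then be (illustrative sketch; the actual attribute names in the PR may differ):

```python
def early_stopping_enabled(early_stop_rounds, k_features):
    # active only for the modes that sweep the whole subset-size range
    return early_stop_rounds > 0 and k_features in ('best', 'parsimonious')
```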
Sounds great! Looking forward to it!
What is the status on this?
I use `k_features="parsimonious"` on my model, but it continues to add more and more features even after it is obvious that the model will not improve, and in the end it selects one of the very early models anyway.
I think this PR could get my runtime down from 10 days to hours ;-)
Thanks for the ping, and I need to look into this some time -- sorry, haven't had a chance recently due to too many other commitments. Unfortunately, I currently don't have a timeline for this.