Replace `k_features` in `SequentialFeatureSelector` by `feature_range` and `recipe`
For ease of use and versatility, the `k_features` parameter should be changed to:
- `feature_range`: (min_val, max_val) or "all"
- `recipe`: "best" or "parsimonious"
Regarding `feature_range`: if "all" is selected, feature subsets of all sizes will be considered as candidates for the best feature subset, which is selected based on what's specified under `recipe`.
Regarding `recipe`: if "best" is provided, the feature selector will return the feature subset with the best cross-validation performance. If "parsimonious" is provided, the smallest feature subset that is within one standard error of the best cross-validation performance will be selected.
E.g., if `feature_range=(3, 5)` and `recipe='best'`, the feature subset with the best performance will be selected, and this feature subset can have either 3, 4, or 5 features.
Note that it would be best to deprecate `k_features` and default it to `None`. However, if `k_features` is not `None`, it should take priority over the new parameters to avoid breaking existing code bases.
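To illustrate, here is a hypothetical sketch of how the proposed API might look. Note that `feature_range` and `recipe` do not exist in any current release; this is only a suggestion of the intended behavior, so the snippet will not run against the present `SequentialFeatureSelector`:

```python
from sklearn.neighbors import KNeighborsClassifier
from mlxtend.feature_selection import SequentialFeatureSelector as SFS

knn = KNeighborsClassifier(n_neighbors=3)

# Proposed: consider subsets with 3-5 features and return the subset
# with the best cross-validation score
sfs = SFS(knn, feature_range=(3, 5), recipe='best')

# Proposed: consider subsets of all sizes and return the smallest subset
# whose cross-validation score is within one standard error of the best
# score (e.g., if the best score is 0.90 with a standard error of 0.02,
# the smallest subset scoring >= 0.88 would be selected)
sfs = SFS(knn, feature_range='all', recipe='parsimonious')
```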
I have a dataset with 54 features. I am observing that using `k_features=(25, 30)` evaluates the model with up to all 54 features. Not logging a separate issue at the moment.
You can see here if you can get access:
https://www.kaggle.com/phsheth/ensemble-sequential-backward-selection?scriptVersionId=20009920
https://www.kaggleusercontent.com/kf/20009920/eyJhbGciOiJkaXIiLCJlbmMiOiJBMTI4Q0JDLUhTMjU2In0..vycxHb6-clsD7gXxIahuMA.9w0WyTfCh3sk2L-MkLCfIQOR5LIF-Hd_mSo5ivqZT2Pv556biCiHi7dRiaJL4rlXjwFFyboAF-vLrSU98hBJbeCiaWY7v0DqwIjnHg-51CSfgQe4Djy_MwHTdLQ4FgtTatJXG83GLuoK_8mDx4j0FVas4ZoxA7YperBIBjiuaLA.g1QNBOoCLpFT-NEBXL800g/__results___files/__results___16_2.png
Hm, that's weird and shouldn't happen. I just ran a quick example and couldn't reproduce this issue.
E.g., for backward selection:
```python
from sklearn.neighbors import KNeighborsClassifier
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
import numpy as np

np.random.seed(123)
X = np.random.random((100, 50))
y = np.zeros(100).astype(int)
y[50:] = 1

knn = KNeighborsClassifier(n_neighbors=3)
sfs1 = SFS(knn,
           k_features=(20, 30),
           forward=False,
           floating=False,
           scoring='accuracy',
           cv=0)
sfs1 = sfs1.fit(X, y)

print('Size of best selected subset:', len(sfs1.k_feature_idx_))
print('All feature subset sizes:', sfs1.subsets_.keys())
```
it returns:

```
Size of best selected subset: 26
All feature subset sizes: dict_keys([50, 49, 48, 47, 46, 45, 44, 43, 42, 41, 40, 39, 38, 37, 36, 35, 34, 33, 32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20])
```
And for forward selection, it also seems fine:
```python
sfs1 = SFS(knn,
           k_features=(20, 30),
           forward=True,
           floating=True,
           scoring='accuracy',
           cv=0)
sfs1 = sfs1.fit(X, y)

print('Size of best selected subset:', len(sfs1.k_feature_idx_))
print('All feature subset sizes:', sfs1.subsets_.keys())
```
it returns:

```
Size of best selected subset: 23
All feature subset sizes: dict_keys([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30])
```
I tried the floating variants as well and there is no issue. How did you run the SFS exactly, and what version of mlxtend are you using?
You can check via

```python
import mlxtend
mlxtend.__version__
```
Using version '0.17.0' (it imports directly on the Kaggle kernel; I did not have to add the mlxtend library there).
```python
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs

# classifier_rf, train_X, and train_y are defined earlier in the notebook
seqbacksel_rf = SFS(classifier_rf, k_features=(25, 30),
                    forward=False, floating=False,
                    scoring='accuracy', cv=5,
                    n_jobs=-1)
seqbacksel_rf = seqbacksel_rf.fit(train_X, train_y.values.ravel())

print('best combination (ACC: %.3f): %s\n' % (seqbacksel_rf.k_score_, seqbacksel_rf.k_feature_idx_))
print('all subsets:\n', seqbacksel_rf.subsets_)
plot_sfs(seqbacksel_rf.get_metric_dict(), kind='std_err');
```
```
/opt/conda/lib/python3.6/site-packages/joblib/externals/loky/process_executor.py:706: UserWarning: A worker stopped while some jobs were given to the executor. This can be caused by a too short worker timeout or by a memory leak.
  "timeout or by a memory leak.", UserWarning
best combination (ACC: 0.886): (0, 3, 4, 5, 6, 7, 9, 10, 11, 12, 14, 15, 16, 17, 19, 23, 27, 29, 30, 31, 35, 36, 42, 43, 44, 45, 46, 50, 51)
all subsets:
...
```
I can't spot an issue in the example above; it all looks fine to me. Based on the plot, I would expect the SFS to return a subset of size 29, is that correct? You can double-check via the following code:

```python
print('Size of best selected subset:', len(seqbacksel_rf.k_feature_idx_))
```

(It should print 29, which would then be within the `k_features=(25, 30)` range you specified.)
OK sir, my apologies. I thought the code meant mlxtend should not evaluate the model using more than 30 features. It does evaluate those subsets, but it does not report a subset with more than 30 features.
Oh, maybe this was a misunderstanding then.
Say you set `k_features=(25, 30)`:
- If you use forward selection, it will start with 0 features and evaluate subsets of up to 30 features, and then select the best feature combination from the range between 25 and 30.
- If you use backward selection (like in your example), it will start with all features (here: 54) and eliminate features until 25 features are left. Then, it will select the best feature combination from the range between 25 and 30, as shown in the sketch below.
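Here is a minimal sketch of the backward-selection case, using synthetic data with 54 features to mimic your dataset (the data and variable names here are made up for illustration):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from mlxtend.feature_selection import SequentialFeatureSelector as SFS

np.random.seed(123)
X = np.random.random((100, 54))  # 54 features, as in your dataset
y = np.zeros(100).astype(int)
y[50:] = 1

knn = KNeighborsClassifier(n_neighbors=3)
sfs = SFS(knn, k_features=(25, 30),
          forward=False, floating=False,
          scoring='accuracy', cv=0)
sfs = sfs.fit(X, y)

# backward selection evaluates subsets of sizes 54 down to 25 ...
print('Evaluated subset sizes:', sorted(sfs.subsets_.keys()))
# ... but the reported best subset always lies within the (25, 30) range
print('Size of best selected subset:', len(sfs.k_feature_idx_))
```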
Hope this addresses the issue!?
I understand now, sir. Thanks for clarifying. I did read about backward selection, but the fundamentals slipped my mind.
No worries, and I am glad to hear that there's no bug :)