ShapRFECV: implement best feature set selection
Implement automatic selection of the best feature set in get_reduced_features_set(), i.e. the one with the highest validation AUC and the lowest number of features. Next to this, we should implement attributes as in RFECV, namely support_ and ranking_, so that users can get the final feature set without needing to manually select the best one.
In the end, the user should select the number of features manually; however, for consistency and usability, we can add a function that performs it automatically. Optionally we can print a warning whenever that is done.
Part of this issue is updating the tests and docs (a notebook with a tutorial).
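To make the proposal concrete, here is a minimal sketch of what the automatic selection could look like: pick the set with the highest validation AUC, break ties by the fewest features, and warn that the choice was made automatically. The report structure and the function name `select_best_features_set` are illustrative assumptions, not the current probatus API.

```python
import warnings

def select_best_features_set(report):
    """Pick the feature set with the highest validation AUC;
    on ties, prefer the set with fewer features.

    report: list of dicts with 'features' and 'val_auc' keys
    (an assumed structure, standing in for the ShapRFECV report).
    """
    best = max(report, key=lambda row: (row["val_auc"], -len(row["features"])))
    warnings.warn(
        "Feature set selected automatically (highest validation AUC, "
        "fewest features on ties); inspect the report to confirm."
    )
    return best["features"]

report = [
    {"features": ["a", "b", "c", "d"], "val_auc": 0.83},
    {"features": ["a", "b", "c"], "val_auc": 0.85},
    {"features": ["a", "b"], "val_auc": 0.85},
]
print(select_best_features_set(report))  # ['a', 'b'] (tie broken by size)
```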
Generally, the number of features and the performance form a concave curve: performance is lower with all features, increases (or stays constant) as certain features are removed, and then decreases when only a few features are left. Usually there isn't one best feature set but several. E.g. you have a feature set with 8 features at 0.85 AUC and a feature set with 5 features at 0.84 AUC. Both are good solutions in at least one of the metrics.
So instead of showing one best feature set, we could show the top n sets, which are better than the rest of the solutions. Basically, implement the Pareto front.
What do you think?
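The Pareto-front idea above can be sketched in a few lines: keep every (n_features, AUC) pair that is not dominated by another pair with fewer-or-equal features and greater-or-equal AUC (with at least one strict improvement). This is a standalone illustration, not probatus code.

```python
def pareto_front(candidates):
    """Return the (n_features, auc) pairs not dominated by any other.

    A pair dominates another if it has <= features AND >= AUC,
    with at least one of the two strictly better.
    """
    front = []
    for n, auc in candidates:
        dominated = any(
            (m <= n and a >= auc) and (m < n or a > auc)
            for m, a in candidates
        )
        if not dominated:
            front.append((n, auc))
    return sorted(front)

sets = [(8, 0.85), (5, 0.84), (6, 0.83), (10, 0.85), (3, 0.80)]
print(pareto_front(sets))  # [(3, 0.8), (5, 0.84), (8, 0.85)]
```

Here (6, 0.83) is dropped because (5, 0.84) beats it on both axes, and (10, 0.85) is dropped because (8, 0.85) reaches the same AUC with fewer features.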
In my experience with the feature:
- The shape of the curve has quite a high std due to the hyperparameter optimization at each step; if you don't do enough iterations, it will fluctuate. One interesting approach would be smoothing the curve before selecting the best set. However, I would be careful with that, because maybe removing a specific feature prevents overfitting and therefore leads to better performance, which could be indicated by the fluctuation.
- When you use the feature at the end, you either manually select the best feature set based on the graph, or you should be able to go very fast and just pick the one with the highest AUC and lowest std. This would be the convenience method for the users.
Your suggestion is something different, more of a get_candidate_reduced_sets(num_of_top_sets). It is also a nice approach, but I would treat it as an extra one.
Also, the goal in this issue is to be a bit more consistent with the sklearn RFECV feature and implement similar methods. This way, users who already know the other class will find ShapRFECV a bit easier to use.
I would +1 the proposal above to make ShapRFECV more aligned with RFECV. Two methods in particular:
- fit_transform(X, y), which returns X with the reduced feature set
- get_reduced_features_set() or get_feature_names_out() (the latter is the sklearn name convention)
Agree that it might be appropriate to manually inspect in some cases. This should be a user decision, and the above PR would be explicitly for users who opt into and understand the constraints of an automatic method. My 2 cents from a principle POV.
One library that is worth reviewing is mlxtend.feature_selection.SequentialFeatureSelector. They've also wrestled with this challenge IMO and identified the following ways to let users automatically select the best features:
- Specify the exact number of best_features to select with best CV performance (eg. k_features=3) [this is what probatus has today]
- Specify a range that the best number of features should be selected between with best CV performance (eg. k_features = (3, 5))
- Specify k_features="best" which finds the feature subset with the best cross-validation performance
- Specify k_features="parsimonious", which selects the smallest feature subset that is within one standard error of the best cross-validation performance
Source: mlxtend docs, https://rasbt.github.io/mlxtend/api_subpackages/mlxtend.feature_selection/#sequentialfeatureselector
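The "parsimonious" rule above can be implemented standalone in a few lines: take the best mean CV score, subtract its standard error, and pick the smallest feature set that still clears that threshold. The numbers below are made up for illustration; the dict shape is not a probatus or mlxtend structure.

```python
# Illustrative CV results: n_features -> (mean_cv_auc, std_err).
results = {
    10: (0.850, 0.010),
    8:  (0.852, 0.012),
    6:  (0.848, 0.011),
    4:  (0.835, 0.015),
}

# Threshold: best mean score minus its standard error.
best_n = max(results, key=lambda n: results[n][0])
best_score, best_se = results[best_n]
threshold = best_score - best_se  # 0.852 - 0.012 = 0.840

# Smallest set still within one standard error of the best.
parsimonious_n = min(n for n, (score, _) in results.items() if score >= threshold)
print(parsimonious_n)  # 6
```

This trades a statistically insignificant amount of performance for a smaller, cheaper model, which is often exactly what users of automatic selection want.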
But for a V1, I think adopting the same/similar methods as sklearn RFECV would be a great addition. Familiarity with sklearn will simplify adoption by users coming from the sklearn side.
Do we have alignment on this proposal and what API should be supported? Or still an open question from core team side?
Thanks for a great library.
@markdregan thanks for your contribution in https://github.com/ing-bank/probatus/pull/220