feature_engine
Support the use of SHAP values to get feature importances in ProbeFeatureSelection
First of all, thanks for this package. I've been using it for some time to do feature engineering and it's awesome.
Is your feature request related to a problem? Please describe.
I think I found a problem with the ProbeFeatureSelection algorithm. This algorithm uses the feature_importances_ of the scikit-learn estimator to select the features that have greater importance than the probe features. If you choose a RandomForestClassifier as the estimator and you are trying to perform binary classification, the impurity-based feature_importances_ will tend to prefer high-cardinality features (see the scikit-learn docs: https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html).
I found this issue while I was testing this algorithm with toy data:
```python
import pandas as pd

X = pd.DataFrame({
    "feature1": [0, 1, 0, 1, 0],
    "feature2": [6, 7, 8, 9, 10],
    "feature3": [11, 12, 13, 14, 15],
    "feature4": [16, 17, 18, 19, 20],
    "feature5": [21, 22, 23, 24, 25],
})
y = pd.Series([0, 1, 0, 1, 0])
```
In this example, feature1 is identical to y (correlation of 1.0), so the algorithm should choose feature1 as an important feature, right? If we run just one iteration of the algorithm, it will choose feature1 and feature2 with this setting:
```python
from feature_engine.selection import ProbeFeatureSelection
from sklearn.ensemble import RandomForestClassifier

X, y = sample_X_y
selector = ProbeFeatureSelection(
    estimator=RandomForestClassifier(max_depth=2, random_state=150),
    n_probes=1,
    distribution="uniform",
    random_state=150,
    confirm_variables=False,
    cv=2,
)
result = probe_feature_selection(selector, X, y)
```
But if we run PROBE two more times, we are left with an empty DataFrame (no features with greater importance than the random uniform probe). Here is the helper that repeats the selection until the feature count stops decreasing:
```python
import logging

def probe_feature_selection(
    selector: ProbeFeatureSelection, X: pd.DataFrame, y: pd.Series
) -> pd.DataFrame:
    """Perform PROBE feature selection iteratively on the input data.

    Args:
        selector (ProbeFeatureSelection): The feature selection selector.
        X (pd.DataFrame): The input data.
        y (pd.Series): The target variable.

    Returns:
        pd.DataFrame: The transformed input data after feature selection.
    """
    feature_decrease = True
    iterations = 1
    while feature_decrease and len(X.columns) > 0:
        n_initial_features = len(X.columns)
        selector.fit(X, y)
        X = selector.transform(X)
        n_final_features = len(X.columns)
        # Stop once an iteration no longer removes any features.
        feature_decrease = n_initial_features > n_final_features
        logging.info(f"Iteration {iterations}: {n_initial_features} -> {n_final_features}")
        iterations += 1
    return X
```
Describe the solution you'd like
I think a possible solution would be to add the option of using SHAP values, instead of the scikit-learn feature_importances_, to select the features with greater importance than the probes.
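For illustration, here is a minimal sketch of what a SHAP-based ranking could look like, assuming the shap package is available (feature_engine does not expose this option today, and the helper name is hypothetical):

```python
import numpy as np
import pandas as pd
import shap
from sklearn.ensemble import RandomForestClassifier

def shap_importances(model: RandomForestClassifier, X: pd.DataFrame) -> pd.Series:
    """Mean absolute SHAP value per feature, as an analogue of feature_importances_."""
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X)
    # For binary classification, older shap releases return a list of
    # per-class arrays, newer ones a (n_samples, n_features, n_classes)
    # array; reduce to the positive class either way.
    if isinstance(shap_values, list):
        shap_values = shap_values[1]
    elif shap_values.ndim == 3:
        shap_values = shap_values[..., 1]
    return pd.Series(np.abs(shap_values).mean(axis=0), index=X.columns)
```

The resulting Series could then be compared against the probes' scores the same way the impurity-based importances are today.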
I think it's possible to use something more robust than the RF internal feature importance (which is just a feature usage counter) and something quicker than SHAP.
The problem lies in the research factor: I don't think we know exactly what will give the best result here with the minimum number of caveats.
P.S. I think there is one way of mitigating this unwanted behaviour: binning the features before fitting the model. This caps the number of unique values, which should help, just like histogram-based GBDTs do.
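A rough sketch of that binning idea on a realistically sized dataset, using feature_engine's own EqualFrequencyDiscretiser (the choice of q is illustrative):

```python
from feature_engine.discretisation import EqualFrequencyDiscretiser

# Cap each numerical feature at 10 unique values before fitting,
# mimicking what histogram-based GBDTs do internally.
disc = EqualFrequencyDiscretiser(q=10)
X_binned = disc.fit_transform(X)

# The selector then sees only low-cardinality features.
selector.fit(X_binned, y)
```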
I also see that RF feature importance has its limitations, e.g., correlated features will show half the importance they would have if used in isolation, and hence they might be lost to the probes.
sklearn uses importance gain as a measure of importance, not just counts. Feature counts are used by other implementations though, like XGBoost and LightGBM.
SHAP values also have their limitations. They approximate importance with a function that is not really related to how a RF works, so at the end of the day it's just another approximation. Plus, adding dependencies makes the library harder to maintain; I am already struggling with the constant new releases of pandas and sklearn.
We could try deriving importance from single-feature models instead, like the functionality we have in the single feature selector: https://feature-engine.trainindata.com/en/latest/user_guide/selection/SelectBySingleFeaturePerformance.html
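Something along these lines, as a rough sketch (the helper and the roc_auc scoring are illustrative, not existing feature_engine API):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def single_feature_importances(X: pd.DataFrame, y: pd.Series, cv: int = 3) -> pd.Series:
    """Score each feature by the cross-validated performance of a model trained on it alone."""
    scores = {
        column: cross_val_score(
            RandomForestClassifier(max_depth=2, random_state=150),
            X[[column]], y, cv=cv, scoring="roc_auc",
        ).mean()
        for column in X.columns
    }
    return pd.Series(scores)
```

A random probe feature should score close to 0.5 (chance level) under this scheme, which would give a natural threshold.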
Thoughts?
I think we shouldn't add more dependencies.