
Implement correlated features elimination routine

Matgrb opened this issue 4 years ago • 7 comments

  • Implement correlated features elimination in the feature_elimination module (stick to the class API: fit, compute, fit_compute and plot; see ShapRFECV for an example)

The process of feature elimination could be iterative and follow the schema:

  1. Select the most highly correlated feature pair
  2. If its correlation is above max_correlation_allowed, remove one of the two features using the provided strategy

The elimination strategy could be one of the following:

  • 'random' - randomly pick one of the two
  • 'centrality' - pick the one that is highly correlated with more of the other features
  • 'manual' - let the user select one of the two
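For illustration, the strategy choice could be isolated in a small helper. This is only a sketch; the function name and signature below are hypothetical, not an existing probatus API:

import numpy as np

def pick_feature_to_remove(idx_a, idx_b, corr_matrix, strategy='random'):
    """Return the index of the feature to drop from a correlated pair (sketch)."""
    if strategy == 'random':
        # Randomly pick one of the two
        return int(np.random.choice([idx_a, idx_b]))
    if strategy == 'centrality':
        # Drop the feature with the higher mean absolute correlation to all others
        mean_corr = np.abs(corr_matrix).mean(axis=1)
        return idx_a if mean_corr[idx_a] >= mean_corr[idx_b] else idx_b
    if strategy == 'manual':
        # Ask the user which feature to remove
        choice = int(input(f'Remove 0 (feature {idx_a}) or 1 (feature {idx_b})? '))
        return [idx_a, idx_b][choice]
    raise ValueError(f'Unknown strategy: {strategy}')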

In the future we could also consider adding:

  • a parameter indicating features that the user wants to keep regardless of correlation
  • handling of categorical features

Example code that I used for a similar purpose is below. It needs to be refactored and fitted into the probatus API.

from typing import List, Tuple

import numpy as np
import pandas as pd
from scipy.stats import spearmanr
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
 
feature_names = ['f1_missing', 'f2_missing', 'f3_unstable', 'f4_unstable', 'f5_unstable', 'f6_correlated', 'f7_correlated', 'f8_correlated', 'f9', 'f10', 'f11', 'f12', 'f13', 'f14', 'f15', 'f16', 'f17', 'f18', 'f19', 'f20']
 
# Prepare two samples
X, y = make_classification(n_samples=1000, class_sep=0.05, n_informative=6, n_features=20,
                           random_state=1, n_redundant=10, n_clusters_per_class=1)
X = pd.DataFrame(X, columns=feature_names)
X['f1_missing'] = X['f1_missing'].apply(lambda x: x if np.random.rand()<0.5 else np.nan)
X['f2_missing'] = X['f2_missing'].apply(lambda x: x if np.random.rand()<0.8 else np.nan)
X['f7_correlated'] = X['f6_correlated'] + 0.5
X['f8_correlated'] = X['f6_correlated'] * 0.5
 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
 
# Introduce shift in certain features in test
X_test['f3_unstable'] = X_test['f3_unstable'] + 2
X_test['f4_unstable'] = X_test['f4_unstable'] + 2
X_test['f5_unstable'] = X_test['f5_unstable'] + 2

def clean_correlation_matrix(corr_matrix: np.ndarray, replaced_value: int = -1) -> np.ndarray:
    """
    Overwrite the diagonal and lower-triangle values of a 2D correlation matrix with replaced_value.

    Args:
        corr_matrix (np.ndarray): 2D matrix with correlation values.
        replaced_value (int, optional): Value that is imputed, by default -1.

    Returns:
        (np.ndarray) Correlation matrix with the diagonal and lower triangle masked.
    """
    # Copy so the input matrix is not mutated in place
    cleaned_corr_matrix = corr_matrix.copy()

    for i in range(len(cleaned_corr_matrix)):
        for j in range(len(cleaned_corr_matrix)):
            # Mask the diagonal (i == j) and the lower triangle (i > j)
            if i >= j:
                cleaned_corr_matrix[i, j] = replaced_value
 
    return cleaned_corr_matrix
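# Note: the nested loop above could be written in vectorized form, which should be
# equivalent (an assumption worth unit-testing) and faster on wide matrices:
#   cleaned_corr_matrix[np.tril_indices_from(cleaned_corr_matrix)] = replaced_value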
 
def remove_correlated_features(corr_matrix: np.ndarray, correlation_threshold: float, feature_names: List[str]) -> Tuple[List[str], List[str], List[str]]:
    """
    Iteratively remove correlated features. At each step it looks at the most highly correlated feature pair
    and asks the user to select, based on expert knowledge, which feature to keep. The process continues until
    all pairs have a correlation below correlation_threshold.
 
    Args:
        corr_matrix (np.array): 2D Correlation Matrix.
        correlation_threshold (float): Threshold above which correlated features are removed.
        feature_names (list of str): list of feature names.
 
    Returns:
        (list, list, list) Feature names remaining after removal, the feature selected at each iteration of
         removal, and the feature removed at each iteration.
    """
    # Take absolute value of correlations
    corr_matrix = np.abs(corr_matrix)
 
    # Replace diagonal values
    clean_corr_matrix = clean_correlation_matrix(corr_matrix, -1)
 
    # Init variables
    selected_features = np.array([])
    removed_features = np.array([])
    remaining_features = np.array(feature_names)
 
    # Iteratively remove correlated features one by one if there is any correlation above threshold
    while np.max(clean_corr_matrix) > correlation_threshold:
        position_max_correlated_features = np.unravel_index(clean_corr_matrix.argmax(),
                                                            clean_corr_matrix.shape)
        print(f'Select one of the following features with correlation '
              f'{clean_corr_matrix[position_max_correlated_features]}: \n'
              f'0 for {remaining_features[position_max_correlated_features[0]]} \n'
              f'1 for {remaining_features[position_max_correlated_features[1]]}')
 
        selected_option = int(input())
        removed_option = 1 - selected_option
 
        # Append for tracking results
        removed_features = np.append(removed_features,
                                     remaining_features[position_max_correlated_features[removed_option]])
        selected_features = np.append(selected_features,
                                      remaining_features[position_max_correlated_features[selected_option]])
 
        print(f'Removing {remaining_features[position_max_correlated_features[removed_option]]}, Num of features left'
              f' {len(clean_corr_matrix) - 1}')
 
        # Remove the feature
        remaining_features = np.delete(remaining_features, position_max_correlated_features[removed_option])
        clean_corr_matrix = np.delete(clean_corr_matrix, position_max_correlated_features[removed_option], 0)
        clean_corr_matrix = np.delete(clean_corr_matrix, position_max_correlated_features[removed_option], 1)
    return list(remaining_features), list(selected_features), list(removed_features)
 
corr_matrix_numeric, p_val_matrix_numeric = spearmanr(X_train[feature_names], nan_policy='omit')
remaining_features, selected_features, removed_features = remove_correlated_features(corr_matrix_numeric, 0.95, feature_names)
 
print(f'Dropping highly correlated features {removed_features}.')
X_train = X_train.drop(columns=removed_features)
X_test = X_test.drop(columns=removed_features)
feature_names = [feature for feature in feature_names if feature not in removed_features]

You can make a class in the feature_elimination module. This class would look like this:

__init__(correlation_type='spearman' or 'pearson', correlation_threshold=0.95, **kwargs passed to the correlation method)
fit(X, y)
compute() -> pd.DataFrame
fit_compute(X, y)
plot() # Correlation matrix (possibly before and after)
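A minimal skeleton following that API could look as follows. This is only a sketch: the class name, attribute names and internals are hypothetical, not existing probatus code, and the elimination loop itself is left out.

import pandas as pd

class CorrelatedFeaturesElimination:  # hypothetical name
    """Sketch of the proposed correlated-features elimination class."""

    def __init__(self, correlation_type='spearman', correlation_threshold=0.95, **corr_kwargs):
        self.correlation_type = correlation_type
        self.correlation_threshold = correlation_threshold
        self.corr_kwargs = corr_kwargs

    def fit(self, X, y=None):
        # Correlation matrix before elimination, e.g. via pandas
        self.corr_before_ = X.corr(method=self.correlation_type, **self.corr_kwargs)
        # ... the iterative elimination from the snippet above would go here ...
        self.removed_features_ = []
        return self

    def compute(self):
        # Report which features were removed
        return pd.DataFrame({'removed_feature': self.removed_features_})

    def fit_compute(self, X, y=None):
        return self.fit(X, y).compute()

    def plot(self):
        # Plot the correlation matrix (possibly before and after elimination)
        ...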

Matgrb avatar Nov 12 '20 12:11 Matgrb

It could be nice to consider the removal based on the expected impact. For example, two features correlated at 0.90 could have different correlations with the target; in that case, select the one with the highest target correlation.

gverbock avatar Feb 17 '21 12:02 gverbock

@gverbock indeed that would be a nice additional strategy.

It requires passing y to the object, so I would start with the simpler strategies first. However, if we decide to develop it, I will make an issue for that!

Matgrb avatar Feb 17 '21 12:02 Matgrb

@Matgrb I'm not sure what the point is of removing correlated features iteratively. Why not save some computational power and remove all features above threshold H in one go, once you have the correlation matrix?

@gverbock @Matgrb Regarding some simple intelligence: you can create a feature rank that measures the number of pairs in which a given feature has a correlation above H, and then use this rank as an additional elimination rule. Assume you have X1, X2, X3 and X4. X1 is correlated with X4, and X1 is correlated with X2; the other features are not correlated among each other. In this case it makes sense to remove X1 rather than consider removing X4 or X2 (see the sketch below).
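A quick sketch of such a rank, counting for each feature how many other features it is correlated with above H (illustrative code, not an agreed API):

import numpy as np

def correlation_rank(corr_matrix, threshold):
    """For each feature, count the other features it correlates with above the threshold."""
    abs_corr = np.abs(np.asarray(corr_matrix))
    np.fill_diagonal(abs_corr, 0)  # ignore self-correlation
    return (abs_corr > threshold).sum(axis=1)

# In the X1..X4 example above, X1 gets rank 2 (pairs with X2 and X4),
# X2 and X4 get rank 1, X3 gets rank 0, so X1 is the first candidate for removal.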

In regards to categorical features, there is no easy solution:

  • If you want to measure correlation between two categorical features you can use a chi-squared test, but there are a lot of options, especially if they are ordinal
  • If you want to measure correlation between a categorical and a continuous feature you can use ANOVA, or trend tests (see the sketch below)
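For reference, a sketch of how these measures could be computed with scipy. The DataFrame df and its columns 'cat_a', 'cat_b' and 'value' are hypothetical names, only for illustration:

import pandas as pd
from scipy.stats import chi2_contingency, f_oneway

# df is a hypothetical DataFrame with categorical columns 'cat_a', 'cat_b'
# and a numeric column 'value'

# Two categorical features: chi-squared test on their contingency table
chi2, p_cat_cat, dof, expected = chi2_contingency(pd.crosstab(df['cat_a'], df['cat_b']))

# Categorical vs continuous feature: one-way ANOVA across the category groups
groups = [group['value'].to_numpy() for _, group in df.groupby('cat_a')]
f_stat, p_cat_num = f_oneway(*groups)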

I would propose to start with a simple implementation using Pearson correlation, as described in the initial issue description. I can also add the ranking I mentioned above. If you agree, I can pick it up and make a PR.

warriordrago avatar Mar 31 '21 12:03 warriordrago

The main point of doing this iteratively is the following situation. Take features A, B, C, D with correlations A-B 0.95, A-C 0.9, A-D 0.8, B-C 0.95, B-D 0.8, C-D 0.8, and a threshold of 0.95. The pairs at or above the threshold are A-B and B-C. If we remove iteratively, we drop B first (say), and then we can stop: the highest remaining correlation is A-C at 0.9, below the threshold, so A and C stay and keep information that the other features are missing. If we instead remove every feature involved in an above-threshold pair in one go, we drop A, B and C, and lose information that they carried and that D was missing. I also think doing it iteratively does not cause much of a performance drop, because the correlation matrix for the entire dataset is precomputed once, and we only delete rows and columns from it. At the same time it gives flexibility in case the user wants to select which of two correlated features to remove.
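To make this concrete, a small numeric check of the example above (illustrative sketch):

import numpy as np

# Correlation matrix for A, B, C, D from the example above
features = ['A', 'B', 'C', 'D']
corr = np.array([
    [1.00, 0.95, 0.90, 0.80],
    [0.95, 1.00, 0.95, 0.80],
    [0.90, 0.95, 1.00, 0.80],
    [0.80, 0.80, 0.80, 1.00],
])

# One-shot removal: every feature in a pair at/above 0.95 is flagged
rows, cols = np.where(np.triu(corr, k=1) >= 0.95)
print(sorted({features[i] for i in np.concatenate([rows, cols])}))  # ['A', 'B', 'C']

# Iterative removal: drop only B, then check the highest remaining correlation
reduced = np.delete(np.delete(corr, 1, axis=0), 1, axis=1)
print(np.triu(reduced, k=1).max())  # 0.9, below the threshold, so we stop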

An example of this: the business prefers to work with features that they understand better, so from the two correlated ones they would choose whichever suits them best.

I agree, let's start with Pearson correlation and numeric features. I would propose to also do this iteratively and, for now, just select one of the two randomly; later we can add other selection strategies. Please make a class in feature_elimination that follows the API of the other classes, e.g. ShapRFECV: __init__, fit, fit_compute, compute, plot. Similarly, reuse the parameter names where they overlap. You can also add the docstrings already.

Once you make a PR we can discuss there what other steps are needed e.g.

  • code example in docstring
  • unit tests
  • docs notebook

Matgrb avatar Mar 31 '21 12:03 Matgrb

Ok, I see the point. I thought you were recomputing the correlation after every step, which got me confused :) Can you assign the issue to me?

warriordrago avatar Mar 31 '21 13:03 warriordrago

If I understand correctly, feature-engine provides similar functionality. Do we still find this suggestion important, or shall we close it, since it has been inactive for so long?

operte avatar Apr 11 '22 11:04 operte

I would close this issue and refer to feature_engine for feature elimination based on correlation. I believe it makes more sense to suggest a change to feature_engine's existing functionality than to build a new one for Probatus.

gverbock avatar Apr 11 '22 19:04 gverbock

This issue is closed, as feature-engine provides this functionality.

ReinierKoops avatar Mar 20 '24 20:03 ReinierKoops