SequentialFeatureSelector with estimator that requires a pandas DataFrame input

Open shane-breeze opened this issue 3 years ago • 4 comments

Describe the workflow you want to enable

When a pandas DataFrame is passed to SequentialFeatureSelector, pass the DataFrame itself to the estimator's fit method (sequential_feature_selector.py#L432) rather than the NumPy array it is currently converted to (sequential_feature_selector.py#L337).

Describe your proposed solution

Pass the provided X dataset into the _calc_score functions without converting it to a NumPy array. If the dataset is a NumPy array, use X[:, k_idx] to select the columns. If the dataset is a pandas DataFrame, use X.iloc[:, k_idx] instead.
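A minimal sketch of that column selection, assuming k_idx is a tuple or list of integer column positions. The name `select_columns` is hypothetical and not part of mlxtend:

```python
import numpy as np
import pandas as pd


def select_columns(X, k_idx):
    """Return the columns of X at integer positions k_idx, keeping the container type."""
    cols = list(k_idx)  # accept tuples as well as lists of positions
    if isinstance(X, pd.DataFrame):
        # Positional selection keeps the index and the per-column dtypes.
        return X.iloc[:, cols]
    # Anything else is treated as array-like.
    return np.asarray(X)[:, cols]


df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"], "c": [0.5, 1.5]})
subset = select_columns(df, (0, 2))
print(type(subset).__name__, list(subset.columns))
```

With a helper like this the estimator would receive a DataFrame whenever the caller supplied one, and a NumPy array otherwise.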

Describe alternatives you've considered, if relevant

I've tried passing NumPy arrays to my estimator, but it selects features based on their dtype. A pandas DataFrame keeps a separate dtype for each column, whereas a NumPy array loses this information and has a single dtype for the whole array.
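To illustrate the dtype loss with a made-up three-column frame (the column names are purely illustrative):

```python
import pandas as pd

df = pd.DataFrame(
    {"count": [1, 2, 3], "price": [0.5, 1.5, 2.5], "label": ["a", "b", "c"]}
)

# Each DataFrame column keeps its own dtype, so dtype-based selection works.
numeric_cols = list(df.select_dtypes(include="number").columns)

# Converting to NumPy collapses all columns to one common dtype.
as_array = df.to_numpy()

print(numeric_cols)
print(as_array.dtype)
```

An estimator that relies on something like select_dtypes has nothing to distinguish the columns by once it only sees `as_array`.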

shane-breeze avatar May 24 '22 11:05 shane-breeze

I have also attempted to wrap the dataframe in the following class:

import pandas as pd
from mlxtend.feature_selection import SequentialFeatureSelector

class DataFrameWrapper(pd.DataFrame):
    @property
    def values(self):
        return self

    def __getitem__(self, indexers):
        if isinstance(indexers, tuple):
            indexers = tuple([
                idx if not isinstance(idx, tuple) else list(idx)
                for idx in indexers
            ])
        return self.iloc[indexers]


SequentialFeatureSelector(estimator).fit(DataFrameWrapper(X), y)

Although this does seem to work, it is not very clean.

shane-breeze avatar May 24 '22 12:05 shane-breeze

I like this overall idea. There is some rudimentary DataFrame support now (http://rasbt.github.io/mlxtend/user_guide/feature_selection/SequentialFeatureSelector/#example-11-using-pandas-dataframes) but yeah, I think it converts to NumPy arrays internally.

So with your DataFrameWrapper, the data frame support could be more native.

> Although this does seem to work, it is not very clean.

I agree that it might be a bit clunky for a user. However, how about adding this as a utility class to mlxtend and then calling it in SFS internally?

I.e., the user would still do

SequentialFeatureSelector(estimator).fit(X, y)

but if X is a DataFrame, it would use DataFrameWrapper internally? There is already some checking for pandas DataFrames inside the SFS code here where this could be added: https://github.com/rasbt/mlxtend/blob/d6eced453e524bc58e891d5958b2ddcbb42d97dd/mlxtend/feature_selection/sequential_feature_selector.py#L336

rasbt avatar May 25 '22 13:05 rasbt

FYI: I ran into this incompatibility today as well, in particular because I use a custom estimator that requires access to X.index. The workaround from @shane-breeze initially failed for me, I think because a version of .copy is called here: https://github.com/rasbt/mlxtend/blob/77a9a27ffd9c70e6099859828e678a1988420b8c/mlxtend/feature_selection/utilities.py#L136-L142

That ends up calling pd.DataFrame.copy, which itself calls a constructor to return a new DataFrame. The type therefore changes and the indexing workaround is lost. To preserve it, I added a copy method to the wrapper class, after which the workaround functioned again.

class DataFrameWrapper(pd.DataFrame):
    @property
    def values(self):
        return self

    def __getitem__(self, indexers):
        if isinstance(indexers, tuple):
            indexers = tuple([
                idx if not isinstance(idx, tuple) else list(idx)
                for idx in indexers
            ])
        return self.iloc[indexers]
        
    def copy(self, deep=True):
        # Re-wrap the result so the copy keeps the wrapper's indexing behaviour.
        return DataFrameWrapper(super().copy(deep=deep))
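An alternative that might avoid overriding copy by hand is the `_constructor` property, the hook pandas documents for subclassing, which pandas consults when methods such as copy build their result. A minimal sketch on bare subclasses follows. The class names are illustrative, and whether this covers every code path in mlxtend is untested here:

```python
import pandas as pd


class PlainSubclass(pd.DataFrame):
    """Subclass without the hook: copies fall back to pd.DataFrame."""


class PreservingSubclass(pd.DataFrame):
    """Subclass that tells pandas which class to build results with."""

    @property
    def _constructor(self):
        return PreservingSubclass


df = pd.DataFrame({"a": [1, 2], "b": [3.0, 4.0]})

plain_copy = PlainSubclass(df).copy()
kept_copy = PreservingSubclass(df).copy()

print(type(plain_copy).__name__)
print(type(kept_copy).__name__)
```

If the same property were added to DataFrameWrapper, other pandas methods that build new frames would presumably return the wrapper too, not just copy.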

chiemvs avatar Sep 20 '23 15:09 chiemvs