SequentialFeatureSelector with estimator that requires a pandas DataFrame input
Describe the workflow you want to enable
When passing a pandas DataFrame into SequentialFeatureSelector, pass the DataFrame on to the estimator's fit method (sequential_feature_selector.py#L432) rather than the numpy array it is currently converted to (sequential_feature_selector.py#L337).
Describe your proposed solution
Pass the provided X dataset into the _calc_score functions without converting it to a numpy array. If the dataset is a numpy array, use X[:, k_idx] to select the columns. If the dataset is a pandas DataFrame, use X.iloc[:, k_idx] to select the columns.
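The column selection proposed above could be sketched roughly as follows. The helper name `select_columns` is hypothetical and not part of mlxtend; it only illustrates dispatching on the container type so that a DataFrame stays a DataFrame.

```python
import numpy as np
import pandas as pd


def select_columns(X, k_idx):
    """Return the columns of X at positions k_idx, keeping X's container type.

    Hypothetical helper for illustration; not an mlxtend function.
    """
    cols = list(k_idx)  # .iloc treats a tuple as a multi-axis key, so use a list
    if isinstance(X, pd.DataFrame):
        return X.iloc[:, cols]
    return np.asarray(X)[:, cols]


X_df = pd.DataFrame({"a": [1, 2], "b": [0.5, 1.5], "c": ["x", "y"]})
subset_df = select_columns(X_df, (0, 2))  # still a DataFrame, columns "a" and "c"
subset_np = select_columns(np.arange(6).reshape(2, 3), (0, 2))  # plain ndarray
```

Converting `k_idx` to a list matters because the feature subsets are passed around as tuples, which `.iloc` would otherwise interpret as one key per axis.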
Describe alternatives you've considered, if relevant
I've tried using numpy arrays in my estimator, but the estimator selects features based on their dtype. A pandas DataFrame keeps a separate dtype for each column, whereas a numpy array loses this information and has only a single global dtype.
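A small illustration of the dtype loss described above, using made-up column names and values. The `select_dtypes` call stands in for whatever dtype-based selection a custom estimator might perform.

```python
import pandas as pd

# Made-up example data with one integer, one float, and one string column
df = pd.DataFrame(
    {"age": [31, 45], "height": [1.72, 1.80], "city": ["Oslo", "Lima"]}
)

# On the DataFrame, each column keeps its own dtype, so dtype-based selection works
numeric = df.select_dtypes(include="number")

# After conversion, the whole array shares one dtype and the per-column information is gone
as_array = df.to_numpy()
print(df.dtypes)
print(as_array.dtype)
```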
I have also attempted to wrap the dataframe in the following class:
import pandas as pd
from mlxtend.feature_selection import SequentialFeatureSelector


class DataFrameWrapper(pd.DataFrame):
    @property
    def values(self):
        # Return the DataFrame itself wherever a numpy array would be extracted
        return self

    def __getitem__(self, indexers):
        # Support numpy-style X[:, k_idx]: .iloc needs lists, not tuples, of positions
        if isinstance(indexers, tuple):
            indexers = tuple([
                idx if not isinstance(idx, tuple) else list(idx)
                for idx in indexers
            ])
        return self.iloc[indexers]


SequentialFeatureSelector(estimator).fit(DataFrameWrapper(X), y)
Although this does seem to work, it is not very clean.
I like this overall idea. There is some rudimentary DataFrame support now (http://rasbt.github.io/mlxtend/user_guide/feature_selection/SequentialFeatureSelector/#example-11-using-pandas-dataframes) but yeah, I think it converts to NumPy arrays internally.
So with your DataFrameWrapper, the data frame support could be more native.
> Although this does seem to work, it is not very clean.
I agree that it might be a bit clunky for a user. However, how about adding this as a utility class to mlxtend and then calling it in SFS internally?
I.e., the user would still do
SequentialFeatureSelector(estimator).fit(X, y)
but if X is a DataFrame, it would use DataFrameWrapper internally? There is already some checking for pandas DataFrames inside the SFS code here where this could be added: https://github.com/rasbt/mlxtend/blob/d6eced453e524bc58e891d5958b2ddcbb42d97dd/mlxtend/feature_selection/sequential_feature_selector.py#L336
FYI: I ran into this incompatibility as well today, in particular because I use a custom estimator requiring access to X.index.
The workaround from @shane-breeze initially failed for me, I think because a version of .copy is called here: https://github.com/rasbt/mlxtend/blob/77a9a27ffd9c70e6099859828e678a1988420b8c/mlxtend/feature_selection/utilities.py#L136-L142
As a result, pd.DataFrame.copy is called, which itself calls a constructor that returns a new plain DataFrame. The type therefore changes and the indexing workaround is lost. To preserve it, I added a copy method to the wrapper class, after which the workaround worked again.
class DataFrameWrapper(pd.DataFrame):
    @property
    def values(self):
        return self

    def __getitem__(self, indexers):
        if isinstance(indexers, tuple):
            indexers = tuple([
                idx if not isinstance(idx, tuple) else list(idx)
                for idx in indexers
            ])
        return self.iloc[indexers]

    def copy(self, deep=True):
        # Re-wrap the copy so the subclass type survives
        return DataFrameWrapper(super().copy(deep=deep))
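The type loss on copy can be reproduced with a plain subclass, independent of mlxtend. The sketch below uses two hypothetical classes: one with no overrides, and one that overrides the `_constructor` property, which is the hook pandas documents for subclasses that want their type preserved. Whether `_constructor` alone would be enough for the SFS workaround was not tested in this thread; treat it as a possible alternative to overriding `copy`, not a confirmed fix.

```python
import pandas as pd


class PlainSubclass(pd.DataFrame):
    """Hypothetical subclass with no overrides."""


class KeepsType(pd.DataFrame):
    """Hypothetical subclass that tells pandas which class to build."""

    @property
    def _constructor(self):
        return KeepsType


data = {"a": [1, 2], "b": [3.0, 4.0]}
plain_copy = PlainSubclass(data).copy()  # falls back to a plain pd.DataFrame
kept_copy = KeepsType(data).copy()       # stays a KeepsType

print(type(plain_copy).__name__)
print(type(kept_copy).__name__)
```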