sklearn-pandas
Preserving column names when transformer requires multiple columns as input
I was wondering whether it is possible to preserve column names when using a transformer that requires multiple columns of the dataframe. I'll try to illustrate what I mean with an example.
import pandas as pd

from sklearn.feature_selection import SelectKBest, chi2
from sklearn_pandas import DataFrameMapper

data = pd.DataFrame({
    'pet': ['cat', 'dog', 'dog', 'fish', 'cat', 'dog', 'cat', 'fish'],
    'children': [4., 6, 3, 3, 2, 3, 5, 4],
    'salary': [90., 24, 44, 27, 32, 59, 36, 27]})

mapper_fs = DataFrameMapper([(['children', 'salary'], SelectKBest(chi2, k=2))])
mapper_fs.fit_transform(data[['children', 'salary']], data['pet'])
print(mapper_fs.transformed_names_)
This outputs ['children_salary'], whereas I would expect just ['salary']. This makes it impossible to keep track of which columns were dropped by the SelectKBest transformer. Is there currently a way to solve this problem?
I believe this should be possible if the transformer you use implements some interface to get the names of the resulting features. See https://github.com/scikit-learn-contrib/sklearn-pandas/blob/master/sklearn_pandas/dataframe_mapper.py#L40. Perhaps you can extend SelectKBest to provide this interface?
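For instance, a minimal sketch of that idea could look like the following (NamedSelectKBest and its feature_names parameter are illustrative, not part of sklearn-pandas or scikit-learn, and whether or how DataFrameMapper prefixes the returned names depends on its get_names implementation):

from sklearn.feature_selection import SelectKBest, chi2

class NamedSelectKBest(SelectKBest):
    # A SelectKBest that is told its input column names up front and can
    # report the names of the columns it kept.
    def __init__(self, score_func=chi2, *, k=10, feature_names=None):
        super().__init__(score_func, k=k)
        self.feature_names = feature_names  # input column names, in order

    def get_feature_names(self):
        # Derive the kept names from the boolean support mask (call after fit).
        mask = self.get_support()
        return [name for name, keep in zip(self.feature_names, mask) if keep]

After fitting, get_feature_names() would return something like ['salary'] for k=1.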
Marking as "good first issue" to come up with an example of this that works.
@hildeweerts I would say that this kind of transformer is one of the most challenging to insert into a pipeline. The code we have now doesn't support this kind of flexibility, so I guess the simplest approach is to track the changes manually. Not really convenient, but I believe there are no other ways right now.
As @dukebody proposed, it could be something like this (I guess we need to pick k=1 if we want to choose the best of the two columns):
import pandas as pd
import numpy as np

from sklearn.feature_selection import SelectKBest, chi2
from sklearn_pandas import DataFrameMapper


class TrackingSelectKBest(SelectKBest):
    def fit(self, X, y=None):
        super().fit(X, y)
        # Rank the column indices by their score, best first.
        scores = sorted(
            [(i, score) for i, score in enumerate(self.scores_)],
            key=lambda pair: pair[1],
            reverse=True)
        # Remember the indices of the k columns that were kept.
        self.best_columns_ = [i for i, score in scores[:self.k]]
        return self


def main():
    data = pd.DataFrame({
        'pet': ['cat', 'dog', 'dog', 'fish', 'cat', 'dog', 'cat', 'fish'],
        'children': [4., 6, 3, 3, 2, 3, 5, 4],
        'salary': [90., 24, 44, 27, 32, 59, 36, 27]})

    selector = TrackingSelectKBest(chi2, k=1)
    columns = np.array(['children', 'salary'])

    m = DataFrameMapper([(columns, selector)])
    m.fit_transform(data[columns], data['pet'])

    # Map the tracked indices back to the original column names.
    print(columns[selector.best_columns_])


if __name__ == '__main__':
    main()
The snippet above should print:
['salary']
Probably there are other ways to achieve this, but scikit-learn transformers receive numpy arrays without access to the original data frame's column names, so it is not possible to derive these names from the transformer itself.
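A quick way to see this (PeekInput below is purely illustrative) is to print what DataFrameMapper actually passes to fit; by default it is a plain numpy array, so the column names are already gone by the time the transformer sees the data:

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn_pandas import DataFrameMapper

class PeekInput(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        print(type(X))  # expected: <class 'numpy.ndarray'> unless input_df=True
        return self

    def transform(self, X):
        return X

demo = pd.DataFrame({'children': [4., 6, 3], 'salary': [90., 24, 44]})
DataFrameMapper([(['children', 'salary'], PeekInput())]).fit_transform(demo)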
@dukebody No updates on this? It sounds like a really good idea to implement.
The scikit-learn 1.0 estimator API has better support for feature names. For example, using DataFrameMapper's input_df=True (with the same data as in the original example) allows us to get:
mapper_fs = DataFrameMapper([(['children','salary'], SelectKBest(chi2, k=2))], input_df=True)
mapper_fs.fit_transform(data[['children','salary']], data['pet'])
print(mapper_fs.transformed_names_)
['children_salary_children', 'children_salary_salary']
where the children_salary prefix was added by sklearn-pandas.
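As an aside, with scikit-learn >= 1.0 the selector itself can report which columns it kept when it is fitted directly on a DataFrame, independently of the prefix sklearn-pandas adds (a small sketch reusing data from the first example):

from sklearn.feature_selection import SelectKBest, chi2

# Fitting on DataFrame columns records the input feature names on the estimator.
selector = SelectKBest(chi2, k=1)
selector.fit(data[['children', 'salary']], data['pet'])
print(selector.get_feature_names_out())  # expected: ['salary']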
The same applies to, e.g., OneHotEncoder:
import pandas as pd
from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    'col': [0, 0, 1, 1, 2, 3, 0],
    'target': [0, 0, 1, 1, 2, 3, 0]
})

mapper = DataFrameMapper([(['col', 'target'], OneHotEncoder())], df_out=True)
transformed = mapper.fit_transform(df)
print(mapper.transformed_names_)
['col_target_x0_0', 'col_target_x0_1', 'col_target_x0_2', 'col_target_x0_3', 'col_target_x1_0', 'col_target_x1_1', 'col_target_x1_2', 'col_target_x1_3']
vs
mapper = DataFrameMapper([(['col', 'target'], OneHotEncoder())], input_df=True, df_out=True)
transformed = mapper.fit_transform(df)
print(mapper.transformed_names_)
['col_target_col_0', 'col_target_col_1', 'col_target_col_2', 'col_target_col_3', 'col_target_target_0', 'col_target_target_1', 'col_target_target_2', 'col_target_target_3']
Would it be possible to take advantage of scikit-learn's capabilities and improve the name handling in DataFrameMapper.get_names?
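One possible direction, sketched purely as an illustration (names_for is a hypothetical helper, not part of sklearn-pandas), would be to prefer the fitted transformer's own get_feature_names_out() from scikit-learn >= 1.0 and only fall back to something like the current column-joining behaviour otherwise:

from sklearn.preprocessing import OneHotEncoder

def names_for(transformer, input_columns):
    # Prefer the estimator's own output names (available since scikit-learn 1.0).
    if hasattr(transformer, 'get_feature_names_out'):
        return list(transformer.get_feature_names_out(input_columns))
    # Rough fallback in the spirit of the current behaviour: join the input columns.
    return ['_'.join(map(str, input_columns))]

# Reusing df from the OneHotEncoder example above:
enc = OneHotEncoder().fit(df[['col', 'target']])
print(names_for(enc, ['col', 'target']))
# expected: ['col_0', 'col_1', 'col_2', 'col_3', 'target_0', 'target_1', 'target_2', 'target_3']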