
Preserving column names when transformer requires multiple columns as input

Open hildeweerts opened this issue 6 years ago • 4 comments

I was wondering whether it is possible to preserve column names when using a transformer that requires multiple columns of the dataframe. I'll try to illustrate what I mean with an example.

import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2
from sklearn_pandas import DataFrameMapper

data = pd.DataFrame({
    'pet':      ['cat', 'dog', 'dog', 'fish', 'cat', 'dog', 'cat', 'fish'],
    'children': [4., 6, 3, 3, 2, 3, 5, 4],
    'salary':   [90., 24, 44, 27, 32, 59, 36, 27]})

mapper_fs = DataFrameMapper([(['children', 'salary'], SelectKBest(chi2, k=2))])
mapper_fs.fit_transform(data[['children', 'salary']], data['pet'])
print(mapper_fs.transformed_names_)

This outputs ['children_salary'], whereas I would expect just ['salary']. This makes it impossible to keep track of which columns were dropped by the SelectKBest transformer. Is there currently a way to solve this problem?

hildeweerts avatar Oct 02 '18 08:10 hildeweerts

I believe this should be possible if the transformer you use implements some interface to get the name of the resulting features. See https://github.com/scikit-learn-contrib/sklearn-pandas/blob/master/sklearn_pandas/dataframe_mapper.py#L40.

Perhaps you can extend SelectKBest to provide this interface?

Marking as "good first issue" to come up with an example of this that works.
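One way to provide that interface might look like the sketch below. NamedSelectKBest is a hypothetical subclass (not part of sklearn-pandas or sklearn): it remembers the input column names at fit time and exposes a get_feature_names() method, the hook that DataFrameMapper.get_names() checks for. Note it can only see column names when fitted directly on a DataFrame (or via input_df=True); with a plain numpy array it falls back to positional names.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

class NamedSelectKBest(SelectKBest):
    """Hypothetical SelectKBest subclass that exposes get_feature_names(),
    the hook DataFrameMapper.get_names() looks for on transformers."""

    def fit(self, X, y=None):
        # Capture column names if X is a DataFrame; empty list otherwise.
        self._input_names = list(getattr(X, 'columns', []))
        return super().fit(X, y)

    def get_feature_names(self):
        mask = self.get_support()  # boolean mask over the input features
        if self._input_names:
            return list(np.asarray(self._input_names)[mask])
        # Fall back to positional names when fitted on a bare array.
        return ['x%d' % i for i in np.flatnonzero(mask)]

data = pd.DataFrame({
    'pet':      ['cat', 'dog', 'dog', 'fish', 'cat', 'dog', 'cat', 'fish'],
    'children': [4., 6, 3, 3, 2, 3, 5, 4],
    'salary':   [90., 24, 44, 27, 32, 59, 36, 27]})

selector = NamedSelectKBest(chi2, k=1)
selector.fit(data[['children', 'salary']], data['pet'])
print(selector.get_feature_names())
```

On the example data from the question this prints ['salary'], since salary scores higher under chi2.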

dukebody avatar Oct 17 '18 18:10 dukebody

@hildeweerts I would say that this kind of transformer is one of the most challenging to insert into a pipeline. The code we have now doesn't support this kind of flexibility, so I guess the simplest way is to manually track the changes. Not really convenient, but I believe there are no other ways right now.

As @dukebody proposed, it could be something like this (I guess we need to pick k=1 if we want to choose the best of the two columns):

import pandas as pd
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn_pandas import DataFrameMapper

class TrackingSelectKBest(SelectKBest):
    """SelectKBest that records the indices of the selected columns."""
    def fit(self, X, y=None):
        super().fit(X, y)
        # Rank feature indices by score, highest first, and keep the top k.
        scores = sorted(
            enumerate(self.scores_),
            key=lambda pair: pair[1],
            reverse=True)
        self.best_columns_ = [i for i, score in scores[:self.k]]
        return self

def main():
    data = pd.DataFrame({
        'pet':      ['cat', 'dog', 'dog', 'fish', 'cat', 'dog', 'cat', 'fish'],
        'children': [4., 6, 3, 3, 2, 3, 5, 4],
        'salary':   [90., 24, 44, 27, 32, 59, 36, 27]})
    selector = TrackingSelectKBest(chi2, k=1)
    columns = np.array(['children', 'salary'])
    m = DataFrameMapper([(columns, selector)])
    m.fit_transform(data[columns], data['pet'])
    print(columns[selector.best_columns_])

if __name__ == '__main__':
    main()

The snippet above should print:

['salary']

There are probably other ways to achieve this, but scikit-learn transformers receive numpy arrays without access to the original data frame's column names, so the names cannot be derived from the transformer itself.
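As a simpler alternative to subclassing, the same mapping can be recovered after fitting with SelectKBest's standard get_support() method, which returns a boolean mask over the input features. A sketch on the thread's example data:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

data = pd.DataFrame({
    'pet':      ['cat', 'dog', 'dog', 'fish', 'cat', 'dog', 'cat', 'fish'],
    'children': [4., 6, 3, 3, 2, 3, 5, 4],
    'salary':   [90., 24, 44, 27, 32, 59, 36, 27]})

columns = ['children', 'salary']
selector = SelectKBest(chi2, k=1)
selector.fit(data[columns], data['pet'])

# get_support() is a boolean mask over the input features;
# index the original column names with it to see what survived.
kept = np.array(columns)[selector.get_support()]
print(kept)
```

This keeps the tracking outside the transformer, so it works with an unmodified SelectKBest, though you still have to hold on to the column list yourself.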

devforfu avatar Jan 29 '19 09:01 devforfu

@dukebody No updates on this? Sounds like a really good idea to implement.

Aashit-Sharma avatar Feb 24 '20 06:02 Aashit-Sharma

Sklearn 1.0's estimator API has better support for feature names. For example, using DataFrameMapper's input_df=True allows us to get:

mapper_fs = DataFrameMapper([(['children','salary'], SelectKBest(chi2, k=2))], input_df=True)
mapper_fs.fit_transform(data[['children','salary']], data['pet'])
print(mapper_fs.transformed_names_)
['children_salary_children', 'children_salary_salary']

where the children_salary prefix was added by sklearn-pandas.

The same applies to, e.g., OneHotEncoder:

import pandas as pd
from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    'col': [0, 0, 1, 1, 2, 3, 0],
    'target': [0, 0, 1, 1, 2, 3, 0]
})
mapper = DataFrameMapper([(['col', 'target'], OneHotEncoder())], df_out=True)
transformed = mapper.fit_transform(df)
print(mapper.transformed_names_)
['col_target_x0_0', 'col_target_x0_1', 'col_target_x0_2', 'col_target_x0_3', 'col_target_x1_0', 'col_target_x1_1', 'col_target_x1_2', 'col_target_x1_3']

vs

mapper = DataFrameMapper([(['col', 'target'], OneHotEncoder())], input_df=True, df_out=True)
transformed = mapper.fit_transform(df)
print(mapper.transformed_names_)
['col_target_col_0', 'col_target_col_1', 'col_target_col_2', 'col_target_col_3', 'col_target_target_0', 'col_target_target_1', 'col_target_target_2', 'col_target_target_3']

Would it be possible to take advantage of sklearn's capabilities and improve the handling in DataFrameMapper.get_names?

falcaopetri avatar Oct 17 '21 18:10 falcaopetri