sklearn-pandas
Preserving column names when transformer requires multiple columns as input
I was wondering whether it is possible to preserve column names when using a transformer that requires multiple columns of the dataframe. I'll try to illustrate what I mean with an example.
import pandas as pd

from sklearn.feature_selection import SelectKBest, chi2
from sklearn_pandas import DataFrameMapper

data = pd.DataFrame({
    'pet': ['cat', 'dog', 'dog', 'fish', 'cat', 'dog', 'cat', 'fish'],
    'children': [4., 6, 3, 3, 2, 3, 5, 4],
    'salary': [90., 24, 44, 27, 32, 59, 36, 27]})

mapper_fs = DataFrameMapper([(['children', 'salary'], SelectKBest(chi2, k=2))])
mapper_fs.fit_transform(data[['children', 'salary']], data['pet'])
print(mapper_fs.transformed_names_)
This outputs ['children_salary'], whereas I would expect just ['salary']. This makes it impossible to keep track of which columns were dropped by the SelectKBest transformer. Is there currently a way to solve this problem?
I believe this should be possible if the transformer you use implements some interface to get the names of the resulting features. See https://github.com/scikit-learn-contrib/sklearn-pandas/blob/master/sklearn_pandas/dataframe_mapper.py#L40. Perhaps you can extend SelectKBest to provide this interface?
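For instance, a minimal sketch of that idea could look like the following (NamedSelectKBest and its feature_names parameter are illustrative, not part of sklearn-pandas or scikit-learn, and whether or how DataFrameMapper prefixes the returned names depends on its get_names implementation):

from sklearn.feature_selection import SelectKBest, chi2

class NamedSelectKBest(SelectKBest):
    # A SelectKBest that is told its input column names up front and can
    # report the names of the columns it kept.
    def __init__(self, score_func=chi2, *, k=10, feature_names=None):
        super().__init__(score_func, k=k)
        self.feature_names = feature_names  # input column names, in order

    def get_feature_names(self):
        # Derive the kept names from the boolean support mask (call after fit).
        mask = self.get_support()
        return [name for name, keep in zip(self.feature_names, mask) if keep]

After fitting, get_feature_names() would return something like ['salary'] for k=1.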
Marking as "good first issue" to come up with an example of this that works.
@hildeweerts I would say that this kind of transformer is one of the most challenging to insert into a pipeline. The code we have now doesn't support this kind of flexibility, so I guess the simplest approach is to track the changes manually. Not really convenient, but I believe there are no other ways right now.
As @dukebody proposed, it could be something like this (I guess we need to pick k=1 if we want to choose the best of the two columns):
import pandas as pd
import numpy as np

from sklearn.feature_selection import SelectKBest, chi2
from sklearn_pandas import DataFrameMapper


class TrackingSelectKBest(SelectKBest):
    def fit(self, X, y=None):
        super().fit(X, y)
        # Rank the column indices by their score, best first.
        scores = sorted(
            [(i, score) for i, score in enumerate(self.scores_)],
            key=lambda pair: pair[1],
            reverse=True)
        # Remember the indices of the k columns that were kept.
        self.best_columns_ = [i for i, score in scores[:self.k]]
        return self


def main():
    data = pd.DataFrame({
        'pet': ['cat', 'dog', 'dog', 'fish', 'cat', 'dog', 'cat', 'fish'],
        'children': [4., 6, 3, 3, 2, 3, 5, 4],
        'salary': [90., 24, 44, 27, 32, 59, 36, 27]})

    selector = TrackingSelectKBest(chi2, k=1)
    columns = np.array(['children', 'salary'])

    m = DataFrameMapper([(columns, selector)])
    m.fit_transform(data[columns], data['pet'])

    # Map the tracked indices back to the original column names.
    print(columns[selector.best_columns_])


if __name__ == '__main__':
    main()
The snippet above should print:
['salary']
Probably there are other ways to achieve this, but scikit-learn transformers receive numpy arrays without access to the original data frame's column names, so it is not possible to derive these names from the transformer itself.
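A quick way to see this (PeekInput below is purely illustrative) is to print what DataFrameMapper actually passes to fit; by default it is a plain numpy array, so the column names are already gone by the time the transformer sees the data:

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn_pandas import DataFrameMapper

class PeekInput(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        print(type(X))  # expected: <class 'numpy.ndarray'> unless input_df=True
        return self

    def transform(self, X):
        return X

demo = pd.DataFrame({'children': [4., 6, 3], 'salary': [90., 24, 44]})
DataFrameMapper([(['children', 'salary'], PeekInput())]).fit_transform(demo)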
@dukebody No updates on this? It sounds like a really good idea to implement.
The scikit-learn 1.0 estimator API has better support for feature names. For example, using DataFrameMapper's input_df=True (with the same data as in the original example) allows us to get:
mapper_fs = DataFrameMapper([(['children','salary'], SelectKBest(chi2, k=2))], input_df=True)
mapper_fs.fit_transform(data[['children','salary']], data['pet'])
print(mapper_fs.transformed_names_)
['children_salary_children', 'children_salary_salary']
where the children_salary prefix was added by sklearn-pandas.
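As an aside, with scikit-learn >= 1.0 the selector itself can report which columns it kept when it is fitted directly on a DataFrame, independently of the prefix sklearn-pandas adds (a small sketch reusing data from the first example):

from sklearn.feature_selection import SelectKBest, chi2

# Fitting on DataFrame columns records the input feature names on the estimator.
selector = SelectKBest(chi2, k=1)
selector.fit(data[['children', 'salary']], data['pet'])
print(selector.get_feature_names_out())  # expected: ['salary']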
The same applies to, e.g., OneHotEncoder:
import pandas as pd
from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    'col': [0, 0, 1, 1, 2, 3, 0],
    'target': [0, 0, 1, 1, 2, 3, 0]
})

mapper = DataFrameMapper([(['col', 'target'], OneHotEncoder())], df_out=True)
transformed = mapper.fit_transform(df)
print(mapper.transformed_names_)
['col_target_x0_0', 'col_target_x0_1', 'col_target_x0_2', 'col_target_x0_3', 'col_target_x1_0', 'col_target_x1_1', 'col_target_x1_2', 'col_target_x1_3']
vs
mapper = DataFrameMapper([(['col', 'target'], OneHotEncoder())], input_df=True, df_out=True)
transformed = mapper.fit_transform(df)
print(mapper.transformed_names_)
['col_target_col_0', 'col_target_col_1', 'col_target_col_2', 'col_target_col_3', 'col_target_target_0', 'col_target_target_1', 'col_target_target_2', 'col_target_target_3']
Would it be possible to take advantage of scikit-learn's capabilities and improve the name handling in DataFrameMapper.get_names?
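One possible direction, sketched purely as an illustration (names_for is a hypothetical helper, not part of sklearn-pandas), would be to prefer the fitted transformer's own get_feature_names_out() from scikit-learn >= 1.0 and only fall back to something like the current column-joining behaviour otherwise:

from sklearn.preprocessing import OneHotEncoder

def names_for(transformer, input_columns):
    # Prefer the estimator's own output names (available since scikit-learn 1.0).
    if hasattr(transformer, 'get_feature_names_out'):
        return list(transformer.get_feature_names_out(input_columns))
    # Rough fallback in the spirit of the current behaviour: join the input columns.
    return ['_'.join(map(str, input_columns))]

# Reusing df from the OneHotEncoder example above:
enc = OneHotEncoder().fit(df[['col', 'target']])
print(names_for(enc, ['col', 'target']))
# expected: ['col_0', 'col_1', 'col_2', 'col_3', 'target_0', 'target_1', 'target_2', 'target_3']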