
Making DataFrameMapper compatible with GridSearchCV

Open devforfu opened this issue 5 years ago • 3 comments

This PR attempts to implement the proposal from issue #159. The idea is to write custom get_params and set_params methods that are compatible with scikit-learn's grid search objects.

The following snippet shows the supported features:

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, FunctionTransformer
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn_pandas import DataFrameMapper

mapper = DataFrameMapper([
    (['colA'], StandardScaler()),
    (['colB'], StandardScaler()),
    ('colC', [StandardScaler(), FunctionTransformer()])
])
pipeline = Pipeline([
    ('mapper', mapper),
    ('classifier', SVC(kernel='linear'))
])

# 1. the pipeline parameters include parameters of the nested transformers
parameters = pipeline.get_params()
assert 'mapper__colA__with_mean' in parameters
assert 'mapper__colA__with_std' in parameters
assert 'mapper__colB__with_mean' in parameters
assert 'mapper__colB__with_std' in parameters

# 2. the parameters of nested transformers can be set from the outside
pipeline.set_params(
    mapper__colA__with_mean=True,
    mapper__colB__with_std=False
)

# 3. getting parameters from list of transformers
assert 'mapper__colC__standardscaler__with_mean' in parameters
assert 'mapper__colC__functiontransformer__func' in parameters

# 4. setting parameters to list of transformers
pipeline.set_params(
    mapper__colC__standardscaler__with_mean=True,
    mapper__colC__functiontransformer__func=np.log1p
)

# 5. grid search with parameters
param_grid = dict(
    mapper__colA__with_mean=[True, False],
    mapper__colB__with_std=[True, False],
    mapper__colC__functiontransformer__func=[np.log1p, np.exp, None]
)
grid_search = GridSearchCV(pipeline, param_grid=param_grid)
grid_search.fit(X, y)

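For comparison (not part of this PR), scikit-learn's own ColumnTransformer already exposes nested parameters with the same double-underscore convention, using an explicit name per transformer. A minimal sketch of the equivalent pipeline:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, FunctionTransformer
from sklearn.svm import SVC

# each entry is (name, transformer, columns); the name anchors the param path
ct = ColumnTransformer([
    ('colA', StandardScaler(), ['colA']),
    ('colB', StandardScaler(), ['colB']),
    ('colC', Pipeline([('standardscaler', StandardScaler()),
                       ('functiontransformer', FunctionTransformer())]), ['colC']),
])
pipe = Pipeline([('mapper', ct), ('classifier', SVC(kernel='linear'))])

params = pipe.get_params()
assert 'mapper__colA__with_mean' in params
assert 'mapper__colC__functiontransformer__func' in params

# nested params are settable from the outside, just like in the proposal above
pipe.set_params(mapper__colC__functiontransformer__func=np.log1p)
```

The main difference is that ColumnTransformer keys parameters on the user-chosen step name rather than on the column names themselves.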
We still need to add more tests and think about possible edge cases. For example, I am not sure how to handle this case:

from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.svm import SVC
from sklearn_pandas import DataFrameMapper

mapper_fs = DataFrameMapper([(['children', 'salary'], SelectKBest(chi2, k=1))])
pipeline = Pipeline([
    ('mapper', mapper_fs),
    ('classifier', SVC(kernel='linear'))
])
# how to handle transformers with several columns?
pipeline.set_params(...)
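One way scikit-learn itself sidesteps the multi-column question is by keying parameters on an explicit step name instead of the column list, as ColumnTransformer does; a hedged sketch of that approach for the example above:

```python
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# 'kbest' is an arbitrary step name; the parameter path no longer depends
# on how many columns the transformer consumes
ct = ColumnTransformer([
    ('kbest', SelectKBest(chi2, k=1), ['children', 'salary']),
])
pipe = Pipeline([('mapper', ct), ('classifier', SVC(kernel='linear'))])

pipe.set_params(mapper__kbest__k=2)
assert pipe.get_params()['mapper__kbest__k'] == 2
```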

Also, I think the current implementation of set_params could be revised and optimized. For the case where every mapped column has only a single transformer, we could delegate get_params and set_params to the Pipeline class instead of writing custom code.

I would be glad to hear your thoughts and proposals on finalizing the PR and making DataFrameMapper grid-search ready.

devforfu avatar Sep 05 '18 08:09 devforfu

Sorry to revive the issue, but is there any chance of this being merged, @devforfu? This would be a really nice feature to have.

prasoon2211 avatar Mar 12 '19 15:03 prasoon2211

Hello, your message has been received, thank you. [translated from Chinese]

hu-minghao avatar Aug 27 '22 15:08 hu-minghao

@devforfu, thanks for the work.

I have copied all the code from the latest commit of your fork and tried it with scikit-learn==1.0.2.

This is how the parameters look when multiple column names are used:

>>> print(pipe_6_dbg5.get_params(deep=True).keys())
dict_keys(['default', 'df_out', 'features', 'input_df', 'sparse', "['AveRooms', 'AveBedrms', 'Population']__degree", "['AveRooms', 'AveBedrms', 'Population']__include_bias", "['AveRooms', 'AveBedrms', 'Population']__interaction_only", "['AveRooms', 'AveBedrms', 'Population']__order", "['AveRooms', 'AveBedrms', 'Population']__copy", "['AveRooms', 'AveBedrms', 'Population']__with_mean", "['AveRooms', 'AveBedrms', 'Population']__with_std", "['AveOccup', 'HouseAge']__copy", "['AveOccup', 'HouseAge']__norm"])

Reproducible example:

## getting data
import pandas as pd
from sklearn.datasets import fetch_california_housing

cal_house = fetch_california_housing(as_frame=True)
cal_house = pd.merge(left=cal_house['data'], right=cal_house['target'],
                     left_index=True, right_index=True)

## making pipeline
from sklearn import pipeline, preprocessing

## `DataFrameMapper` code from https://github.com/devforfu/sklearn-pandas/blob/master/sklearn_pandas/dataframe_mapper.py 
pipe_6_dbg5 = DataFrameMapper(features=[
                            (['AveRooms', 'AveBedrms', 'Population'],
                                preprocessing.PolynomialFeatures(degree=2, include_bias=False)),
                            (['AveRooms', 'AveBedrms', 'Population'], preprocessing.StandardScaler()),
                            (['AveOccup', 'HouseAge'], preprocessing.Normalizer()),
                        ], default=None, df_out=True, input_df=True)

pipe_6_dbg5.fit(X=cal_house.drop(columns='MedHouseVal', axis=1), y=cal_house.loc[:,'MedHouseVal'])
print(pipe_6_dbg5.get_params(deep=True).keys())

The param names were similar (with square brackets and single quotes in them) when I inherited from the latest sklearn-pandas version of DataFrameMapper and overrode it with your get_params and set_params code.

It would be a lot more useful, intuitive, and practical if sklearn_pandas.DataFrameMapper also accepted an explicit name for each transformer, the way sklearn.compose.ColumnTransformer does.

Also, your code uses the column names to name the parameters, which gets unintuitive when the same column appears in several transformer steps. In my example above, if both the StandardScaler and the Normalizer used the same set of column names, setting a parameter like copy would be ambiguous.
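To illustrate the ambiguity, here is a sketch of how explicit step names (as in scikit-learn's ColumnTransformer) keep the parameter paths distinct even when two transformers share the same columns:

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, Normalizer

# both steps consume the same columns, but each gets its own name,
# so their 'copy' parameters live at different paths
ct = ColumnTransformer([
    ('scale', StandardScaler(), ['AveOccup', 'HouseAge']),
    ('norm', Normalizer(), ['AveOccup', 'HouseAge']),
])

params = ct.get_params()
assert 'scale__copy' in params and 'norm__copy' in params
```

With column-name-based keys, both copy parameters would collapse onto the same key; with step names, they stay addressable independently.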

naveen-marthala avatar Aug 28 '22 06:08 naveen-marthala