sklearn-pandas
Making DataFrameMapper compatible with GridSearchCV
In this PR, an attempt to implement the proposal from issue #159 is made. The idea is to write custom `get_params` and `set_params` methods that are compatible with scikit-learn grid search objects.

The following snippet shows the supported features:
```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, FunctionTransformer
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn_pandas import DataFrameMapper

mapper = DataFrameMapper([
    (['colA'], StandardScaler()),
    (['colB'], StandardScaler()),
    ('colC', [StandardScaler(), FunctionTransformer()]),
])
pipeline = Pipeline([
    ('mapper', mapper),
    ('classifier', SVC(kernel='linear')),
])

# 1. the pipeline parameters include parameters of the nested transformers
parameters = pipeline.get_params()
assert 'mapper__colA__with_mean' in parameters
assert 'mapper__colA__with_std' in parameters
assert 'mapper__colB__with_mean' in parameters
assert 'mapper__colB__with_std' in parameters

# 2. the parameters of nested transformers can be set from the outside
pipeline.set_params(
    mapper__colA__with_mean=True,
    mapper__colB__with_std=False
)

# 3. getting parameters from a list of transformers
assert 'mapper__colC__standardscaler__with_mean' in parameters
assert 'mapper__colC__functiontransformer__func' in parameters

# 4. setting parameters on a list of transformers
pipeline.set_params(
    mapper__colC__standardscaler__with_mean=True,
    mapper__colC__functiontransformer__func=np.log1p
)

# 5. grid search over the parameters
param_grid = dict(
    mapper__colA__with_mean=[True, False],
    mapper__colB__with_std=[True, False],
    mapper__colC__functiontransformer__func=[np.log1p, np.exp, None]
)
grid_search = GridSearchCV(pipeline, param_grid=param_grid)
grid_search.fit(X, y)  # X, y: training data
```
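To make the naming convention above concrete, here is a simplified sketch of how per-column parameter names could be flattened into sklearn's `step__param` style. This is an illustration only, not the PR's actual implementation; `mapper_get_params` is a hypothetical helper.

```python
# Hypothetical sketch of the naming scheme: each (columns, transformer)
# pair exposes the transformer's params under a column-derived prefix,
# mirroring sklearn's 'step__param' convention.
from sklearn.preprocessing import StandardScaler


def mapper_get_params(features):
    """Flatten per-column transformer params into 'col__param' keys."""
    params = {}
    for columns, transformer in features:
        # use the first column name as prefix when columns is a list
        prefix = columns[0] if isinstance(columns, list) else columns
        for name, value in transformer.get_params().items():
            params[f"{prefix}__{name}"] = value
    return params


features = [(['colA'], StandardScaler()), (['colB'], StandardScaler())]
params = mapper_get_params(features)
assert 'colA__with_mean' in params
assert 'colB__with_std' in params
```

A real implementation would also have to handle lists of transformers per column (inserting the lowercased class name, as in `colC__standardscaler__with_mean`) and the reverse routing for `set_params`.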
We still need to add more tests and think about possible edge cases. For example, I am not sure how to handle this case:
```python
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.svm import SVC
from sklearn_pandas import DataFrameMapper

mapper_fs = DataFrameMapper([(['children', 'salary'], SelectKBest(chi2, k=1))])
pipeline = Pipeline([
    ('mapper', mapper_fs),
    ('classifier', SVC(kernel='linear')),
])
# how to handle transformers that span several columns?
pipeline.set_params(...)
```
Also, I think that the current implementation of `set_params` could be revised and optimized. For the case when every transformed column has only a single transformer, we could handle `get_params` and `set_params` via the `Pipeline` class instead of writing custom code.

I would be glad to hear your thoughts and proposals to finalize the PR and make `DataFrameMapper` grid-search ready.
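As background for the delegation idea above: sklearn's `Pipeline` already generates and routes nested parameter names on its own, so wrapping a column's transformer list in a `Pipeline` would get that handling for free. A minimal sketch, not the PR's code:

```python
# Hedged sketch: let sklearn's Pipeline handle nested parameter naming
# instead of custom code. make_pipeline names each step after the
# lowercased class name, producing 'standardscaler__with_mean'-style keys.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, FunctionTransformer

# a list of transformers for one column becomes a single Pipeline
col_pipeline = make_pipeline(StandardScaler(), FunctionTransformer(np.log1p))
params = col_pipeline.get_params(deep=True)
assert 'standardscaler__with_mean' in params
assert 'functiontransformer__func' in params

# set_params routing also comes for free
col_pipeline.set_params(standardscaler__with_mean=False)
```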
Sorry to revive the issue but any chances of this being merged @devforfu? This would be a really nice feature to have.
Hello, received, thank you.
@devforfu, thanks for the work. I copied all the code from the latest commit of your fork and tried it with `scikit-learn==1.0.2`. This is how the parameters look when multiple column names are used:
```python
>>> print(pipe_6_dbg5.get_params(deep=True).keys())
dict_keys(['default', 'df_out', 'features', 'input_df', 'sparse', "['AveRooms', 'AveBedrms', 'Population']__degree", "['AveRooms', 'AveBedrms', 'Population']__include_bias", "['AveRooms', 'AveBedrms', 'Population']__interaction_only", "['AveRooms', 'AveBedrms', 'Population']__order", "['AveRooms', 'AveBedrms', 'Population']__copy", "['AveRooms', 'AveBedrms', 'Population']__with_mean", "['AveRooms', 'AveBedrms', 'Population']__with_std", "['AveOccup', 'HouseAge']__copy", "['AveOccup', 'HouseAge']__norm"])
```
Reproducible example:

```python
import pandas as pd
from sklearn import preprocessing
from sklearn.datasets import fetch_california_housing
# `DataFrameMapper` code from
# https://github.com/devforfu/sklearn-pandas/blob/master/sklearn_pandas/dataframe_mapper.py

## getting data
cal_house = fetch_california_housing(as_frame=True)
cal_house = pd.merge(left=cal_house['data'], right=cal_house['target'],
                     left_index=True, right_index=True)

## making the pipeline
pipe_6_dbg5 = DataFrameMapper(features=[
    (['AveRooms', 'AveBedrms', 'Population'],
     preprocessing.PolynomialFeatures(degree=2, include_bias=False)),
    (['AveRooms', 'AveBedrms', 'Population'], preprocessing.StandardScaler()),
    (['AveOccup', 'HouseAge'], preprocessing.Normalizer()),
], default=None, df_out=True, input_df=True)
pipe_6_dbg5.fit(X=cal_house.drop(columns='MedHouseVal'),
                y=cal_house.loc[:, 'MedHouseVal'])
print(pipe_6_dbg5.get_params(deep=True).keys())
```
The params were similar (with square brackets and single quotes in the param names) when I inherited `DataFrameMapper` from the latest version of sklearn-pandas and overrode it with your `get_params` and `set_params` code.

It would be a lot more useful, intuitive, and practical if `sklearn_pandas.DataFrameMapper` also took a name, like `sklearn.compose.ColumnTransformer` does.
Also, your code uses the column names to name the parameters, so setting params gets unintuitive when the same column is used multiple times in different transformer steps. In my example code above, if both the `StandardScaler` and the `Normalizer` used the same set of column names, setting a parameter like `copy` just wouldn't make sense.
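For comparison, `sklearn.compose.ColumnTransformer` requires an explicit name per `(name, transformer, columns)` triple, so parameter keys stay readable and unambiguous even when the same columns appear in several steps:

```python
# ColumnTransformer's explicit step names give clean parameter keys,
# with no brackets or quotes and no collisions on repeated columns.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, Normalizer

ct = ColumnTransformer([
    ('scale', StandardScaler(), ['AveRooms', 'AveBedrms', 'Population']),
    ('norm', Normalizer(), ['AveOccup', 'HouseAge']),
])
params = ct.get_params(deep=True)
assert 'scale__with_mean' in params  # named after the step, not the columns
assert 'norm__norm' in params

# nested set_params routes through the step name
ct.set_params(scale__with_mean=False)
```

Adopting the same `(name, columns, transformer)` convention in `DataFrameMapper` would resolve both the bracket-laden keys and the repeated-column ambiguity.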