API options for Pandas output
Related to:
- https://github.com/scikit-learn/scikit-learn/issues/5523 pandas in, pandas out
- https://github.com/scikit-learn/scikit-learn/issues/10603 typical data science use case
- https://github.com/scikit-learn/scikit-learn/pull/20100 array out in preprocessing
- #20110 output dataframes in column transformer
This issue summarizes the API options for pandas output, illustrated with a typical data science use case:

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# numeric_features and categorical_features are lists of column names
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])
categorical_transformer = OneHotEncoder(handle_unknown='ignore')
preprocessor = ColumnTransformer([
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)])
pipe = Pipeline([('preprocessor', preprocessor), ('classifier', LogisticRegression())])
```
In all of the following options, `pipe[-1].feature_names_in_` is used to get the feature names seen by `LogisticRegression`. All options require `feature_names_in_` to enforce column name consistency between `fit` and `transform`.
## Option 1: `output` kwarg in `transform`

All transformers will accept an `output='pandas'` kwarg in `transform`. To configure transformers to output dataframes during `fit`:

```python
# passes `output="pandas"` to all steps during `transform`
pipe.fit(X_train_df, transform_output="pandas")

# output of preprocessing in pandas
pipe[:-1].transform(X_train_df, output="pandas")
```

`Pipeline` will pass `output="pandas"` to every `transform` method during `fit`, so the original pipeline does not need to change. This option requires meta-estimators that contain transformers, such as `Pipeline` and `ColumnTransformer`, to pass `output="pandas"` to every `transformer.transform`.
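To make the forwarding mechanism concrete, here is a minimal, dependency-free sketch of what Option 1 implies for meta-estimators. All class and method names below are illustrative toys, not the actual scikit-learn implementation:

```python
# Hypothetical sketch: a meta-estimator forwarding an `output` kwarg
# to each step's `transform`, as Option 1 would require.

class ToyTransformer:
    def fit(self, X, **kwargs):
        return self

    def transform(self, X, output=None):
        # A real transformer would wrap X in a DataFrame when
        # output == "pandas"; here we just tag the result.
        return {"data": X, "format": output or "ndarray"}


class ToyPipeline:
    def __init__(self, steps):
        self.steps = steps

    def fit(self, X, transform_output=None):
        # Forward `output=...` to every intermediate transform during fit.
        for name, step in self.steps:
            step.fit(X)
            result = step.transform(X, output=transform_output)
            X = result["data"]
        self.last_format_ = result["format"]
        return self


pipe = ToyPipeline([("a", ToyTransformer()), ("b", ToyTransformer())])
pipe.fit([[1.0], [2.0]], transform_output="pandas")
print(pipe.last_format_)  # -> pandas
```

The point of the sketch is that the user-facing pipeline definition stays unchanged; only the meta-estimators need to learn to forward the kwarg.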
## Option 2: `__init__` parameter

All transformers will accept a `transform_output` parameter in `__init__`:

```python
numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median',
                              transform_output="pandas")),
    ('scaler', StandardScaler(transform_output="pandas"))])
categorical_transformer = OneHotEncoder(handle_unknown='ignore',
                                        transform_output="pandas")
preprocessor = ColumnTransformer([
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)],
    transform_output="pandas")
pipe = Pipeline([('preprocessor', preprocessor), ('classifier', LogisticRegression())])

# All transformers are configured to output dataframes
pipe.fit(X_train_df)
```
## Option 2b: Have a global config for `transform_output`

For a better user experience, we can have a global config. By default, `transform_output` is set to `'global'` in all transformers, meaning the value is looked up in the global config:

```python
import sklearn

sklearn.set_config(transform_output="pandas")

pipe = ...
pipe.fit(X_train_df)
```
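A minimal sketch of how the `'global'` default could resolve against such a config. The names (`ToyScaler`, `_config`, `_resolve_output`) are illustrative stand-ins, not scikit-learn internals:

```python
# Hypothetical sketch of Option 2b: transform_output="global" defers
# to a module-level config, which set_config mutates.

_config = {"transform_output": "default"}  # stand-in for sklearn's global config

def set_config(**kwargs):
    _config.update(kwargs)

class ToyScaler:
    def __init__(self, transform_output="global"):
        self.transform_output = transform_output

    def _resolve_output(self):
        # "global" means: look the value up at transform time.
        if self.transform_output == "global":
            return _config["transform_output"]
        return self.transform_output

    def transform(self, X):
        return {"data": X, "format": self._resolve_output()}

scaler = ToyScaler()
print(scaler.transform([[1.0]])["format"])  # -> default

set_config(transform_output="pandas")
print(scaler.transform([[1.0]])["format"])  # -> pandas

# A per-estimator setting overrides the global config:
print(ToyScaler(transform_output="numpy").transform([[1.0]])["format"])  # -> numpy
```

Resolving the value lazily at `transform` time (rather than at `__init__` time) is what lets Option 2 and Option 2b coexist: an explicit `__init__` value wins, `'global'` follows the config.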
## Option 3: Use SLEP 006

Have all transformers request the output. Similar to Option 1, every transformer needs an `output='pandas'` kwarg in `transform`:

```python
numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median').request_for_transform(output=True)),
    ('scaler', StandardScaler().request_for_transform(output=True))])
categorical_transformer = (OneHotEncoder(handle_unknown='ignore')
                           .request_for_transform(output=True))
preprocessor = (ColumnTransformer([
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)])
    .request_for_transform(output=True))
pipe = Pipeline([('preprocessor', preprocessor), ('classifier', LogisticRegression())])

pipe.fit(X_train_df, output="pandas")
```
## Option 3b: Have a global config for requests

For a better user experience, we can have a global config:

```python
import sklearn

sklearn.set_config(request_for_transform={"output": True})

pipe = ...
pipe.fit(X_train_df, output="pandas")
```
## Summary

Options 2 and 3 are very similar in that both require every transformer to be adjusted, which is not the best API/UX. Options 2b and 3b try to simplify the API with a global config. Overall, I think Option 1 has the best user experience.
CC: @amueller @ogrisel @glemaitre @adrinjalali @lorentzenchr @jnothman @GaelVaroquaux
The issue with the global config is that we haven't figured out how to handle that nicely in a multiprocessing setting, have we?
I think from the user's perspective, option 2 makes more sense since it's not really a request.
Also, when I think about third party meta estimators, I'm not sure which option is better.
> The issue with the global config is that we haven't figured out how to handle that nicely in a multiprocessing setting, have we?
In the context of scikit-learn we have a workaround that works:
https://github.com/scikit-learn/scikit-learn/blob/6d67937b3ce28fd3fc966d3d417df56c08c98502/sklearn/utils/fixes.py#L187-L205
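The workaround linked above captures the global config when tasks are dispatched and restores it inside the worker. A rough, dependency-free sketch of that idea (the real code wraps joblib's `delayed`; the names below are illustrative):

```python
# Hypothetical sketch: snapshot the global config at dispatch time and
# re-apply it around the call, so workers see the same settings the
# dispatching process had.

import functools

_config = {"transform_output": "default"}

def get_config():
    return dict(_config)

def set_config(**kwargs):
    _config.update(kwargs)

def with_config(func):
    """Capture the current config; restore it while the task runs."""
    snapshot = get_config()

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        saved = get_config()
        _config.update(snapshot)
        try:
            return func(*args, **kwargs)
        finally:
            _config.clear()
            _config.update(saved)

    return wrapper

set_config(transform_output="pandas")
task = with_config(lambda: get_config()["transform_output"])

# Even if the config changes (or a fresh worker starts with defaults)
# before the task runs, the snapshot taken at dispatch time wins:
set_config(transform_output="default")
print(task())  # -> pandas
```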
I have the feeling that option 3 would be unnecessarily verbose.

Option 2 and option 2b are not necessarily mutually exclusive, no?

From an implementation point of view, option 2b (and maybe option 2) would impose the use of a decorator on all transformers, right? Or we would provide the implementation of a public `transform` method in `TransformerMixin` and ask the subclasses to implement a private `_transform` abstract method. My worry is how to handle the docstring and not break IDE autocomplete based on static code inspection.
For larger pipelines, option 1 is my personal favorite as a user.
> Options 2 and 3 are very similar in that both require every transformer to be adjusted, which is not the best API/UX. Options 2b and 3b try to simplify the API with a global config. Overall, I think Option 1 has the best user experience.
I agree with your analysis.
Would it be interesting to have a version of option 1 where the default behavior is controlled by a global flag and is overridden by passing an argument to the transformer?
> From an implementation point of view, option 2b (and maybe option 2) would impose the use of a decorator on all transformers, right? Or we would provide the implementation of a public `transform` method in `TransformerMixin` and ask the subclasses to implement a private `_transform` abstract method. My worry is how to handle the docstring and not break IDE autocomplete based on static code inspection.
@ogrisel Option 2b without the `__init__` parameter is very close to my original PR with a global config: https://github.com/scikit-learn/scikit-learn/pull/16772. I think we decided not to go down the path of having a global config.
As for the implementation, I would prefer not to hide it in a mixin and would rather do something like https://github.com/scikit-learn/scikit-learn/pull/20100. The idea is to use `self._validate_data` to record the column names, while a decorator around `transform` handles wrapping the output into a pandas dataframe. As an alternative, I can see a more symmetric approach that does not rely on `self._validate_data`, where we have two decorators: `record_column_names` for `fit` and `wrap_transform` for `transform`.
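A minimal sketch of that two-decorator alternative. To keep it dependency-free, a plain dict of column -> values stands in for a pandas DataFrame; the decorator bodies are illustrative, not the PR's actual code:

```python
# Hypothetical sketch of the symmetric approach: record_column_names
# stores the input columns during fit, and wrap_transform uses them
# to label the transform output.

import functools

def record_column_names(fit):
    @functools.wraps(fit)
    def wrapper(self, X, *args, **kwargs):
        # With a real DataFrame this would read X.columns.
        self.feature_names_in_ = list(X.keys())
        return fit(self, X, *args, **kwargs)
    return wrapper

def wrap_transform(transform):
    @functools.wraps(transform)
    def wrapper(self, X, *args, **kwargs):
        out = transform(self, X, *args, **kwargs)
        # Pair each output column with the names recorded in fit.
        return dict(zip(self.feature_names_in_, out))
    return wrapper

class ToyScaler:
    @record_column_names
    def fit(self, X):
        return self

    @wrap_transform
    def transform(self, X):
        # Identity "scaling": return columns in the recorded order.
        return [X[name] for name in self.feature_names_in_]

X = {"age": [20, 30], "height": [1.7, 1.8]}
scaler = ToyScaler().fit(X)
print(scaler.transform(X))  # -> {'age': [20, 30], 'height': [1.7, 1.8]}
```

The symmetry is the point: one decorator owns the fit-time bookkeeping, the other owns the transform-time wrapping, and neither touches `self._validate_data`.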
Can we close as SLEP018 was accepted?
I agree, we can close this issue.