API options for Pandas output
Related to:
- https://github.com/scikit-learn/scikit-learn/issues/5523 pandas in, pandas out
- https://github.com/scikit-learn/scikit-learn/issues/10603 typical data science use case
- https://github.com/scikit-learn/scikit-learn/pull/20100 array out in preprocessing
- #20110 output dataframes in column transformer
This issue summarizes the API options for pandas output, illustrated with a typical data science use case:

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# numeric_features and categorical_features are lists of column names
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])
categorical_transformer = OneHotEncoder(handle_unknown='ignore')
preprocessor = ColumnTransformer([
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)])
pipe = Pipeline([('preprocessor', preprocessor), ('classifier', LogisticRegression())])
```
In all of the following options, `pipe[-1].feature_names_in_` is used to get the feature names seen by `LogisticRegression`. All options require `feature_names_in_` to enforce column name consistency between `fit` and `transform`.
## Option 1: `output` kwarg in `transform`

All transformers will accept an `output='pandas'` kwarg in `transform`. To configure transformers to output dataframes during `fit`:

```python
# passes `output="pandas"` to all steps during `transform`
pipe.fit(X_train_df, transform_output="pandas")

# output of preprocessing in pandas
pipe[:-1].transform(X_train_df, output="pandas")
```

`Pipeline` will pass `output="pandas"` to every `transform` method during `fit`, so the original pipeline does not need to change. This option requires meta-estimators that contain transformers, such as `Pipeline` and `ColumnTransformer`, to pass `output="pandas"` to every `transformer.transform`.
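To make the forwarding mechanism concrete, here is a minimal, dependency-free sketch of what Option 1 implies for meta-estimators. All class and method names below are illustrative toys, not the actual scikit-learn implementation:

```python
# Hypothetical sketch: a meta-estimator forwarding an `output` kwarg
# to each step's `transform`, as Option 1 would require.

class ToyTransformer:
    def fit(self, X, **kwargs):
        return self

    def transform(self, X, output=None):
        # A real transformer would wrap X in a DataFrame when
        # output == "pandas"; here we just tag the result.
        return {"data": X, "format": output or "ndarray"}


class ToyPipeline:
    def __init__(self, steps):
        self.steps = steps

    def fit(self, X, transform_output=None):
        # Forward `output=...` to every intermediate transform during fit.
        for name, step in self.steps:
            step.fit(X)
            result = step.transform(X, output=transform_output)
            X = result["data"]
        self.last_format_ = result["format"]
        return self


pipe = ToyPipeline([("a", ToyTransformer()), ("b", ToyTransformer())])
pipe.fit([[1.0], [2.0]], transform_output="pandas")
print(pipe.last_format_)  # -> pandas
```

The point of the sketch is that the user-facing pipeline definition stays unchanged; only the meta-estimators need to learn to forward the kwarg.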
## Option 2: `__init__` parameter

All transformers will accept a `transform_output` parameter in `__init__`:

```python
numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median',
                              transform_output="pandas")),
    ('scaler', StandardScaler(transform_output="pandas"))])
categorical_transformer = OneHotEncoder(handle_unknown='ignore',
                                        transform_output="pandas")
preprocessor = ColumnTransformer([
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)],
    transform_output="pandas")
pipe = Pipeline([('preprocessor', preprocessor), ('classifier', LogisticRegression())])

# All transformers are configured to output dataframes
pipe.fit(X_train_df)
```
## Option 2b: Have a global config for `transform_output`

For a better user experience, we can have a global config. By default, `transform_output` is set to `'global'` in all transformers, meaning the value is looked up in the global config:

```python
import sklearn

sklearn.set_config(transform_output="pandas")

pipe = ...
pipe.fit(X_train_df)
```
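A minimal sketch of how the `'global'` default could resolve against such a config. The names (`ToyScaler`, `_config`, `_resolve_output`) are illustrative stand-ins, not scikit-learn internals:

```python
# Hypothetical sketch of Option 2b: transform_output="global" defers
# to a module-level config, which set_config mutates.

_config = {"transform_output": "default"}  # stand-in for sklearn's global config

def set_config(**kwargs):
    _config.update(kwargs)

class ToyScaler:
    def __init__(self, transform_output="global"):
        self.transform_output = transform_output

    def _resolve_output(self):
        # "global" means: look the value up at transform time.
        if self.transform_output == "global":
            return _config["transform_output"]
        return self.transform_output

    def transform(self, X):
        return {"data": X, "format": self._resolve_output()}

scaler = ToyScaler()
print(scaler.transform([[1.0]])["format"])  # -> default

set_config(transform_output="pandas")
print(scaler.transform([[1.0]])["format"])  # -> pandas

# A per-estimator setting overrides the global config:
print(ToyScaler(transform_output="numpy").transform([[1.0]])["format"])  # -> numpy
```

Resolving the value lazily at `transform` time (rather than at `__init__` time) is what lets Option 2 and Option 2b coexist: an explicit `__init__` value wins, `'global'` follows the config.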
## Option 3: Use SLEP 006

Have all transformers request the output. Similar to Option 1, every transformer needs an `output='pandas'` kwarg in `transform`:

```python
numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median').request_for_transform(output=True)),
    ('scaler', StandardScaler().request_for_transform(output=True))])
categorical_transformer = (OneHotEncoder(handle_unknown='ignore')
                           .request_for_transform(output=True))
preprocessor = (ColumnTransformer([
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)])
    .request_for_transform(output=True))
pipe = Pipeline([('preprocessor', preprocessor), ('classifier', LogisticRegression())])

pipe.fit(X_train_df, output="pandas")
```
## Option 3b: Have a global config for requests

For a better user experience, we can have a global config:

```python
import sklearn

sklearn.set_config(request_for_transform={"output": True})

pipe = ...
pipe.fit(X_train_df, output="pandas")
```
## Summary

Options 2 and 3 are very similar in that both require every transformer to be adjusted, which is not the best API/UX. Options 2b and 3b try to simplify the API with a global config. Overall, I think Option 1 has the best user experience.
CC: @amueller @ogrisel @glemaitre @adrinjalali @lorentzenchr @jnothman @GaelVaroquaux
The issue with the global config is that we haven't figured out how to handle that nicely in a multiprocessing setting, have we?
I think from the user's perspective, option 2 makes more sense since it's not really a request.
Also, when I think about third party meta estimators, I'm not sure which option is better.
> The issue with the global config is that we haven't figured out how to handle that nicely in a multiprocessing setting, have we?
In the context of scikit-learn we have a workaround that works:
https://github.com/scikit-learn/scikit-learn/blob/6d67937b3ce28fd3fc966d3d417df56c08c98502/sklearn/utils/fixes.py#L187-L205
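The workaround linked above captures the global config when tasks are dispatched and restores it inside the worker. A rough, dependency-free sketch of that idea (the real code wraps joblib's `delayed`; the names below are illustrative):

```python
# Hypothetical sketch: snapshot the global config at dispatch time and
# re-apply it around the call, so workers see the same settings the
# dispatching process had.

import functools

_config = {"transform_output": "default"}

def get_config():
    return dict(_config)

def set_config(**kwargs):
    _config.update(kwargs)

def with_config(func):
    """Capture the current config; restore it while the task runs."""
    snapshot = get_config()

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        saved = get_config()
        _config.update(snapshot)
        try:
            return func(*args, **kwargs)
        finally:
            _config.clear()
            _config.update(saved)

    return wrapper

set_config(transform_output="pandas")
task = with_config(lambda: get_config()["transform_output"])

# Even if the config changes (or a fresh worker starts with defaults)
# before the task runs, the snapshot taken at dispatch time wins:
set_config(transform_output="default")
print(task())  # -> pandas
```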
I have the feeling that option 3 would be unnecessarily verbose.

Option 2 and option 2b are not necessarily mutually exclusive, no?

From an implementation point of view, option 2b (and maybe option 2) would impose the use of a decorator on all transformers, right? Or we would provide the implementation of a public `transform` method in `TransformerMixin` and ask the subclasses to implement a private `_transform` abstract method. My worry is how to handle the docstring and not break IDE autocomplete based on static code inspection.
For larger pipelines, option 1 is my personal favorite as a user.
> Options 2 and 3 are very similar in that both require every transformer to be adjusted, which is not the best API/UX. Options 2b and 3b try to simplify the API with a global config. Overall, I think Option 1 has the best user experience.
I agree with your analysis.
Would it be interesting to have a version of option 1 where the default behavior is controlled by a global flag and is overridden by passing an argument to the transformer?
> From an implementation point of view, option 2b (and maybe option 2) would impose the use of a decorator on all transformers, right? Or we would provide the implementation of a public `transform` method in `TransformerMixin` and ask the subclasses to implement a private `_transform` abstract method. My worry is how to handle the docstring and not break IDE autocomplete based on static code inspection.
@ogrisel Option 2b without the `__init__` parameter is very close to my original PR with a global config: https://github.com/scikit-learn/scikit-learn/pull/16772. I think we decided not to go down the path of having a global config.
As for the implementation, I would prefer not to hide it in a mixin and would rather do something like https://github.com/scikit-learn/scikit-learn/pull/20100. The idea is to use `self._validate_data` to record the column names, while a decorator around `transform` handles wrapping the output into a pandas dataframe. As an alternative, I can see a more symmetric approach that does not rely on `self._validate_data`, where we have two decorators: `record_column_names` for `fit` and `wrap_transform` for `transform`.
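A minimal sketch of that two-decorator alternative. To keep it dependency-free, a plain dict of column -> values stands in for a pandas DataFrame; the decorator bodies are illustrative, not the PR's actual code:

```python
# Hypothetical sketch of the symmetric approach: record_column_names
# stores the input columns during fit, and wrap_transform uses them
# to label the transform output.

import functools

def record_column_names(fit):
    @functools.wraps(fit)
    def wrapper(self, X, *args, **kwargs):
        # With a real DataFrame this would read X.columns.
        self.feature_names_in_ = list(X.keys())
        return fit(self, X, *args, **kwargs)
    return wrapper

def wrap_transform(transform):
    @functools.wraps(transform)
    def wrapper(self, X, *args, **kwargs):
        out = transform(self, X, *args, **kwargs)
        # Pair each output column with the names recorded in fit.
        return dict(zip(self.feature_names_in_, out))
    return wrapper

class ToyScaler:
    @record_column_names
    def fit(self, X):
        return self

    @wrap_transform
    def transform(self, X):
        # Identity "scaling": return columns in the recorded order.
        return [X[name] for name in self.feature_names_in_]

X = {"age": [20, 30], "height": [1.7, 1.8]}
scaler = ToyScaler().fit(X)
print(scaler.transform(X))  # -> {'age': [20, 30], 'height': [1.7, 1.8]}
```

The symmetry is the point: one decorator owns the fit-time bookkeeping, the other owns the transform-time wrapping, and neither touches `self._validate_data`.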
Can we close as SLEP018 was accepted?
I agree, we can close this issue.