sklearn-pandas

Use new transformer.get_feature_names_out function

Open falcaopetri opened this issue 3 years ago • 10 comments

Transformers' get_feature_names is deprecated in favor of get_feature_names_out and will be removed in sklearn 1.2 (see sklearn v1.0 changelog, scikit-learn/scikit-learn#18444, and, for example, OneHotEncoder.get_feature_names).

This PR:

  • Prefers estimator.get_feature_names_out() over estimator.get_feature_names()
  • Configures nox to run tests with both scikit-learn 0.23 and 1.0

The change currently breaks the README#Dynamic Columns example. This happens because StandardScaler.get_feature_names does not exist in either sklearn 0.23 or 1.0:

  • Current example's transformed_names_ is ['x_0', 'x_1', 'x_2', 'x_3', 'petal_0', 'petal_1'].
  • In sklearn 1.0, though, there is a StandardScaler.get_feature_names_out, which this PR uses and which therefore produces the output ['x_x0', 'x_x1', 'x_x2', 'x_x3', 'petal_0', 'petal_1'].
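
The difference between the two outputs can be illustrated with a minimal sketch. It is not the sklearn-pandas implementation: `transformed_names`, `LegacyScaler`, and `ModernScaler` are hypothetical names, and the two stub classes only stand in for a StandardScaler without and with get_feature_names_out. It assumes the mapper prefixes whatever names the estimator reports and falls back to positional indices when none are available or their count does not match the output width.

```python
def transformed_names(prefix, estimator, n_outputs):
    """Hypothetical helper: build output column names for one mapper entry."""
    # Prefer the newer API, then the deprecated one, then give up.
    if hasattr(estimator, "get_feature_names_out"):
        names = list(estimator.get_feature_names_out())
    elif hasattr(estimator, "get_feature_names"):
        names = list(estimator.get_feature_names())
    else:
        names = None

    # Use estimator-provided names only when they match the output width.
    if names is not None and len(names) == n_outputs:
        return [f"{prefix}_{name}" for name in names]
    # Otherwise fall back to positional suffixes.
    return [f"{prefix}_{i}" for i in range(n_outputs)]


class LegacyScaler:
    """Stub with no name-reporting method (assumed sklearn 0.23 behaviour)."""


class ModernScaler:
    """Stub that reports default names like x0, x1, ... (assumed sklearn 1.0 behaviour)."""

    def get_feature_names_out(self):
        return ["x0", "x1", "x2", "x3"]


legacy = transformed_names("x", LegacyScaler(), 4)
modern = transformed_names("x", ModernScaler(), 4)
print(legacy)
print(modern)
```

With the stubs above, the first call yields the positional names and the second yields the prefixed estimator names, mirroring the two bullets.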

falcaopetri avatar Oct 17 '21 18:10 falcaopetri

FYI: it seems sklearn >= 1.0 requires Python>=3.7.

ragrawal avatar Oct 18 '21 04:10 ragrawal

One thing I noticed is that the current implementation is already slightly inconsistent under sklearn 0.23:

import pandas as pd
import sklearn.preprocessing
from sklearn_pandas import DataFrameMapper

df = pd.DataFrame({'col1': [0, 0, 1, 1, 2, 3, 0], 'col2': [0, 0, 1, 1, 2, 3, 0]})
mapper = DataFrameMapper([
    (['col1', 'col2'], sklearn.preprocessing.StandardScaler()),
    (['col1', 'col2'], sklearn.preprocessing.OneHotEncoder()),
], df_out=True)
print(mapper.fit_transform(df).columns)

With sklearn 0.23 or 1.0, the output is:

Index(['col1_col2_0', 'col1_col2_1', 'col1_col2_x0_0', 'col1_col2_x0_1',
       'col1_col2_x0_2', 'col1_col2_x0_3', 'col1_col2_x1_0', 'col1_col2_x1_1',
       'col1_col2_x1_2', 'col1_col2_x1_3'], dtype='object')

Note that StandardScaler columns are named {name}_{i}, while OneHotEncoder columns are named {name}_{estimator.get_feature_names()}.

Meanwhile, sklearn 1.0 with this PR outputs:

Index(['col1_col2_x0', 'col1_col2_x1', 'col1_col2_x0_0', 'col1_col2_x0_1',
       'col1_col2_x0_2', 'col1_col2_x0_3', 'col1_col2_x1_0', 'col1_col2_x1_1',
       'col1_col2_x1_2', 'col1_col2_x1_3'], dtype='object')

(None of these column names are very helpful, though, as discussed in #174.)

falcaopetri avatar Oct 19 '21 02:10 falcaopetri

The latest versions of scikit-learn (1.1+) have improved the coverage of Transformers that implement get_feature_names_out significantly (#21308 - Implement get_feature_names_out for all estimators). Is there any possibility of revisiting this issue? The current naming behaviour of DataFrameMapper is still not working correctly, as demonstrated by @falcaopetri above.

Having correct output names in a pipeline of sequential mappers is crucial, and the naming gets out of hand quickly when the dataset has multiple columns. The problem is exacerbated with Transformers that operate non-independently on multiple columns (such as PolynomialFeatures, which generates interaction features), since this prohibits the use of gen_features (which otherwise performs column naming better, i.e. without listing all columns for each feature). See the example below:

import pandas as pd
from sklearn.preprocessing import PolynomialFeatures
from sklearn_pandas import DataFrameMapper

df = pd.DataFrame({'col1': [0, 0, 1, 1, 2, 3, 0], 'col2': [0, 0, 1, 1, 2, 3, 0]})
poly = PolynomialFeatures(degree=2, include_bias=False)
mapper = DataFrameMapper([
    (['col1', 'col2'], poly),
], df_out=True)

print(mapper.fit_transform(df).columns)
>>> FutureWarning: Function get_feature_names is deprecated; get_feature_names is deprecated in 1.0 and will be removed in 1.2. Please use get_feature_names_out instead.
  warnings.warn(msg, category=FutureWarning)
>>> ['col1_col2_x0', 'col1_col2_x1', 'col1_col2_x0^2', 'col1_col2_x0 x1', 'col1_col2_x1^2']

print(poly.get_feature_names_out(['col1', 'col2']))
>>> ['col1' 'col2' 'col1^2' 'col1 col2' 'col2^2']

StochasticBoris avatar Aug 05 '22 07:08 StochasticBoris

Hi, let me review the PR this week and merge it. I think it will be a major release, as we will break some existing functionality.

ragrawal avatar Aug 07 '22 06:08 ragrawal

Hi, everyone. Please let me know if there's anything I can do from my end.

falcaopetri avatar Aug 07 '22 14:08 falcaopetri

Hello, I have received your message. Thank you.

hu-minghao avatar Aug 07 '22 14:08 hu-minghao

Hi @falcaopetri, thanks for your contribution. I made a few changes to your PR, but I think we need to rethink the whole alias/prefix/suffix handling. If you are available, we can have a quick chat and discuss how to handle it.

One update: there seems to be a difference in get_feature_names_out between 1.1.0 and 1.1.2 for sklearn.decomposition.PCA.

ragrawal avatar Aug 08 '22 03:08 ragrawal

Hello, I have received your message. Thank you.

hu-minghao avatar Oct 11 '22 07:10 hu-minghao

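Here is a workaround that works for me:

import itertools
from collections.abc import Iterable

import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer, OneHotEncoder
from sklearn_pandas import DataFrameMapper


def _fix_column_names(df: pd.DataFrame, mapper: DataFrameMapper) -> pd.DataFrame: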
    for columns, transformer, kwargs in mapper.built_features:
        if (isinstance(transformer, OneHotEncoder)
                or (isinstance(transformer, Pipeline) and any(isinstance(t, OneHotEncoder) for t in transformer))):
            assert isinstance(columns, Iterable) and not isinstance(columns, str)

            new_names = transformer.get_feature_names_out(columns)

            old_name_prefix = kwargs.get("alias", "_".join(str(c) for c in columns))
            old_names = [f"{old_name_prefix}_{i}" for i in range(len(new_names))]

            df = df.rename(columns=dict(zip(old_names, new_names)))
        elif isinstance(transformer, Pipeline) and isinstance(transformer[0], MultiLabelBinarizer):
            # The way sklearn-pandas infers the names is by iterating the transformers and getting the names and trying
            # to get the features names that are available from the last one that has them. Then, it checks if their
            # length matches the output number of features. However, if the binarizer is followed by feature selection,
            # this process fails as the previous condition is not met. So we handle it manually here.
            assert isinstance(columns, str)

            # `MultiLabelBinarizer` doesn't implement `get_feature_names_out`.
            new_names = [f"{columns}_{c}" for c in transformer[0].classes_]

            # We slice as an iterator and not by passing a slice to `__getitem__` because if the transformer is of type
            # `TransformerPipeline` then it fails.
            for t in itertools.islice(transformer, 1, None):
                new_names = t.get_feature_names_out(new_names)

            old_name_prefix = kwargs.get("alias", columns)
            old_names = [f"{old_name_prefix}_{i}" for i in range(len(new_names))]

            df = df.rename(columns=dict(zip(old_names, new_names)))

    return df
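
The core of the workaround is the mapping from the positional names the mapper emitted to the names the transformer reports. A minimal sketch of just that step follows; `build_rename_map` is a hypothetical helper and the sample names are made up for illustration. It assumes the mapper named the columns {prefix}_{i} in output order.

```python
def build_rename_map(prefix, new_names):
    """Hypothetical helper: pair positional names with descriptive ones."""
    # The i-th output column is assumed to be called "<prefix>_<i>".
    old_names = [f"{prefix}_{i}" for i in range(len(new_names))]
    return dict(zip(old_names, new_names))


# Illustrative names only; real ones would come from get_feature_names_out.
rename_map = build_rename_map("col1_col2", ["col1_0", "col1_1", "col2_0"])
print(rename_map)
```

The resulting dict can be passed to `df.rename(columns=rename_map)`, which leaves any column not present in the mapping unchanged.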

bryant1410 avatar Apr 26 '23 22:04 bryant1410

Hello, I have received your message. Thank you.

hu-minghao avatar Apr 26 '23 22:04 hu-minghao