sklearn-pandas
Use new transformer.get_feature_names_out function
Transformer's get_feature_names is getting deprecated in favor of get_feature_names_out. It will be removed in sklearn 1.2 (see the sklearn v1.0 changelog, scikit-learn/scikit-learn#18444, and, for example, OneHotEncoder.get_feature_names).
This PR:
- Prefers estimator.get_feature_names_out() over estimator.get_feature_names() (a minimal sketch of the fallback follows this list)
- Configures nox to run tests with both scikit-learn 0.23 and 1.0
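A minimal sketch of what the preference/fallback amounts to (the helper name and shape are illustrative, not the PR's actual code):

def _get_output_names(estimator, input_features=None):
    # Prefer the sklearn >= 1.0 API.
    if hasattr(estimator, "get_feature_names_out"):
        return estimator.get_feature_names_out(input_features)
    # Fall back to the old API on scikit-learn 0.23 (deprecated in 1.0, removed in 1.2).
    if hasattr(estimator, "get_feature_names"):
        return estimator.get_feature_names()
    return None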
The change currently breaks the README#Dynamic Columns example. This happens because there is no StandardScaler.get_feature_names in either sklearn 0.23 or 1.0:
- The current example's transformed_names_ is ['x_0', 'x_1', 'x_2', 'x_3', 'petal_0', 'petal_1'].
- In sklearn 1.0, though, there is a StandardScaler.get_feature_names_out, which is used in this PR and therefore produces the output ['x_x0', 'x_x1', 'x_x2', 'x_x3', 'petal_0', 'petal_1'] (see the snippet after this list).
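For context, an illustrative check of where the extra x0/x1/... come from in sklearn >= 1.0 when the scaler was fitted without input feature names:

import numpy as np
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(np.zeros((2, 4)))
# sklearn >= 1.0: with no stored input names this returns ['x0' 'x1' 'x2' 'x3'],
# which the mapper then prefixes with the column alias, giving 'x_x0', 'x_x1', ...
print(scaler.get_feature_names_out())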
FYI: it seems sklearn >= 1.0 requires Python>=3.7.
One thing I noted is that the current implementation is already a little inconsistent within sklearn 0.23:
import pandas as pd
import sklearn.preprocessing
from sklearn_pandas import DataFrameMapper
df = pd.DataFrame({'col1': [0, 0, 1, 1, 2, 3, 0], 'col2': [0, 0, 1, 1, 2, 3, 0]})
mapper = DataFrameMapper([
    (['col1', 'col2'], sklearn.preprocessing.StandardScaler()),
    (['col1', 'col2'], sklearn.preprocessing.OneHotEncoder()),
], df_out=True)
print(mapper.fit_transform(df).columns)
With sklearn 0.23 or 1.0, the output is:
Index(['col1_col2_0', 'col1_col2_1', 'col1_col2_x0_0', 'col1_col2_x0_1',
'col1_col2_x0_2', 'col1_col2_x0_3', 'col1_col2_x1_0', 'col1_col2_x1_1',
'col1_col2_x1_2', 'col1_col2_x1_3'], dtype='object')
Note that the StandardScaler columns get called {name}_{i}, while the OneHotEncoder columns get {name}_{estimator.get_feature_names()}.
Meanwhile, sklearn 1.0 plus this PR outputs:
Index(['col1_col2_x0', 'col1_col2_x1', 'col1_col2_x0_0', 'col1_col2_x0_1',
'col1_col2_x0_2', 'col1_col2_x0_3', 'col1_col2_x1_0', 'col1_col2_x1_1',
'col1_col2_x1_2', 'col1_col2_x1_3'], dtype='object')
(but all these column names are not very helpful, as discussed in #174)
The latest versions of scikit-learn (1.1+) have significantly improved the coverage of transformers that implement get_feature_names_out (#21308 - Implement get_feature_names_out for all estimators). Is there any possibility of revisiting this issue? The current naming behaviour of DataFrameMapper is still not working correctly, as demonstrated by @falcaopetri above.
Having correct output names in a pipeline of sequential mappers is crucial, and the problem gets out of hand quickly when there are multiple columns in the dataset. It is exacerbated with transformers that operate non-independently on multiple columns (such as PolynomialFeatures, which generates interaction features), since this prohibits the use of gen_features (which otherwise performs column naming better, i.e. without listing all columns for each feature); see the example below:
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures
from sklearn_pandas import DataFrameMapper

df = pd.DataFrame({'col1': [0, 0, 1, 1, 2, 3, 0], 'col2': [0, 0, 1, 1, 2, 3, 0]})
poly = PolynomialFeatures(degree=2, include_bias=False)
mapper = DataFrameMapper([
    (['col1', 'col2'], poly),
], df_out=True)
print(mapper.fit_transform(df).columns)
>>> FutureWarning: Function get_feature_names is deprecated; get_feature_names is deprecated in 1.0 and will be removed in 1.2. Please use get_feature_names_out instead.
warnings.warn(msg, category=FutureWarning)
>>> ['col1_col2_x0', 'col1_col2_x1', 'col1_col2_x0^2', 'col1_col2_x0 x1', 'col1_col2_x1^2']
print(poly.get_feature_names_out(['col1', 'col2']))
>>> ['col1' 'col2' 'col1^2' 'col1 col2' 'col2^2']
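For comparison, a rough sketch of the gen_features style mentioned above (StandardScaler is only illustrative; this per-column expansion is exactly what PolynomialFeatures rules out, since it needs all columns at once):

from sklearn.preprocessing import StandardScaler
from sklearn_pandas import DataFrameMapper, gen_features

# gen_features expands into one (column, transformer) entry per column,
# so each output name is prefixed with its own column instead of 'col1_col2_'.
feature_def = gen_features(columns=[['col1'], ['col2']], classes=[StandardScaler])
mapper = DataFrameMapper(feature_def, df_out=True)
print(mapper.fit_transform(df).columns)  # reuses df from the snippet above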
Hi, let me review the PR this week and merge it. I think it will be a major release, as we will break some of the existing functionality.
Hi, everyone. Please let me know if there's anything I can do from my end.
Hello, your message has been received. Thank you.
Hi @falcaopetri -- thanks for your contribution. I made a few changes to your PR, but I think we need to rethink the whole alias/prefix/suffix handling. If you are available, we can have a quick chat and discuss how to handle it.
A few updates: it seems there is a difference in get_feature_names_out between 1.1.0 and 1.1.2 for sklearn.decomposition.PCA.
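A quick way to compare this across environments (illustrative only; run it under each sklearn version and diff the printed names):

import numpy as np
import sklearn
from sklearn.decomposition import PCA

# Fit on arbitrary data; only the reported feature names matter here.
pca = PCA(n_components=2).fit(np.random.rand(10, 4))
print(sklearn.__version__, pca.get_feature_names_out())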
Here is a workaround that works for me:
import itertools
from collections.abc import Iterable

import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer, OneHotEncoder
from sklearn_pandas import DataFrameMapper


def _fix_column_names(df: pd.DataFrame, mapper: DataFrameMapper) -> pd.DataFrame:
    for columns, transformer, kwargs in mapper.built_features:
        if (isinstance(transformer, OneHotEncoder)
                or (isinstance(transformer, Pipeline) and any(isinstance(t, OneHotEncoder) for t in transformer))):
            assert isinstance(columns, Iterable) and not isinstance(columns, str)
            new_names = transformer.get_feature_names_out(columns)
            old_name_prefix = kwargs.get("alias", "_".join(str(c) for c in columns))
            old_names = [f"{old_name_prefix}_{i}" for i in range(len(new_names))]
            df = df.rename(columns=dict(zip(old_names, new_names)))
        elif isinstance(transformer, Pipeline) and isinstance(transformer[0], MultiLabelBinarizer):
            # The way sklearn-pandas infers the names is by iterating the transformers and trying
            # to get the feature names from the last one that has them. Then, it checks if their
            # length matches the output number of features. However, if the binarizer is followed by feature selection,
            # this process fails as the previous condition is not met. So we handle it manually here.
            assert isinstance(columns, str)
            # `MultiLabelBinarizer` doesn't implement `get_feature_names_out`.
            new_names = [f"{columns}_{c}" for c in transformer[0].classes_]
            # We slice as an iterator and not by passing a slice to `__getitem__` because if the transformer is of type
            # `TransformerPipeline` then it fails.
            for t in itertools.islice(transformer, 1, None):
                new_names = t.get_feature_names_out(new_names)
            old_name_prefix = kwargs.get("alias", columns)
            old_names = [f"{old_name_prefix}_{i}" for i in range(len(new_names))]
            df = df.rename(columns=dict(zip(old_names, new_names)))
    return df
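Illustrative usage (the mapper definition is just an example; it assumes the imports above and a DataFrameMapper built with df_out=True):

df = pd.DataFrame({'col1': [0, 1, 2, 1], 'col2': [1, 0, 1, 2]})
mapper = DataFrameMapper([(['col1', 'col2'], OneHotEncoder())], df_out=True)
out = mapper.fit_transform(df)
out = _fix_column_names(out, mapper)
# Positional names such as 'col1_col2_0' get renamed to the encoder's own names where they match.
print(out.columns)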