TableVectorizer's "numerical_transformer" does not accept Pipelines
### Describe the bug
As per the documentation of TableVectorizer:
Transformer used on numerical features. Can either be a transformer object instance (e.g. StandardScaler), a Pipeline containing the preprocessing steps, ‘drop’ for dropping the columns, ‘remainder’ for applying remainder, or ‘passthrough’ to return the unencoded columns (default).
So I would assume that I can pass a Pipeline.
### Steps/Code to Reproduce
```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from skrub import TableVectorizer

# Get data
X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# Numerical transformer. There are no NaNs in the data, but it could be any pipeline
num_prep = make_pipeline(SimpleImputer(add_indicator=True), StandardScaler())

# TableVectorizer
encoder = TableVectorizer(numerical_transformer=num_prep)

# Model
clf = make_pipeline(encoder, LogisticRegression())
clf.fit(X, y)
```
### Expected Results
Should fit the data
### Actual Results
ValueError: 'transformer' must be an instance of sklearn.base.TransformerMixin, 'remainder' or 'passthrough'. Got transformer=Pipeline(steps=[('simpleimputer', SimpleImputer(add_indicator=True)),
('standardscaler', StandardScaler())]).
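For context, the message above is consistent with a strict `isinstance` check against `TransformerMixin`. scikit-learn's `Pipeline` exposes `fit` and `transform` but does not inherit from that mixin, so such a check rejects it. A minimal sketch to confirm this, assuming only that scikit-learn is installed (the variable names are illustrative and this is not skrub's actual validation code):

```python
from sklearn.base import TransformerMixin
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# A pipeline whose final step is a transformer
pipe = make_pipeline(StandardScaler())

# Pipeline does not subclass TransformerMixin, so a strict type check fails
is_mixin = isinstance(pipe, TransformerMixin)

# ...yet it still offers the transformer API when checked by duck typing
has_transformer_api = hasattr(pipe, "fit") and hasattr(pipe, "transform")

print(f"isinstance(pipe, TransformerMixin): {is_mixin}")
print(f"has fit and transform: {has_transformer_api}")
```

A check based on the presence of `fit` and `transform` would accept both plain transformers and Pipelines. Whether the eventual fix takes that approach is not stated in this thread.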
### Versions
```shell
System:
python: 3.12.1 | packaged by conda-forge | (main, Dec 23 2023, 08:01:35) [Clang 16.0.6 ]
executable: /opt/homebrew/Caskroom/miniforge/base/envs/test_skrub/bin/python
machine: macOS-14.3-arm64-arm-64bit
Python dependencies:
sklearn: 1.4.0
pip: 23.3.2
setuptools: 69.0.3
numpy: 1.26.3
scipy: 1.12.0
Cython: None
pandas: 2.2.0
matplotlib: None
joblib: 1.3.2
threadpoolctl: 3.2.0
Built with OpenMP: True
threadpoolctl info:
user_api: blas
internal_api: openblas
num_threads: 8
prefix: libopenblas
filepath: /opt/homebrew/Caskroom/miniforge/base/envs/test_skrub/lib/libopenblas.0.dylib
version: 0.3.26
threading_layer: openmp
architecture: VORTEX
user_api: openmp
internal_api: openmp
num_threads: 8
prefix: libomp
filepath: /opt/homebrew/Caskroom/miniforge/base/envs/test_skrub/lib/libomp.dylib
version: None
0.1.0
```
Thanks a lot for reporting this! We'll make sure to address it in #877.
Here is a reproducer, to be added to our test suite:
```python
import pandas as pd
from skrub import TableVectorizer
from sklearn.pipeline import make_pipeline

df = pd.DataFrame(dict(a=[1.1, 2.2]))
tv = TableVectorizer(numerical_transformer=make_pipeline('passthrough'))
tv.fit(df)
```
Fixed by #902.