skrub Shorthand for getting only the preprocessing part of the TableVectorizer

Problem Description

Sometimes we may want to apply the preprocessing/cleaning steps of the TableVectorizer (parsing datetimes, handling pandas extension dtypes, etc.) while handling the actual encoding in separate pipeline steps. This will probably become more relevant when the Recipe (or whatever its name will be) is introduced: we can use it to build exactly the pipeline we want, but we would still like to apply the default cleaning done by the TableVectorizer.

If this sounds like a plausible use case, maybe we could have a shorthand for:

TableVectorizer(
    high_cardinality_transformer="passthrough",
    low_cardinality_transformer="passthrough",
    datetime_transformer="passthrough",
    numeric_transformer="passthrough",
    specific_transformers=(),
)

maybe something like:

TableSkrubber()
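
As a stopgap, the passthrough arguments could be bundled in a small factory. The sketch below is only an illustration: `make_skrubber` and `PASSTHROUGH_PARAMS` are hypothetical names, not part of skrub, and the parameter names are the ones used in this issue, so they may differ in other skrub versions. The vectorizer class is passed in as an argument so the helper does not hard-code an import.

```python
# Hypothetical helper, not part of skrub. The keyword names below are the
# TableVectorizer parameters quoted in this issue; adjust them if your skrub
# version uses different names.
PASSTHROUGH_PARAMS = {
    "high_cardinality_transformer": "passthrough",
    "low_cardinality_transformer": "passthrough",
    "datetime_transformer": "passthrough",
    "numeric_transformer": "passthrough",
    "specific_transformers": (),
}


def make_skrubber(vectorizer_cls, **overrides):
    """Instantiate ``vectorizer_cls`` with every encoder set to passthrough.

    ``overrides`` replace individual entries of ``PASSTHROUGH_PARAMS``,
    for example to re-enable a single encoder.
    """
    params = {**PASSTHROUGH_PARAMS, **overrides}
    return vectorizer_cls(**params)
```

With skrub installed, `make_skrubber(TableVectorizer)` would then be equivalent to the explicit call above, assuming the installed version accepts these parameter names.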

Feature Description

...

Jun 03 '24 15:06 jeromedockes

Some examples of the kind of cleaning the TableVectorizer does:


>>> import pandas as pd
>>> from skrub import TableVectorizer


>>> skrubber = TableVectorizer(
...     high_cardinality_transformer="passthrough",
...     low_cardinality_transformer="passthrough",
...     datetime_transformer="passthrough",
...     numeric_transformer="passthrough",
...     specific_transformers=(),
... )

>>> df = pd.DataFrame({
...     'a': ['2020-01-02', '2020-01-03'],
...     'b': ['2.2', 'nan'],
...     'c': [1.5, pd.NA],
...     'd': [True, False],
...     'e': pd.Series([4.5, 'a'], dtype='category'),
... })
>>> df
            a    b     c      d    e
0  2020-01-02  2.2   1.5   True  4.5
1  2020-01-03  nan  <NA>  False    a
>>> df.dtypes
a      object
b      object
c      object
d        bool
e    category
dtype: object
>>> df['e'].cat.categories
Index([4.5, 'a'], dtype='object')

>>> skrubbed = skrubber.fit_transform(df)
>>> skrubbed
           a    b    c    d    e
0 2020-01-02  2.2  1.5  1.0  4.5
1 2020-01-03  NaN  NaN  0.0    a
>>> skrubbed.dtypes
a    datetime64[ns]
b           float32
c           float32
d           float32
e          category
dtype: object
>>> skrubbed['e'].cat.categories
Index(['4.5', 'a'], dtype='object')

Jun 03 '24 15:06 jeromedockes

I like the name "Skrubber"

Jun 12 '24 10:06 GaelVaroquaux

An object like this would be very useful to trim tables before running the TableReport on them (#1257).

Mar 19 '25 10:03 rcap107

Closed by #1266.

Apr 01 '25 14:04 Vincent-Maladiere