Shorthand for getting only the preprocessing part of the TableVectorizer

Open jeromedockes opened this issue 1 year ago • 2 comments

Problem Description

Sometimes we may want to apply the preprocessing/cleaning steps of the TableVectorizer (parsing datetimes, handling pandas extension dtypes, etc.) while handling the actual encoding in separate pipeline steps. This will probably become more relevant when the Recipe (or whatever its name will be) is introduced: we can use it to build exactly the pipeline we want, but we would still like to apply the default cleaning done by the TableVectorizer.

If this sounds like a plausible use case, maybe we could have a shorthand for:

TableVectorizer(
    high_cardinality_transformer="passthrough",
    low_cardinality_transformer="passthrough",
    datetime_transformer="passthrough",
    numeric_transformer="passthrough",
    specific_transformers=(),
)

maybe:

TableSkrubber()
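
Until such a shorthand exists, the same configuration could be wrapped in a small helper. This is only a sketch: `SKRUBBER_KWARGS` and `make_skrubber` are hypothetical names, not part of skrub, and the parameter names are copied from the snippet above, so they may differ in other skrub versions.

```python
# Hypothetical helper, NOT part of skrub: SKRUBBER_KWARGS and make_skrubber
# are names invented for illustration. Parameter names are copied from the
# TableVectorizer call in this issue and may differ across skrub versions.
SKRUBBER_KWARGS = {
    "high_cardinality_transformer": "passthrough",
    "low_cardinality_transformer": "passthrough",
    "datetime_transformer": "passthrough",
    "numeric_transformer": "passthrough",
    "specific_transformers": (),
}


def make_skrubber(**overrides):
    """Build a TableVectorizer that only cleans and leaves encoding to later steps.

    Keyword arguments in `overrides` replace the passthrough defaults.
    """
    # Imported lazily so this module can be loaded even without skrub installed.
    from skrub import TableVectorizer

    return TableVectorizer(**{**SKRUBBER_KWARGS, **overrides})
```

With this, `make_skrubber()` would stand in for the proposed `TableSkrubber()`, and a later pipeline step could take care of the encoding.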

Feature Description

...

jeromedockes, Jun 03 '24 15:06

Some examples of the kind of cleaning the TableVectorizer does:


>>> import pandas as pd
>>> from skrub import TableVectorizer


>>> skrubber = TableVectorizer(
...     high_cardinality_transformer="passthrough",
...     low_cardinality_transformer="passthrough",
...     datetime_transformer="passthrough",
...     numeric_transformer="passthrough",
...     specific_transformers=(),
... )

>>> df = pd.DataFrame({
...     'a': ['2020-01-02', '2020-01-03'],
...     'b': ['2.2', 'nan'],
...     'c': [1.5, pd.NA],
...     'd': [True, False],
...     'e': pd.Series([4.5, 'a'], dtype='category'),
... })
>>> df
            a    b     c      d    e
0  2020-01-02  2.2   1.5   True  4.5
1  2020-01-03  nan  <NA>  False    a
>>> df.dtypes
a      object
b      object
c      object
d        bool
e    category
dtype: object
>>> df['e'].cat.categories
Index([4.5, 'a'], dtype='object')

>>> skrubbed = skrubber.fit_transform(df)
>>> skrubbed
           a    b    c    d    e
0 2020-01-02  2.2  1.5  1.0  4.5
1 2020-01-03  NaN  NaN  0.0    a
>>> skrubbed.dtypes
a    datetime64[ns]
b           float32
c           float32
d           float32
e          category
dtype: object
>>> skrubbed['e'].cat.categories
Index(['4.5', 'a'], dtype='object')

jeromedockes, Jun 03 '24 15:06

I like the name "Skrubber".

GaelVaroquaux, Jun 12 '24 10:06

An object like this would be very useful for trimming tables before running the TableReport on them (see #1257).

rcap107, Mar 19 '25 10:03

Closed by #1266.

Vincent-Maladiere, Apr 01 '25 14:04