
Remove duplicated logic in TableVectorizer

rcap107 opened this issue 7 months ago • 2 comments

Right now, the Cleaner uses the function _get_preprocessors to get the list of preprocessors that are used to clean the data. The TableVectorizer should do the same.

The logic is already there; it's just a matter of replacing some code and checking the tests.
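For context, a minimal sketch of the intended structure; the helper name _get_preprocessors comes from the issue text, while the surrounding class and method signatures are simplified assumptions rather than skrub's actual API:

def _get_preprocessors():
    # Shared list of preprocessing transformers (sketch only).
    ...

class Cleaner:
    def fit_transform(self, X, y=None):
        steps = _get_preprocessors()  # already the case today
        ...

class TableVectorizer:
    def _make_pipeline(self):
        steps = _get_preprocessors()  # proposed: reuse the same helper
        ...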

rcap107 avatar May 22 '25 15:05 rcap107

@rcap107 Hi

I went through _table_vectorizer.py and tried enhancing the logic and code; please let me know if I can contribute.

The proposed change:

- Creates a new _make_preprocessing_steps() helper
- Uses this helper in both Cleaner and TableVectorizer
- Eliminates duplicated preprocessing logic
- Prepares your codebase for easier future maintenance

Example:

def _make_preprocessing_steps(
    *,
    cols,
    drop_null_fraction,
    drop_if_unique,
    drop_if_constant,
    datetime_format=None,
    n_jobs=1,
    add_tofloat32=True,
):
    # Input validation always comes first.
    steps = [CheckInputDataFrame()]
    transformer_list = [
        CleanNullStrings(),
        DropUninformative(
            drop_null_fraction=drop_null_fraction,
            drop_if_constant=drop_if_constant,
            drop_if_unique=drop_if_unique,
        ),
        ToDatetime(format=datetime_format),
    ]
    # Numeric casting is optional so that the Cleaner can skip it.
    if add_tofloat32:
        transformer_list.append(ToFloat32())
    transformer_list.extend([CleanCategories(), ToStr()])
    for transformer in transformer_list:
        steps.append(
            wrap_transformer(
                transformer,
                cols,
                allow_reject=True,
                n_jobs=n_jobs,
                columnwise=True,
            )
        )
    return steps
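For illustration, a hypothetical pair of calls to the helper above (flag values are made up; the point is that the two estimators would differ only in the flags they pass):

cleaner_steps = _make_preprocessing_steps(
    cols=s.all(),
    drop_null_fraction=0.9,
    drop_if_unique=False,
    drop_if_constant=True,
    add_tofloat32=False,  # the Cleaner leaves numeric dtypes untouched
)
vectorizer_steps = _make_preprocessing_steps(
    cols=s.all(),
    drop_null_fraction=0.9,
    drop_if_unique=True,
    drop_if_constant=True,
    add_tofloat32=True,  # the TableVectorizer casts to float32
)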

Updated Cleaner.fit_transform

class Cleaner(TransformerMixin, BaseEstimator):
    ...

    def fit_transform(self, X, y=None):
        all_steps = _make_preprocessing_steps(
            cols=s.all(),
            drop_null_fraction=self.drop_null_fraction,
            drop_if_constant=self.drop_if_constant,
            drop_if_unique=self.drop_if_unique,
            datetime_format=self.datetime_format,
            n_jobs=self.n_jobs,
            add_tofloat32=False,
        )
        self.pipeline = make_pipeline(*all_steps)
        result = self.pipeline.fit_transform(X)
        # The first step is CheckInputDataFrame; it exposes the input column names.
        input_names = all_steps[0].feature_names_out
        self.all_processing_steps_ = {col: [] for col in input_names}
        for step in all_steps[1:]:
            for col, transformer in step.transformers.items():
                self.all_processing_steps_[col].append(transformer)
        return result

Updated TableVectorizer._make_pipeline

class TableVectorizer(TransformerMixin, BaseEstimator):
    ...

    def _make_pipeline(self):
        def add_step(steps, transformer, cols, allow_reject=False):
            steps.append(
                wrap_transformer(
                    _check_transformer(transformer),
                    cols,
                    allow_reject=allow_reject,
                    n_jobs=self.n_jobs,
                    columnwise=True,
                )
            )
            return steps[-1]

        cols = s.all() - self._specific_columns

        # Same shared helper as in the Cleaner, with the float32 cast enabled.
        self._preprocessors = _make_preprocessing_steps(
            cols=cols,
            drop_null_fraction=self.drop_null_fraction,
            drop_if_constant=self.drop_if_constant,
            drop_if_unique=self.drop_if_unique,
            datetime_format=self.datetime_format,
            n_jobs=self.n_jobs,
            add_tofloat32=True,
        )

        self._encoders = []
        self._named_encoders = {}
        # Subtracting _created_by(*self._encoders) excludes columns already
        # produced by earlier encoders, so each column is encoded at most once.
        for name, selector in [
            ("numeric", s.numeric()),
            ("datetime", s.any_date()),
            ("low_cardinality", s.cardinality_below(self.cardinality_threshold)),
            ("high_cardinality", s.all()),
        ]:
            self._named_encoders[name] = add_step(
                self._encoders,
                getattr(self, name),
                cols & selector - _created_by(*self._encoders),
            )

        self._specific_transformers = []
        for specific_transformer, specific_cols in self.specific_transformers:
            add_step(self._specific_transformers, specific_transformer, specific_cols)

        self._postprocessors = []
        add_step(
            self._postprocessors,
            ToFloat32(),
            s.all() - _created_by(*self._specific_transformers) - s.categorical(),
            allow_reject=True,
        )
        self._pipeline = make_pipeline(
            *self._preprocessors,
            *self._encoders,
            *self._specific_transformers,
            *self._postprocessors,
        )
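For intuition, the encoder loop above behaves like a first-match dispatch: each later selector only sees columns not already handled. A purely illustrative pure-Python analogy (not skrub code; all names here are made up):

def dispatch(columns, named_selectors):
    # Assign each column to the first selector that matches it.
    assigned = {name: [] for name, _ in named_selectors}
    remaining = list(columns)
    for name, matches in named_selectors:
        picked = [col for col in remaining if matches(col)]
        assigned[name].extend(picked)
        remaining = [col for col in remaining if col not in picked]
    return assigned

selectors = [
    ("numeric", lambda col: col in {"age", "price"}),
    ("high_cardinality", lambda col: True),  # catch-all, like s.all()
]
print(dispatch(["age", "city", "price"], selectors))
# {'numeric': ['age', 'price'], 'high_cardinality': ['city']}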

PB-Mehta avatar May 22 '25 18:05 PB-Mehta

Hi @PB-Mehta, thanks for working on the issue. You can definitely take this up, but the proper way to contribute is by opening a PR with the changes. With a PR we can check the diffs and run all the necessary tests to make sure everything works correctly after we merge.

You can find the guide on how to do it here: https://skrub-data.org/stable/CONTRIBUTING.html#writing-your-first-pull-request

I'll be happy to review the PR when it's ready 👍

rcap107 avatar May 23 '25 09:05 rcap107

I am working on this during the skrub sprint at Probabl, 09-07-2025.

gabrielapgomezji avatar Jul 09 '25 08:07 gabrielapgomezji

Closed by #1497

Vincent-Maladiere avatar Jul 15 '25 14:07 Vincent-Maladiere