Remove duplicated logic in TableVectorizer
Right now, the `Cleaner` uses the function `_get_preprocessors` to get the list of preprocessors that are used to clean the data. The `TableVectorizer` should do the same.
The logic is already there; it's just a matter of replacing some code and checking the tests.
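For context, here is where the duplication shows up from the user's side (an illustrative sketch using the public API; the exact output columns depend on the skrub version):

```python
# Illustrative only: Cleaner and TableVectorizer run the same early cleaning
# steps (null-string cleanup, datetime parsing, ...), which is the logic
# currently duplicated internally between the two classes.
import pandas as pd
from skrub import Cleaner, TableVectorizer

df = pd.DataFrame({"when": ["2024-01-01", "2024-01-02"], "x": ["1.0", "2.5"]})

print(Cleaner().fit_transform(df))          # cleaning only
print(TableVectorizer().fit_transform(df))  # same cleaning, then encoding
```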
Hi @rcap107,
I have gone through `_table_vectorizer.py` and tried enhancing the logic and code. Please let me know if I can contribute.
The proposed change:

- Creates a new `_make_preprocessing_steps()` helper
- Uses this helper in both `Cleaner` and `TableVectorizer`
- Eliminates duplicated preprocessing logic
- Prepares your codebase for easier future maintenance
Example:

```python
def _make_preprocessing_steps(
    *,
    cols,
    drop_null_fraction,
    drop_if_unique,
    drop_if_constant,
    datetime_format=None,
    n_jobs=1,
    add_tofloat32=True,
):
    # Shared preprocessing sequence for both Cleaner and TableVectorizer.
    steps = [CheckInputDataFrame()]
    transformer_list = [
        CleanNullStrings(),
        DropUninformative(
            drop_null_fraction=drop_null_fraction,
            drop_if_constant=drop_if_constant,
            drop_if_unique=drop_if_unique,
        ),
        ToDatetime(format=datetime_format),
    ]
    # Only the TableVectorizer casts numeric columns to float32 at this stage.
    if add_tofloat32:
        transformer_list.append(ToFloat32())
    transformer_list.extend([CleanCategories(), ToStr()])
    for transformer in transformer_list:
        steps.append(
            wrap_transformer(
                transformer,
                cols,
                allow_reject=True,
                n_jobs=n_jobs,
                columnwise=True,
            )
        )
    return steps
```
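To make the intent concrete, here is how the two call sites would differ (hypothetical usage of the proposed helper; `s` is assumed to be `skrub.selectors`, and the argument values shown are illustrative defaults):

```python
# Hypothetical call sites for the proposed helper: the two components share
# every step and differ only in the arguments they pass.
from skrub import selectors as s

cleaner_steps = _make_preprocessing_steps(
    cols=s.all(),
    drop_null_fraction=1.0,
    drop_if_unique=False,
    drop_if_constant=False,
    add_tofloat32=False,  # Cleaner keeps the original numeric dtypes
)
vectorizer_steps = _make_preprocessing_steps(
    cols=s.all(),
    drop_null_fraction=1.0,
    drop_if_unique=False,
    drop_if_constant=False,
    add_tofloat32=True,   # TableVectorizer casts numeric columns to float32
)
```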
Updated `Cleaner.fit_transform`
```python
class Cleaner(TransformerMixin, BaseEstimator):
    ...

    def fit_transform(self, X, y=None):
        all_steps = _make_preprocessing_steps(
            cols=s.all(),
            drop_null_fraction=self.drop_null_fraction,
            drop_if_constant=self.drop_if_constant,
            drop_if_unique=self.drop_if_unique,
            datetime_format=self.datetime_format,
            n_jobs=self.n_jobs,
            add_tofloat32=False,  # the Cleaner does not cast to float32
        )
        self.pipeline = make_pipeline(*all_steps)
        result = self.pipeline.fit_transform(X)
        # Record, per input column, which transformers were applied to it.
        input_names = all_steps[0].feature_names_out
        self.all_processing_steps_ = {col: [] for col in input_names}
        for step in all_steps[1:]:
            for col, transformer in step.transformers.items():
                self.all_processing_steps_[col].append(transformer)
        return result
```
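A quick behavioral check (a sketch using the public API, with parameter names taken from this issue): the refactor should leave `Cleaner`'s observable output unchanged:

```python
# Sketch of a sanity check: the Cleaner's behavior should be identical
# before and after the refactor.
import pandas as pd
from skrub import Cleaner

df = pd.DataFrame(
    {
        "constant": ["a", "a", "a"],  # dropped when drop_if_constant=True
        "date": ["2024-01-01", "2024-01-02", "2024-01-03"],
    }
)
cleaned = Cleaner(drop_if_constant=True).fit_transform(df)
print(cleaned.columns)  # expect only the parsed "date" column to remain
```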
Updated `TableVectorizer._make_pipeline`
```python
class TableVectorizer(TransformerMixin, BaseEstimator):
    ...

    def _make_pipeline(self):
        def add_step(steps, transformer, cols, allow_reject=False):
            steps.append(
                wrap_transformer(
                    _check_transformer(transformer),
                    cols,
                    allow_reject=allow_reject,
                    n_jobs=self.n_jobs,
                    columnwise=True,
                )
            )
            return steps[-1]

        cols = s.all() - self._specific_columns

        # Preprocessing now comes from the shared helper instead of being
        # rebuilt inline (this is the de-duplication).
        self._preprocessors = _make_preprocessing_steps(
            cols=cols,
            drop_null_fraction=self.drop_null_fraction,
            drop_if_constant=self.drop_if_constant,
            drop_if_unique=self.drop_if_unique,
            datetime_format=self.datetime_format,
            n_jobs=self.n_jobs,
            add_tofloat32=True,
        )

        self._encoders = []
        self._named_encoders = {}
        for name, selector in [
            ("numeric", s.numeric()),
            ("datetime", s.any_date()),
            ("low_cardinality", s.cardinality_below(self.cardinality_threshold)),
            ("high_cardinality", s.all()),
        ]:
            # Each encoder only sees columns not already claimed by an earlier one.
            self._named_encoders[name] = add_step(
                self._encoders,
                getattr(self, name),
                cols & selector - _created_by(*self._encoders),
            )

        self._specific_transformers = []
        for specific_transformer, specific_cols in self.specific_transformers:
            add_step(self._specific_transformers, specific_transformer, specific_cols)

        self._postprocessors = []
        add_step(
            self._postprocessors,
            ToFloat32(),
            s.all() - _created_by(*self._specific_transformers) - s.categorical(),
            allow_reject=True,
        )
        self._pipeline = make_pipeline(
            *self._preprocessors,
            *self._encoders,
            *self._specific_transformers,
            *self._postprocessors,
        )
```
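And an end-to-end smoke test for `TableVectorizer` (public API; the exact output columns depend on the configured encoders):

```python
# Smoke test: the rebuilt pipeline should still vectorize a mixed-type frame
# into an all-numeric output.
import pandas as pd
from skrub import TableVectorizer

df = pd.DataFrame(
    {
        "city": ["Paris", "London", "Paris"],
        "temp": [12.1, 9.4, 13.0],
        "when": ["2024-01-01", "2024-01-02", "2024-01-03"],
    }
)
out = TableVectorizer().fit_transform(df)
print(out.shape)  # encoded city + float32 temp + expanded datetime features
```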
Hi @PB-Mehta, thanks for working on the issue. You can definitely take this up, but the proper way to contribute is by opening a PR with the changes. With PRs we can check the diffs and run all the necessary testing to make sure everything works correctly after we merge.
You can find the guide on how to do it here: https://skrub-data.org/stable/CONTRIBUTING.html#writing-your-first-pull-request
I'll be happy to review the PR when it's ready 👍
I am working on this during the skrub sprint at Probabl, 09-07-2025.
Closed by #1497