skrub icon indicating copy to clipboard operation
skrub copied to clipboard

Discussion - Refactoring the TableVectorizer so that it can take a `Cleaner` as transformer

Open rcap107 opened this issue 2 weeks ago • 3 comments

At the moment, the TableVectorizer and the Cleaner share a lot of the same parameters, which are forwarded to some of the underlying transformers, like drop_null_fraction and datetime_format. Then, the same function _get_preprocessors is called in to get the transformers that are then applied to the input data.

What I find problematic is that the behavior of the Cleaner and that of the TableVectorizer is diverging in various places:

  • The TableVectorizer converts series that contain complex objects like structs or lists to string, while the Cleaner does not do that by default anymore #1789
  • The TableVectorizer converts all numerical features to float32, while the Cleaner keeps the original numerical dtype
  • Depending on #1680, we might want to add a parameter to the Cleaner to avoid parsing numbers with string dtype, while the TableVectorizer would just transform them by default

My idea is to refactor the code of the TableVectorizer so that, along with the feature encoders (high_cardinality, numerical etc), it can also take a Cleaner, so something like

from skrub import Cleaner, TableVectorizer

cleaner = Cleaner(cast_to_str=False, drop_if_constant=True)

tv = TableVectorizer(cleaner=cleaner)

This would remove all the "cleaner-specific" parameters (like drop_null_fraction) from the TableVectorizer, which would instead create a default Cleaner that replicates the behavior with the current TableVectorizer defaults. Then, if the user needs to change specific parameters of the Cleaner, they can create a new instance with the parameters they need, and pass it to the TableVectorizer.

Some other points:

  • If we add more parameters for the cleaner, they won't have to be added to the TableVectorizer (which is what happened with the addition of DropUninformative), simplifying the signature and the docstring.
  • A downside is that this may make the behavior less intuitive. However, we are already telling the users that they need to create their own encoders in order to customize the parameters of the encoders used by the TableVectorizer, so I don't think that adding the Cleaner as encoder would not be much different from the current procedure
  • This refactoring would delimit the scope of the two transformers more clearly, where the Cleaner is only used for sanitizing the data, and the TableVectorizer is dealing only with encoding features. At the moment, the TableVectorizer can do a bit of both, but it does not have all the functions of the Cleaner.

All that said, I don't think this is a high priority issue, rather that it may be something worth considering if we keep on adding features to the Cleaner.

rcap107 avatar Dec 12 '25 12:12 rcap107

From a user standpoint, this separation would make usage more complex.

Our story with TableVectorizer is that it is a one-stop object to add support for complex dataframes on scikit-learn compatible estimators. It is a simple story that can easily be understood, and it is crucial for user experience and adoption. We should stick to this story and resulting functionality

GaelVaroquaux avatar Dec 12 '25 13:12 GaelVaroquaux

From a user standpoint, this separation would make usage more complex.

Our story with TableVectorizer is that it is a one-stop object to add support for complex dataframes on scikit-learn compatible estimators. It is a simple story that can easily be understood, and it is crucial for user experience and adoption. We should stick to this story and resulting functionality

Maybe I shouldn't have mentioned the scope of the two objects.

I am not suggesting that the TableVectorizer should not do the cleaning anymore. My suggestion would not change the user-facing behavior at all, as long as the user only needs the default table vectorizer.

The change would instead have an effect if the user needs to modify the cleaning parameters (which shouldn't happen often for the TableVectorizer).

In that scenario, they'd have to create a new Cleaner object with new parameters, rather than passing those parameters directly to the TableVectorizer. This is consistent with what we are asking in order to modify the other encoders, so I don't think it makes it more complicated than it already is for users that need that degree of flexibility.

Instead of

string = StringEncoder(n_components=10)
TableVectorizer(
	drop_null_fraction=0.3, 
	drop_if_constant=True, 
	datetime_format="%d %m %Y", 
	high_cardinality=string
)

it would become

cleaner = Cleaner(
	drop_null_fraction=0.3, 
	drop_if_constant=True, 
	datetime_format="%d %m %Y"
)
string = StringEncoder(n_components=10)

TableVectorizer(cleaner=cleaner, high_cardinality=string)

We don't have something like this:

TableVectorizer(n_components=10)

but we do have

TableVectorizer(drop_null_fraction=0.3)

which I don't find much better

rcap107 avatar Dec 12 '25 13:12 rcap107

From a user perspective, composition is a real challenge. It requires more abstract thinking, better understanding of object-oriented programming, and to write the code around the corresponding indirections.

Indeed, specifying the n_components in StringEncoder is already quite nasty and it is a major step in complexity to set it.

Hyper-parameter selection on these parameters is even more a problem, but in that sense the DataOps really help... than again, they require more abstraction. But it means that, as often (always?) the bottleneck to what we do is the amount of abstraction and expertise it requires from the user. And this makes me argue for avoiding composition in how we present things to the user

GaelVaroquaux avatar Dec 12 '25 14:12 GaelVaroquaux

And this makes me argue for avoiding composition in how we present things to the user

I feel like "the horse is out of the barn" for that part, given the other transformers. But I get the point about composition being complex.

In general, this issue should be handled with care and we need to have a clearer idea of what we want both the Cleaner and the TableVectorizer to do, and we need to keep an eye on backwards compatibility for the TV.

We also have a bunch of open issues/PRs that would change the behavior of either object so before embarking in a big refactoring we need to deal with those.

rcap107 avatar Dec 15 '25 08:12 rcap107