skrub icon indicating copy to clipboard operation
skrub copied to clipboard

FEAT - Add an `Apply` transformer that automatically chooses the correct ApplyTo depending on the transformer

Open rcap107 opened this issue 1 month ago • 10 comments

This is something that was discussed briefly IRL with @GaelVaroquaux, and it's going to be a feature that will not be implemented in the short term.

The rough idea (needs a lot of refinement) is to have an object that's akin to the tabular_pipeline, in the sense that it takes some kind of transformer (PCA, StringEncoder), then depending on the transformer it executes either ApplyToCols or ApplyToFrame on the columns that are specified by the user.

The idea is letting the user call "Apply" (or apply, as a function?) to their data, without having to figure out whether they should use ApplyToCols or ApplyToFrame.

I think this is the gist of it, please correct me if I'm wrong @GaelVaroquaux

rcap107 avatar Nov 14 '25 09:11 rcap107

it's a function you need, and it already exists (wrap_transformer) -- you just need to come up with a better name & expose it in the public API

jeromedockes avatar Nov 14 '25 09:11 jeromedockes

TIL about wrap_transformer. Looking at the code, it uses ApplyToCols if the transformer is a single column transformer, and ApplyToFrame if it does not. I think this is already good enough for the use case we are considering, at least for a first version.

I'm not sure if. @GaelVaroquaux wants a more advanced check. I still think we should make wrap_transformer public to begin with, and possibly extend it later if needed.

If we go with making wrap_transformer public, we should address #1730 before. That way we can refer to the docs of the SingleColumnTransformer to explain what wrap_transformer is checking.

rcap107 avatar Nov 14 '25 09:11 rcap107

We just need to expose it as an object.

I just don't love the wording "Apply", which I don't find explicit.

How about "OnCols"?

GaelVaroquaux avatar Nov 14 '25 12:11 GaelVaroquaux

I suggested "Apply" because we're already using "ApplyToCols" and "ApplyToFrame", but I agree it's too generic.

Maybe "ApplyTransformer"? Or "TransformCols"?

I don't think "OnCols" is much clearer in its intent

rcap107 avatar Nov 14 '25 16:11 rcap107

Maybe "ApplyTransformer"? Or "TransformCols"?

Too long, IMHO

I don't think "OnCols" is much clearer in its intent

I actually would really like to reuse the name "ApplyToCols" which I find quite self explanatory.

Could we / should we rename ApplyToCols to reuse this name? We could shuffle the names around as follows:

  • ApplyToCols -> ApplySeparateCols
  • ApplyToFrame -> ApplySubFrame
  • New object = ApplyToCols

The reason that I'm bringing all this up is that I actually find myself using ApplyToCols a huge amount, and I really like the vibe that I get with it. I want to push it a lot. However, I find that I hit the difficulty of having to explain the nuance between ApplyToCols and ApplyToFrame, which muddies the message.

GaelVaroquaux avatar Nov 14 '25 17:11 GaelVaroquaux

We just need to expose it as an object.

do you mean a transformer class, and if so why do you think it is needed rather than just a function? a class would add a layer of wrapping around the transformer that does nothing

jeromedockes avatar Nov 16 '25 07:11 jeromedockes

rather than ApplySeparateCols I would suggest ApplyOnEachCol which is shorter and highlights the fact that the transformation is done independently on each column by using the singular and "each"

Normally I would argue against reusing the existing name ApplyToCols for the new object because the deprecation cycle would be very messy. however in this case it is possible that the new one would actually end up having the same behavior as the existing one in most places where it is already being used

Also if we do decide to actually have a new class rather than a function returning an instance of one of the existing classes, then we don't need to keep the old classes. We can just have ApplyToCols(columnwise=True) and ApplyToCols(columnwise=False) (and the default is auto). But having separate classes is much better because they do not have the same attributes: if column-by-column we need a dictionary of column name to transformer, whereas in the other case we have just 1 transformer

jeromedockes avatar Nov 16 '25 08:11 jeromedockes

do you mean a transformer class, and if so why do you think it is needed rather than just a function?

Consistency: as a user, I need to remember that for this step, it's a function not an object. People struggle with inconsistency. Regularities help them

GaelVaroquaux avatar Nov 16 '25 21:11 GaelVaroquaux

rather than ApplySeparateCols I would suggest ApplyOnEachCol which is shorter and highlights the fact that the transformation is done independently on each column by using the singular and "each"

Great! I buy that

Normally I would argue against reusing the existing name ApplyToCols for the new object because the deprecation cycle would be very messy. however in this case it is possible that the new one would actually end up having the same behavior as the existing one in most places where it is already being used

That's my hope too.

But having separate classes is much better because they do not have the same attributes: if column-by-column we need a dictionary of column name to transformer, whereas in the other case we have just 1 transformer

Indeed. But the users will have to find this out a some point or the other, even if a function does the choice for them, as the two cases are not interchangeable.

GaelVaroquaux avatar Nov 16 '25 21:11 GaelVaroquaux

Summary from the skrub meeting.

  • The current ApplyToCols and ApplyToFrame should be renamed, and should remain available to the user in the main skrub namespace.
  • The new object should be called ApplyToCols (so there should be a deprecation cycle), and it should still be a transformer (I agree with @GaelVaroquaux in that I prefer a transformer rather than a function, even if the function could be simpler)
  • Most of the remaining details depend on the implementation so will need to wait until we start working on the actual code

rcap107 avatar Nov 17 '25 10:11 rcap107