FEAT - Add an `Apply` transformer that automatically chooses the correct ApplyTo depending on the transformer
This is something that was discussed briefly IRL with @GaelVaroquaux, and it's going to be a feature that will not be implemented in the short term.
The rough idea (needs a lot of refinement) is to have an object that's akin to the tabular_pipeline, in the sense that it takes some kind of transformer (PCA, StringEncoder), then depending on the transformer it executes either ApplyToCols or ApplyToFrame on the columns that are specified by the user.
The idea is letting the user call "Apply" (or apply, as a function?) to their data, without having to figure out whether they should use ApplyToCols or ApplyToFrame.
I think this is the gist of it, please correct me if I'm wrong @GaelVaroquaux
it's a function you need, and it already exists (wrap_transformer) -- you just need to come up with a better name & expose it in the public API
TIL about wrap_transformer. Looking at the code, it uses ApplyToCols if the transformer is a single column transformer, and ApplyToFrame if it does not. I think this is already good enough for the use case we are considering, at least for a first version.
I'm not sure if. @GaelVaroquaux wants a more advanced check. I still think we should make wrap_transformer public to begin with, and possibly extend it later if needed.
If we go with making wrap_transformer public, we should address #1730 before. That way we can refer to the docs of the SingleColumnTransformer to explain what wrap_transformer is checking.
We just need to expose it as an object.
I just don't love the wording "Apply", which I don't find explicit.
How about "OnCols"?
I suggested "Apply" because we're already using "ApplyToCols" and "ApplyToFrame", but I agree it's too generic.
Maybe "ApplyTransformer"? Or "TransformCols"?
I don't think "OnCols" is much clearer in its intent
Maybe "ApplyTransformer"? Or "TransformCols"?
Too long, IMHO
I don't think "OnCols" is much clearer in its intent
I actually would really like to reuse the name "ApplyToCols" which I find quite self explanatory.
Could we / should we rename ApplyToCols to reuse this name? We could shuffle the names around as follows:
- ApplyToCols -> ApplySeparateCols
- ApplyToFrame -> ApplySubFrame
- New object = ApplyToCols
The reason that I'm bringing all this up is that I actually find myself using ApplyToCols a huge amount, and I really like the vibe that I get with it. I want to push it a lot. However, I find that I hit the difficulty of having to explain the nuance between ApplyToCols and ApplyToFrame, which muddies the message.
We just need to expose it as an object.
do you mean a transformer class, and if so why do you think it is needed rather than just a function? a class would add a layer of wrapping around the transformer that does nothing
rather than ApplySeparateCols I would suggest ApplyOnEachCol which is shorter and highlights the fact that the transformation is done independently on each column by using the singular and "each"
Normally I would argue against reusing the existing name ApplyToCols for the new object because the deprecation cycle would be very messy. however in this case it is possible that the new one would actually end up having the same behavior as the existing one in most places where it is already being used
Also if we do decide to actually have a new class rather than a function returning an instance of one of the existing classes, then we don't need to keep the old classes. We can just have ApplyToCols(columnwise=True) and ApplyToCols(columnwise=False) (and the default is auto). But having separate classes is much better because they do not have the same attributes: if column-by-column we need a dictionary of column name to transformer, whereas in the other case we have just 1 transformer
do you mean a transformer class, and if so why do you think it is needed rather than just a function?
Consistency: as a user, I need to remember that for this step, it's a function not an object. People struggle with inconsistency. Regularities help them
rather than ApplySeparateCols I would suggest ApplyOnEachCol which is shorter and highlights the fact that the transformation is done independently on each column by using the singular and "each"
Great! I buy that
Normally I would argue against reusing the existing name ApplyToCols for the new object because the deprecation cycle would be very messy. however in this case it is possible that the new one would actually end up having the same behavior as the existing one in most places where it is already being used
That's my hope too.
But having separate classes is much better because they do not have the same attributes: if column-by-column we need a dictionary of column name to transformer, whereas in the other case we have just 1 transformer
Indeed. But the users will have to find this out a some point or the other, even if a function does the choice for them, as the two cases are not interchangeable.
Summary from the skrub meeting.
- The current ApplyToCols and ApplyToFrame should be renamed, and should remain available to the user in the main skrub namespace.
- The new object should be called ApplyToCols (so there should be a deprecation cycle), and it should still be a transformer (I agree with @GaelVaroquaux in that I prefer a transformer rather than a function, even if the function could be simpler)
- Most of the remaining details depend on the implementation so will need to wait until we start working on the actual code