MLJ.jl
Add fit_transform methods?
Sometimes it's handy to have a combined fit_transform method. For example, right now I'm doing my one-hot encoding once at the beginning of my analysis (not in a pipeline). This is safe to do, since it's a static transformer that doesn't learn from the data. So for this use case, it would be cool if I could do something like this:
```julia
Xencoded = fit_transform(OneHotEncoder(), X)
```
Or perhaps it would make sense to just overload transform?
```julia
Xencoded = transform(OneHotEncoder(), X)
```
Thanks for that. Yes, scikit-learn has this, so you are presumably not the first to wonder about it.
Users can sometimes get confused when there are a lot of methods and many ways to do the same thing. In balancing convenience against simplicity, I feel some reluctance to do this. However, there is also a performance issue, which you did not raise: it is sometimes cheaper to fit and transform in one go, so we may want to allow model-specific implementations of such methods as well.
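The fallback-plus-override idea could be sketched roughly as follows. Note that the simplified `fit`/`transform` protocol and all names here are hypothetical, for illustration only; MLJ's real API works through machines (`machine`, `fit!`, `transform`):

```julia
# Hypothetical sketch of a generic fit_transform fallback plus a
# model-specific override. NOT MLJ's actual machine-based API.
abstract type Transformer end

struct Standardizer <: Transformer end

# `fit` learns parameters from the data (here: mean and population std)
function fit(::Standardizer, x)
    m = sum(x) / length(x)
    s = sqrt(sum(abs2, x .- m) / length(x))
    (mean = m, std = s)
end

transform(fitresult, x) = (x .- fitresult.mean) ./ fitresult.std

# Generic fallback: every transformer gets fit_transform for free...
fit_transform(model::Transformer, x) = transform(fit(model, x), x)

# ...but a model may override it when a fused fit-and-transform is cheaper
# (here, reusing the centred residuals instead of recomputing them):
function fit_transform(::Standardizer, x)
    m = sum(x) / length(x)
    centred = x .- m
    centred ./ sqrt(sum(abs2, centred) / length(x))
end
```

With this design, `fit_transform(Standardizer(), [1.0, 2.0, 3.0])` dispatches to the specialized method, while any other `Transformer` subtype falls back to the generic composition.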
I wonder what others think of the suggestion?
I kind of like the approach of overloading transform for this. (If that's possible given the current method table.) I think it has a couple of advantages:
- It doesn't introduce a new function to the API.
- For some transformers the split between `fit!` and `transform` seems artificial. E.g., what does it mean to fit a `OneHotEncoder`? It seems like `OneHotEncoder` is really just a transformer with no fitting required.
If we are going to do this, then I agree this might be the best option.
It seems the method table would allow this.

Still, I'd like to ponder this a bit more before committing.
I do not see the motivation for removing `fit` for `OneHotEncoder`.

As @CameronBieganek mentions, scikit-learn has the option to `fit_transform`, but it also retains the option to first `fit` and then `transform`.
> For some transformers the split between `fit!` and `transform` seems artificial. E.g., what does it mean to fit a `OneHotEncoder`? It seems like `OneHotEncoder` is really just a transformer with no fitting required.
I would argue otherwise. When you are doing cross-validation, fitting a preprocessor can be very handy. For example, one of the values of a categorical feature might appear only rarely. It can then happen that the folds on which the `OneHotEncoder` is fitted do not contain that particular value, so the encoder never creates the corresponding column; silently dropping that information can do more harm than good.
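This point can be illustrated with a toy encoder (hypothetical, and not MLJ's `OneHotEncoder` implementation): because the set of levels is learned at fit time, transforming a fold that lacks a rare level still produces a column for it, keeping the output width consistent across folds:

```julia
# Toy one-hot encoder (hypothetical; NOT MLJ's implementation), showing why
# fitting matters: levels are learned from the fitted data, so the output
# always has one column per fitted level, even for folds missing a level.
struct ToyOneHot
    levels::Vector{String}
end

fit_onehot(x) = ToyOneHot(sort(unique(x)))

# One Boolean column per *fitted* level, in a fixed order
encode(enc::ToyOneHot, x) = [v == l for v in x, l in enc.levels]

train = ["A", "B", "A", "C"]   # "C" is the rare level
fold  = ["A", "B", "B"]        # this fold never sees "C"

enc = fit_onehot(train)
encode(enc, fold)              # 3 columns (A, B, C); the "C" column is all false
```

Had the encoder instead been "fitted" on `fold` itself (i.e., no separate fit step), the `"C"` column would simply not exist, and downstream models trained on different folds would see incompatible feature sets.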
I do not have any counter-argument to having a `fit_transform` method, but I would vote against removing the ability to fit preprocessors that adapt to the input data when fitted.