
Add fit_transform methods?

Open CameronBieganek opened this issue 5 years ago • 4 comments

Sometimes it's handy to have a combined fit_transform method. For example, right now I'm doing my one-hot encoding once at the beginning of my analysis (not in a pipeline). This is safe to do, since it's a static transformer that doesn't learn from the data. So for this use case, it would be cool if I could do something like this:

Xencoded = fit_transform(OneHotEncoder(), X)

Or perhaps it would make sense to just overload transform?

Xencoded = transform(OneHotEncoder(), X)
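For context, such a convenience could be a thin wrapper over MLJ's existing `machine`/`fit!`/`transform` workflow. A minimal sketch (the `fit_transform` function itself is the hypothetical part, not MLJ API):

```julia
using MLJ  # assumes MLJ is installed; machine, fit! and transform are MLJ functions

# Hypothetical convenience wrapper (not part of MLJ's API): bind the model
# to the data, fit it, and transform that same data in one call.
function fit_transform(model, X; verbosity=0)
    mach = machine(model, X)
    fit!(mach, verbosity=verbosity)
    return transform(mach, X)
end

# Usage, per the suggestion above:
# Xencoded = fit_transform(OneHotEncoder(), X)
```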

CameronBieganek avatar May 22 '20 16:05 CameronBieganek

Thanks for that. Yes, scikit-learn has this, so you are presumably not the first to wonder about it.

Users can sometimes get confused when there are a lot of methods and many ways to do the same thing. In balancing convenience with simplicity, I feel some reluctance to do this. However, there is also a performance angle, which you did not raise: it is sometimes cheaper to fit and transform in one go, so we may want to allow model-specific implementations of such a method as well.

I wonder what others think of the suggestion?

ablaom avatar May 26 '20 00:05 ablaom

I kind of like the approach of overloading transform for this. (If that's possible given the current method table.) I think it has a couple of advantages:

  • It doesn't introduce a new function to the API.
  • For some transformers the split between fit! and transform seems artificial. E.g., what does it mean to fit a OneHotEncoder? It seems like OneHotEncoder is really just a transformer with no fitting required.

CameronBieganek avatar May 26 '20 19:05 CameronBieganek

If we are going to do this, then I agree this might be the best option.

It seems the method table would allow this.

Still, I'd like to ponder this a bit more before committing.

ablaom avatar Jun 02 '20 01:06 ablaom

I do not see the motivation for removing the fit step for OneHotEncoder.

As @CameronBieganek mentions, sklearn has the option to fit_transform, but it also retains the option to first fit and then transform.

> For some transformers the split between fit! and transform seems artificial. E.g., what does it mean to fit a OneHotEncoder? It seems like OneHotEncoder is really just a transformer with no fitting required.

I would argue otherwise. When you are doing cross-validation, fitting a preprocessor can be very handy. For example, one of the values of a categorical feature might appear only rarely. It can happen that the folds on which OneHotEncoder is fitted do not contain that particular value, so the fitted encoder simply skips creating a new variable that might have done more harm than good.
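To illustrate the point that fitting matters even for one-hot encoding, here is a self-contained sketch in plain Julia (no MLJ; `fit_onehot` and `transform_onehot` are made-up names for this illustration). The fit step records the levels seen in the training fold, so every fold's output has columns of the same width and order, even when a level is absent from a given fold:

```julia
# Minimal one-hot encoder with an explicit fit step: it remembers the levels
# seen at fit time and encodes all later data against that fixed set.
struct OneHotFit
    levels::Vector{String}
end

# "Fitting" here just records the sorted unique levels of the training data.
fit_onehot(x) = OneHotFit(sort(unique(x)))

# Encode against the *fitted* levels: a level absent from this fold still
# gets its column (all zeros), keeping column count and order stable.
transform_onehot(f::OneHotFit, x) = [v == l ? 1 : 0 for v in x, l in f.levels]

train = ["a", "b", "a", "c"]
test  = ["a", "c"]            # "b" never appears in this fold

enc = fit_onehot(train)
M = transform_onehot(enc, test)
size(M)  # (2, 3): the "b" column is retained despite "b" being absent
```

Without the fit step, encoding each fold independently would yield matrices of different widths, breaking any downstream model.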

I do not have any counterargument against adding a fit_transform method, but I would vote against removing the capability to fit preprocessors that adapt to the input data when fitted.

davidbp avatar Dec 21 '20 16:12 davidbp