mljar-supervised

Make AutoML scikit-learn compatible

Open diogosilva30 opened this issue 3 years ago • 18 comments

I think it would be interesting to add a feature to export the best model with a scikit-learn wrapper. This would allow integrating the best AutoML model into a scikit-learn workflow. I think most of the models that AutoML uses are already from scikit-learn, and those that aren't do provide scikit-learn wrappers, so I think it would be easy to implement. Is there anything that makes this feature 'impossible' to implement?

diogosilva30 avatar Aug 26 '20 03:08 diogosilva30

Hey @spamz23! You are right. It is almost implemented in AutoML (which is a wrapper for all algorithms). predict_proba may be missing in AutoML, but that is because the predict method returns both probabilities and labels. This can be fixed easily. Would you like to add it (predict_proba in AutoML)? I will help you.

Could you give a minimal code example of how you would like to use AutoML in a scikit-learn workflow?

pplonski avatar Aug 26 '20 13:08 pplonski

In order to provide scikit-learn compatibility, AutoML is missing:

  1. Function predict_proba()
  2. Function score() to calculate accuracy
  3. Function get_params() and set_params()

Suggestion:

  • fit() function should accept list or numpy.ndarray or pandas.DataFrame
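A minimal sketch of what the missing methods could look like, written against a hypothetical ToyClassifier rather than the real AutoML class. It uses plain Python lists so it runs without any dependencies; the smoothing parameter and the private attribute names are assumptions made for illustration only.

```python
from collections import Counter


class ToyClassifier:
    """Hypothetical stand-in for AutoML; predicts from class frequencies only."""

    def __init__(self, smoothing=0.0):
        self.smoothing = smoothing

    def get_params(self, deep=True):
        # scikit-learn expects constructor arguments to be returned by name.
        return {"smoothing": self.smoothing}

    def set_params(self, **params):
        for name, value in params.items():
            if name not in self.get_params():
                raise ValueError(f"Invalid parameter {name!r}")
            setattr(self, name, value)
        return self

    def fit(self, X, y):
        counts = Counter(y)
        self._labels = sorted(counts)
        total = len(y) + self.smoothing * len(self._labels)
        self._priors = [(counts[c] + self.smoothing) / total for c in self._labels]
        return self

    def predict_proba(self, X):
        # One row of class probabilities per sample, in sorted label order.
        return [list(self._priors) for _ in X]

    def predict(self, X):
        # Labels only; probabilities are served by predict_proba.
        best = self._labels[self._priors.index(max(self._priors))]
        return [best for _ in X]

    def score(self, X, y):
        # Mean accuracy, the default score of scikit-learn classifiers.
        predictions = self.predict(X)
        return sum(p == t for p, t in zip(predictions, y)) / len(y)


X = [[0.1], [0.4], [0.9], [1.3]]
y = ["a", "a", "a", "b"]
clf = ToyClassifier().fit(X, y)
```

Splitting labels (predict) from probabilities (predict_proba) is what lets scikit-learn utilities pick the output they need.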

diogosilva30 avatar Aug 26 '20 15:08 diogosilva30

@pplonski If you want I can start working on rewriting the AutoML wrapper to support the above functions.

diogosilva30 avatar Aug 26 '20 15:08 diogosilva30

@spamz23 sounds good! What do you mean by AutoML wrapper? I think it will be enough to add the missing methods to the AutoML class.

Regarding fit(), if the data is passed without column names, then column names should be added (they are needed in the Explanation step). We can generate feature names: feature_{index}.
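A minimal sketch of the feature_{index} naming idea for row-oriented input without a header. The helper name make_feature_names is hypothetical, and starting the index at zero is an assumption, since only the naming pattern is given here.

```python
def make_feature_names(X):
    """Return generated column names for row-oriented data without a header."""
    if len(X) == 0:
        raise ValueError("X must contain at least one row")
    n_columns = len(X[0])
    # Zero-based indexing is an assumption; only the pattern is specified.
    return [f"feature_{index}" for index in range(n_columns)]


rows = [[5.1, 3.5, 1.4], [4.9, 3.0, 1.3]]
names = make_feature_names(rows)
```

The generated names could then be attached as column labels before the data reaches the steps that need them.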

pplonski avatar Aug 26 '20 16:08 pplonski

@pplonski I mean adding the missing methods, not creating a new wrapper.

diogosilva30 avatar Aug 26 '20 16:08 diogosilva30

@spamz23 I assigned this issue to you! If you need any help, I'm happy to help. You can join mljar Slack channel.

Please check contributing docs. There is a Contributor License Agreement form to be filled.

pplonski avatar Aug 26 '20 18:08 pplonski

Ok, I'm working on it. Can I use the numpy documentation standard?

diogosilva30 avatar Aug 26 '20 19:08 diogosilva30

Great! Why not use the scikit-learn docs standard? I plan to set up a docs website with mkdocs-material. Which standard, numpy or scikit-learn, will be better for generating API docs? Do you have any idea?

pplonski avatar Aug 26 '20 20:08 pplonski

Hey! Scikit-Learn uses numpy docstring. Numpy docstring is also interpreted by Sphinx.

diogosilva30 avatar Aug 26 '20 20:08 diogosilva30

Great, let's use it!

pplonski avatar Aug 26 '20 20:08 pplonski

Hey! I found out that in order to be scikit-learn compatible, AutoML must be divided into AutoMLClassifier and AutoMLRegressor. You can consult the rules it must follow to be compatible at https://scikit-learn.org/stable/developers/develop.html @pplonski do you consent to this change?

diogosilva30 avatar Aug 27 '20 03:08 diogosilva30

No, it doesn't - from what I read. It needs to implement a specific interface, which should act in a different way for classifiers and regressors.

Can you point to where you found this information (exact place)? I'm searching and I can't find it (maybe I missed something).

pplonski avatar Aug 27 '20 07:08 pplonski

I think it is easy to get lost in the scikit-learn docs. Let's be pragmatic. I can think of one use case where the scikit-learn interface will be needed: Pipeline. For example, the user would like to use some custom data preparation before calling AutoML. Let's check which methods are needed in Pipeline and make AutoML compatible with it. That should be easier than digging through the docs.
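A rough sketch of the calling pattern a pipeline applies to its steps: intermediate steps are fitted and asked to transform, while the final step only needs fit and predict. MiniPipeline, Scale and RecordingEstimator are hypothetical classes that imitate this pattern in plain Python; this is not scikit-learn's code.

```python
class RecordingEstimator:
    """Hypothetical final step that logs which methods are called on it."""

    def __init__(self):
        self.calls = []

    def fit(self, X, y):
        self.calls.append("fit")
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        self.calls.append("predict")
        return [self.mean_ for _ in X]


class Scale:
    """Hypothetical custom preparation step: divide every value by a constant."""

    def __init__(self, factor=10.0):
        self.factor = factor

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [[value / self.factor for value in row] for row in X]


class MiniPipeline:
    """Simplified imitation of the calling pattern only."""

    def __init__(self, transformers, estimator):
        self.transformers = transformers
        self.estimator = estimator

    def fit(self, X, y):
        # Each intermediate step is fitted, then transforms the data.
        for step in self.transformers:
            X = step.fit(X, y).transform(X)
        self.estimator.fit(X, y)
        return self

    def predict(self, X):
        # At prediction time intermediate steps only transform.
        for step in self.transformers:
            X = step.transform(X)
        return self.estimator.predict(X)


final_step = RecordingEstimator()
pipe = MiniPipeline([Scale()], final_step).fit([[10.0], [20.0]], [1.0, 3.0])
predictions = pipe.predict([[30.0]])
```

The real Pipeline additionally relies on get_params and set_params of its steps, for example when it is cloned or tuned in a grid search.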

What do you think @spamz23 ?

pplonski avatar Aug 27 '20 08:08 pplonski

Hey @pplonski! I think it doesn't specify that directly. But take a look at the following section:

Some common functionality depends on the kind of estimator passed. For example, cross-validation in model_selection.GridSearchCV and model_selection.cross_val_score defaults to being stratified when used on a classifier, but not otherwise. Similarly, scorers for average precision that take a continuous prediction need to call decision_function for classifiers, but predict for regressors. This distinction between classifiers and regressors is implemented using the _estimator_type attribute, which takes a string value. It should be "classifier" for classifiers and "regressor" for regressors and "clusterer" for clustering methods, to work as expected. Inheriting from ClassifierMixin, RegressorMixin or ClusterMixin will set the attribute automatically. When a meta-estimator needs to distinguish among estimator types, instead of checking _estimator_type directly, helpers like base.is_classifier should be used.

Scikit-learn has a test to check if an estimator is fully compatible. Right now it is still failing, and I think it will keep failing until '_estimator_type' is set.

I think the easiest way to implement this is to make AutoML a base class. Then we can create two classes (AutoMLClassifier and AutoMLRegressor) that inherit from the base class. I think this would also make the code more scalable and readable. What are your thoughts?

diogosilva30 avatar Aug 27 '20 12:08 diogosilva30

@spamz23 good catch. Hmmm, I think we can just define _estimator_type and set it to the proper value once ml_task is known. No inheritance needed.
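A minimal sketch of this single-class idea, assuming the task is stored as a string in ml_task. AutoMLSketch, the task names in the mapping, and the task-guessing heuristic are all hypothetical and only illustrate how _estimator_type could follow the task without inheritance.

```python
class AutoMLSketch:
    """Hypothetical single-class design; not the actual mljar-supervised code."""

    # Task names are assumed for illustration.
    _TASK_TO_ESTIMATOR_TYPE = {
        "binary_classification": "classifier",
        "multiclass_classification": "classifier",
        "regression": "regressor",
    }

    def __init__(self, ml_task="auto"):
        self.ml_task = ml_task

    @property
    def _estimator_type(self):
        # Unknown until the task is resolved, so this is None before that.
        return self._TASK_TO_ESTIMATOR_TYPE.get(self.ml_task)

    def fit(self, X, y):
        if self.ml_task == "auto":
            self.ml_task = self._guess_task(y)
        return self

    @staticmethod
    def _guess_task(y):
        # Crude heuristic for the sketch only.
        n_unique = len(set(y))
        if n_unique == 2:
            return "binary_classification"
        if all(isinstance(value, (int, str)) for value in y) and n_unique <= 20:
            return "multiclass_classification"
        return "regression"


automl = AutoMLSketch()
type_before_fit = automl._estimator_type
automl.fit([[0], [1], [2], [3]], [0, 1, 1, 0])
type_after_fit = automl._estimator_type
regressor_type = AutoMLSketch().fit([[0], [1], [2]], [0.5, 1.7, 2.9])._estimator_type
```

One consequence of this design: anything that inspects the attribute before fit sees None when the task is left on "auto", so the estimator is not yet recognized as either a classifier or a regressor.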

I would like to keep AutoML as a single class. For me, it is a part of "Auto ML" to distinguish what to do, regression or classification.

pplonski avatar Aug 27 '20 12:08 pplonski

Is MLJar sklearn compatible now? I saw that when I call sklearn.calibration.is_classifier(automl) it is false because there's no _estimator_type attribute defined.

offchan42 avatar Nov 23 '21 20:11 offchan42

Looks like it was compatible and after sklearn 1.0 release some attributes might be missing.

pplonski avatar Nov 23 '21 20:11 pplonski

Another attribute that is missing is the classes_ attribute. I'm trying to call sklearn.calibration.CalibrationDisplay.from_estimator(automl, X_train, y_train, n_bins=10, name='train'), which basically just plots the reliability curve of a classifier.
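A small sketch of why classes_ matters to such tools: it tells the caller which column of predict_proba belongs to which label. LabelAwareSketch and positive_class_column are hypothetical names; scikit-learn estimators store classes_ as a NumPy array, while a plain sorted list is used here to keep the example dependency-free.

```python
class LabelAwareSketch:
    """Hypothetical classifier that records its labels during fit."""

    def fit(self, X, y):
        # Sorted unique labels; probability columns follow this order.
        self.classes_ = sorted(set(y))
        return self

    def predict_proba(self, X):
        # Placeholder probabilities: uniform over the known classes.
        uniform = 1.0 / len(self.classes_)
        return [[uniform] * len(self.classes_) for _ in X]


def positive_class_column(estimator, X, pos_label):
    """Pick the probability column for pos_label by looking it up in classes_."""
    column = estimator.classes_.index(pos_label)
    return [row[column] for row in estimator.predict_proba(X)]


model = LabelAwareSketch().fit([[0], [1], [2]], ["no", "yes", "no"])
positive_probabilities = positive_class_column(model, [[5]], "yes")
```

Without classes_, the lookup in positive_class_column has nothing to search, which is the kind of failure a missing attribute produces.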

offchan42 avatar Nov 23 '21 20:11 offchan42