mljar-supervised
Make AutoML scikit-learn compatible
I think it would be interesting to add a feature to export the best model with a scikit-learn wrapper. This would allow integrating the best AutoML model into a scikit-learn workflow. I think most of the models that AutoML uses are already from scikit-learn, and those who aren't do provide scikit-learn wrappers, so I think it would be easy to implement. Is there anything that makes this feature 'impossible' to implement?
Hey @spamz23! You are right. It is almost implemented in `AutoML` (which is a wrapper for all algorithms). Maybe `predict_proba` is missing in `AutoML`, but that is because the `predict` method returns both probabilities and labels. This can be fixed easily. Would you like to add it (`predict_proba` in `AutoML`)? I will help you.
Could you give a minimal code example of how you would like to use `AutoML` in a scikit-learn workflow?
In order to provide scikit-learn compatibility, `AutoML` is missing:
- a `predict_proba()` function
- a `score()` function to calculate accuracy
- `get_params()` and `set_params()` functions

Suggestion:
- the `fit()` function should accept `list`, `numpy.ndarray`, or `pandas.DataFrame`
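For illustration, a minimal sketch of how those methods could look on the class (hypothetical: the real `AutoML` class is much more involved, and the uniform `predict_proba` here is just a placeholder for delegation to the best trained model):

```python
import numpy as np

class AutoML:
    """Simplified stand-in showing only the scikit-learn compatibility
    surface discussed above (a sketch, not the real implementation)."""

    def __init__(self, mode="Explain", total_time_limit=3600):
        # scikit-learn convention: __init__ only stores parameters, no logic
        self.mode = mode
        self.total_time_limit = total_time_limit

    def get_params(self, deep=True):
        # expose constructor parameters so tools like GridSearchCV can clone us
        return {"mode": self.mode, "total_time_limit": self.total_time_limit}

    def set_params(self, **params):
        for key, value in params.items():
            setattr(self, key, value)
        return self

    def fit(self, X, y):
        # the real implementation trains models; here we just record the classes
        self.classes_ = np.unique(y)
        return self

    def predict_proba(self, X):
        # placeholder: uniform probabilities instead of real model output
        n = len(self.classes_)
        return np.full((len(X), n), 1.0 / n)

    def predict(self, X):
        return self.classes_[np.argmax(self.predict_proba(X), axis=1)]

    def score(self, X, y):
        # accuracy, as suggested in the list above
        return float(np.mean(self.predict(X) == y))
```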
@pplonski If you want I can start working on rewriting the AutoML wrapper to support the above functions.
@spamz23 sounds good! What do you mean about AutoML wrapper? I think it will be enough to add missing methods to the AutoML class.
Regarding `fit()`: if the data is passed without column names, then column names should be added (they are needed in the Explanation step). We can generate feature names: `feature_{index}`.
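Generating those fallback names could be as simple as the following sketch (the helper name `to_dataframe` and the starting index are assumptions, not the project's actual code):

```python
import numpy as np
import pandas as pd

def to_dataframe(X):
    """Wrap raw input in a DataFrame, generating feature_{index} column
    names when none are present (sketch of the idea discussed above)."""
    if isinstance(X, pd.DataFrame):
        return X  # already has column names
    X = np.asarray(X)
    columns = [f"feature_{i}" for i in range(X.shape[1])]
    return pd.DataFrame(X, columns=columns)
```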
@pplonski I mean add the missing methods, not creating a new wrapper.
@spamz23 I assigned this issue to you! If you need any help, I'm happy to help. You can join mljar Slack channel.
Please check contributing docs. There is a Contributor License Agreement form to be filled.
Ok, I'm working on it. Can I use the numpy documentation standard?
Great! Why not use the scikit-learn docs standard? I have a plan to set up a docs website with mkdocs-material. Which standard, numpy or scikit-learn, would be better for generating API docs? Do you have any ideas?
Hey! Scikit-learn uses the numpy docstring standard. Numpy docstrings are also interpreted by Sphinx.
Great, let's use it!
Hey! I found out that, in order to be scikit-learn compatible, `AutoML` must be divided into `AutoMLClassifier` and `AutoMLRegressor`. You can consult the rules it must follow to be compatible here: https://scikit-learn.org/stable/developers/develop.html
@pplonski do you consent to this change?
No, it doesn't, from what I read. It needs to implement a specific interface that behaves differently for classifiers and regressors.
Can you point to where you found this information (the exact place)? I'm searching and I can't find it (maybe I missed something).
I think it is easy to get lost in the scikit-learn docs. Let's be pragmatic. I can think of one use case where the scikit-learn interface will be needed: `Pipeline`. For example, the user would like to run some custom data preparation before calling AutoML. Let's check which methods `Pipeline` needs and make AutoML compatible with it. That should be easier than digging through the docs.
What do you think @spamz23?
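A quick way to check what `Pipeline` demands from the final step is to duck-type it (sketch: `Pipeline` only requires `fit` on the last step, plus whatever method is then called on the pipeline; `TinyEstimator` is a made-up stand-in):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

class TinyEstimator:
    """Duck-typed final step: only the methods Pipeline actually calls."""

    def fit(self, X, y=None):
        # trivial "model": remember the mean of the target
        self.mean_ = np.mean(y)
        return self

    def predict(self, X):
        # always predict the remembered mean
        return np.full(len(X), self.mean_)

# custom preprocessing before the (stand-in) AutoML step
pipe = Pipeline([("scale", StandardScaler()), ("model", TinyEstimator())])
pipe.fit([[0.0], [1.0], [2.0]], [1.0, 2.0, 3.0])
preds = pipe.predict([[1.0]])
```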
Hey @pplonski ! I think it doesn't specify that directly. But take a look at the following section:
> Some common functionality depends on the kind of estimator passed. For example, cross-validation in model_selection.GridSearchCV and model_selection.cross_val_score defaults to being stratified when used on a classifier, but not otherwise. Similarly, scorers for average precision that take a continuous prediction need to call decision_function for classifiers, but predict for regressors. This distinction between classifiers and regressors is implemented using the _estimator_type attribute, which takes a string value. It should be "classifier" for classifiers and "regressor" for regressors and "clusterer" for clustering methods, to work as expected. Inheriting from ClassifierMixin, RegressorMixin or ClusterMixin will set the attribute automatically. When a meta-estimator needs to distinguish among estimator types, instead of checking _estimator_type directly, helpers like base.is_classifier should be used.
Scikit-learn has a test to check whether an estimator is fully compatible. Right now it is still failing, and I think it will always fail until `_estimator_type` is set.
I think the easiest way to implement this is to make `AutoML` a base class. Then, we can create two classes (`AutoMLClassifier` and `AutoMLRegressor`) that inherit from the base class. I think this would also make the code more scalable and readable.
What are your thoughts?
@spamz23 good catch. Hmmm, I think we can just set `_estimator_type`, and when `ml_task` is known, assign the proper value to it. No inheritance needed.
I would like to keep `AutoML` as a single class. For me, it is part of "Auto ML" to distinguish what to do, regression or classification.
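Setting the attribute once `ml_task` is known could look like this sketch (the task-detection rule here is a made-up stand-in, not MLJAR's real logic):

```python
class AutoML:
    """Minimal sketch: derive _estimator_type from ml_task, no inheritance."""

    def _get_ml_task(self, y):
        # stand-in rule: few unique target values => classification
        return "binary_classification" if len(set(y)) <= 2 else "regression"

    def fit(self, X, y):
        ml_task = self._get_ml_task(y)
        # set the attribute that scikit-learn helpers like base.is_classifier inspect
        if "classification" in ml_task:
            self._estimator_type = "classifier"
        else:
            self._estimator_type = "regressor"
        return self
```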
Is MLJAR sklearn compatible now?
I saw that when I call `sklearn.calibration.is_classifier(automl)` it returns `False` because there's no `_estimator_type` attribute defined.
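The check is cheap to mimic without MLJAR (sketch: in the scikit-learn versions discussed in this thread, `is_classifier` boils down to comparing the `_estimator_type` attribute with `"classifier"`):

```python
def is_classifier_like(estimator):
    # mirrors the documented _estimator_type convention quoted above
    return getattr(estimator, "_estimator_type", None) == "classifier"

class UntaggedAutoML:
    pass  # no _estimator_type attribute => not recognized as a classifier

class TaggedAutoML:
    _estimator_type = "classifier"  # recognized as a classifier
```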
Looks like it was compatible, and after the sklearn 1.0 release some attributes might be missing.
Another attribute that is missing is the `classes_` attribute. I'm trying to call `sklearn.calibration.CalibrationDisplay.from_estimator(automl, X_train, y_train, n_bins=10, name='train')`, which basically just plots the reliability curve of a classifier.