palapalamao opened this issue 7 years ago • 20 comments

[(0.666667, SimpleRegressionPipeline({'imputation:strategy': 'mean', 'one_hot_encoding:use_minimum_fraction': 'True', 'preprocessor:choice': 'no_preprocessing', 'regressor:choice': 'adaboost', 'rescaling:choice': 'minmax', 'one_hot_encoding:minimum_fraction': 0.010000000000000004, 'regressor:adaboost:learning_rate': 0.9890631979261445, 'regressor:adaboost:loss': 'linear', 'regressor:adaboost:max_depth': 10, 'regressor:adaboost:n_estimators': 127}, dataset_properties={ 'task': 4, 'sparse': False, 'multilabel': False, 'multiclass': False, 'target_type': 'regression', 'signed': False})), (0.333333, SimpleRegressionPipeline({'imputation:strategy': 'mean', 'one_hot_encoding:use_minimum_fraction': 'True', 'preprocessor:choice': 'random_trees_embedding', 'regressor:choice': 'liblinear_svr', 'rescaling:choice': 'standardize', 'one_hot_encoding:minimum_fraction': 0.00011808426850838513, 'preprocessor:random_trees_embedding:max_depth': 3, 'preprocessor:random_trees_embedding:max_leaf_nodes': 'None', 'preprocessor:random_trees_embedding:min_samples_leaf': 3, 'preprocessor:random_trees_embedding:min_samples_split': 3, 'preprocessor:random_trees_embedding:min_weight_fraction_leaf': 1.0, 'preprocessor:random_trees_embedding:n_estimators': 68, 'regressor:liblinear_svr:C': 1.4174149191248073, 'regressor:liblinear_svr:dual': 'False', 'regressor:liblinear_svr:epsilon': 0.0328370684051209, 'regressor:liblinear_svr:fit_intercept': 'True', 'regressor:liblinear_svr:intercept_scaling': 1, 'regressor:liblinear_svr:loss': 'squared_epsilon_insensitive', 'regressor:liblinear_svr:tol': 0.0012221149693867595}, dataset_properties={ 'task': 4, 'sparse': False, 'multilabel': False, 'multiclass': False, 'target_type': 'regression', 'signed': False})), ]

R2 score: 0.87227602958

How can I convert the model I ran into scikit-learn code? Could you give me some example code?

palapalamao

Have a look at It would actually be great if the returned ensemble would be a pure scikit-learn model. Not sure how to achieve this, though.

mfeurer

@mfeurer Is this still relevant? Can I get more information about what is required here?

activaigor

I think it would still be great to have this feature. Basically, the final model/ensemble needs to be converted to pure scikit-learn code. Similarly to show_models(), this would print a representation of the models found by Auto-sklearn, but one that could be pasted into Python to instantiate standalone scikit-learn code.
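To make the goal concrete, here is a hand-written sketch of what such pasted code might look like for the two-member ensemble printed in the original post. It is not generated by Auto-sklearn. The mapping of 'regressor:adaboost:max_depth' to the depth of the base decision tree is an assumption, the function name build_ensemble is made up for illustration, and the one_hot_encoding settings and 'min_weight_fraction_leaf': 1.0 are left out because they do not map cleanly onto scikit-learn parameters.

```python
from sklearn.ensemble import AdaBoostRegressor, RandomTreesEmbedding, VotingRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.svm import LinearSVR
from sklearn.tree import DecisionTreeRegressor


def build_ensemble():
    """Approximate the ensemble from the original post with plain scikit-learn."""
    # Member 1 (weight 0.666667): mean imputation, min-max scaling, AdaBoost.
    adaboost = make_pipeline(
        SimpleImputer(strategy="mean"),
        MinMaxScaler(),
        AdaBoostRegressor(
            # Assumption: adaboost:max_depth is the depth of the base tree.
            DecisionTreeRegressor(max_depth=10),
            n_estimators=127,
            learning_rate=0.9890631979261445,
            loss="linear",
        ),
    )
    # Member 2 (weight 0.333333): standardization, tree embedding, linear SVR.
    svr = make_pipeline(
        SimpleImputer(strategy="mean"),
        StandardScaler(),
        RandomTreesEmbedding(
            n_estimators=68,
            max_depth=3,
            max_leaf_nodes=None,
            min_samples_leaf=3,
            min_samples_split=3,
        ),
        LinearSVR(
            C=1.4174149191248073,
            dual=False,
            epsilon=0.0328370684051209,
            fit_intercept=True,
            intercept_scaling=1,
            loss="squared_epsilon_insensitive",
            tol=0.0012221149693867595,
        ),
    )
    # VotingRegressor averages the member predictions using the given weights.
    return VotingRegressor(
        estimators=[("adaboost", adaboost), ("liblinear_svr", svr)],
        weights=[0.666667, 0.333333],
    )
```

Calling build_ensemble().fit(X_train, y_train) refits both members from scratch, so the result will not reproduce the reported R2 score exactly.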

How familiar are you with Auto-sklearn?

mfeurer

@mfeurer I'd love to work on this if it's still considered beneficial.

GeorgePearse

Hi @GeorgePearse,

Having something like TPOT's export is something I think @mfeurer and I discussed a while ago, but feel free to correct me if I'm wrong there, @mfeurer.

There used to be more activity around this feature request from what I remember but I'm sure people would find this feature useful in scenarios where they would like to strip away auto-sklearn for the end model used in production cycles, or to simply play around.

We have a PR (#1321) by another user at the moment that gives access to the underlying models by updating the show_models() function to return a dict mapping from keys to the different pipeline steps and models. This could be used as a basis for accessing all the different components of the end optimization result, where the models would have to be extracted from our wrapper components and then have their hyperparameters filled in correctly.

I'm not entirely sure what the best process is to convert this into a pure Python scikit-learn script, but having access to the ConfigSpace Config that generated the models would be hugely beneficial, as that is how we instantiate them. These configs can also be printed out as a dict, which could make setting up model creation quite easy. I figure this is the most difficult step and the one people would want automated the most: instantiating the models with the hyperparameters we found to be best.
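As a rough illustration of working from such a dict, the sketch below pulls the chosen component and its hyperparameters out of a flat config like the one printed in the original post. The helper names coerce and component_kwargs are hypothetical, and the assumed key layout ('<step>:choice' and '<step>:<choice>:<hyperparameter>') is read off that printed output, not taken from the ConfigSpace API.

```python
def coerce(value):
    """Turn the string literals 'True', 'False', and 'None' into Python objects."""
    if isinstance(value, str):
        return {"True": True, "False": False, "None": None}.get(value, value)
    return value


def component_kwargs(config, step):
    """Return (choice, kwargs) for one pipeline step of a flat config dict."""
    choice = config[f"{step}:choice"]
    prefix = f"{step}:{choice}:"
    kwargs = {
        key[len(prefix):]: coerce(value)
        for key, value in config.items()
        if key.startswith(prefix)
    }
    return choice, kwargs


# Excerpt of the first configuration shown in the original post.
config = {
    "regressor:choice": "adaboost",
    "rescaling:choice": "minmax",
    "regressor:adaboost:learning_rate": 0.9890631979261445,
    "regressor:adaboost:loss": "linear",
    "regressor:adaboost:max_depth": 10,
    "regressor:adaboost:n_estimators": 127,
}

choice, kwargs = component_kwargs(config, "regressor")
print(choice, kwargs)
```

The resulting kwargs still use Auto-sklearn's hyperparameter names, so a further per-component mapping onto scikit-learn constructor arguments would be needed.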

I will get back to you if I can think of any other helpful pointer but I would be happy to help out and discuss for getting this feature in :)

eddiebergman

Cheers @eddiebergman it's an interesting problem, just looking through TPOT's implementation now and will start digging into the internals of this repo in a second.

GeorgePearse

As a warning I'm unlikely to give this a real crack until the 26th Dec 2021 and beyond. If anyone can give it a go before then by all means go for it.

GeorgePearse

No problem, you probably won't get much feedback until mid January in that case, feel free to work on it before then if you'd like but there's no rush, thanks for the contribution offer :)

eddiebergman

I once had a look into this feature, but coded it as a standalone function. Instead it should be built into Auto-sklearn, for example, one should be able to do classifier.export_to_sklearn() and get a scikit-learn-only model. Nevertheless, here's the code for reference:

import os
import pickle
import types

import numpy as np
import sklearn.compose
import sklearn.datasets
import sklearn.ensemble
import sklearn.model_selection
import sklearn.pipeline
import sklearn.preprocessing

import autosklearn.estimators
import autosklearn.pipeline.base
import autosklearn.pipeline.components.base
import autosklearn.pipeline.components.data_preprocessing.data_preprocessing
import autosklearn.pipeline.components.data_preprocessing.balancing.balancing
import autosklearn.pipeline.components.data_preprocessing.data_preprocessing_numerical
import autosklearn.pipeline.components.data_preprocessing.data_preprocessing_categorical

bunch = sklearn.datasets.fetch_openml(data_id=40981, as_frame=True)
y = bunch['target'].to_numpy()
X = bunch['data'].to_numpy(float)  # the np.float alias was removed in NumPy 1.24

X_train, X_test, y_train, y_test = \
     sklearn.model_selection.train_test_split(X, y, random_state=1)
feat_type = ['Categorical' if x.name == 'category' else 'Numerical' for x in bunch['data'].dtypes]

pickle_name = 'model.pkl'
if not os.path.exists(pickle_name):
    # Fit once and cache the fitted estimator on disk.
    cls = autosklearn.estimators.AutoSklearnClassifier(time_left_for_this_task=60)
    cls.fit(X_train, y_train, feat_type=feat_type)
    with open(pickle_name, 'wb') as fh:
        pickle.dump(cls, fh)
else:
    # Reuse the cached estimator from a previous run.
    with open(pickle_name, 'rb') as fh:
        cls = pickle.load(fh)

askl_ensemble = sklearn.ensemble.VotingClassifier(estimators=None, voting='soft')
weights = []
models = []
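# NOTE: parts of this loop were lost from the original post. The reconstruction below
# assumes the attribute names identifiers_, models_, and cv_models_.
for weight, identifier in zip(list(cls.automl_.ensemble_.weights_),
                              cls.automl_.ensemble_.identifiers_):
    if weight == 0.0:
        continue  # skip models that do not contribute to the ensemble
    weights.append(weight)
    try:
        models.append(cls.automl_.models_[identifier])
    except KeyError:
        # Assumed fallback for models that were fitted with cross-validation.
        models.append(cls.automl_.cv_models_[identifier])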

askl_ensemble.estimators = models
askl_ensemble.estimators_ = models
askl_ensemble.weights = weights
askl_ensemble.le_ = sklearn.preprocessing.LabelEncoder().fit(y_train)
askl_ensemble.classes_ = askl_ensemble.le_.classes_



def extract_sklearn_object(obj):
    if isinstance(obj, sklearn.ensemble.VotingClassifier):
        estimators = [extract_sklearn_object(estimator) for estimator in obj.estimators_]
        obj.estimators = estimators
        obj.estimators_ = estimators
        return obj
    elif isinstance(obj, autosklearn.pipeline.base.BasePipeline):
        steps = []
        for name, step in obj.steps:
            steps.append((name, extract_sklearn_object(step)))
        return sklearn.pipeline.Pipeline(steps=steps)
    elif isinstance(obj, autosklearn.pipeline.components.data_preprocessing.data_preprocessing.DataPreprocessor):
        # TODO Make the auto-sklearn object an actual column transformer or make it a learnable
        #  attribute column_transformer_
        column_transformer = obj.column_transformer
        transformers = []
        for name, trans, column in column_transformer.transformers_:
            transformers.append((name, extract_sklearn_object(trans), column))
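        # NOTE: the constructor arguments were lost from the original post; passing the
        # rebuilt transformers is a minimal reconstruction, not necessarily the author's call.
        column_transformer = sklearn.compose.ColumnTransformer(transformers=transformers)
        return column_transformer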
    elif isinstance(obj, autosklearn.pipeline.components.base.AutoSklearnChoice):
        # TODO make choice a fit-recognizing attribute: obj.choice_
        return extract_sklearn_object(obj.choice)
    elif isinstance(obj, autosklearn.pipeline.components.data_preprocessing.balancing.balancing.Balancing):
        # TODO implement the actual behavior of weighting!!!
        return 'passthrough'
    elif isinstance(obj, autosklearn.pipeline.components.base.AutoSklearnPreprocessingAlgorithm):
        # TODO make preprocessor preprocessor_
        return obj.preprocessor
    elif isinstance(obj, autosklearn.pipeline.components.base.AutoSklearnClassificationAlgorithm):
        return obj.estimator
    else:
        raise TypeError(type(obj))

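def verify_only_sklearn_objects(obj):
    # NOTE: the branch bodies were lost from the original post; the return/pass
    # statements and recursive calls below are a reconstruction from context.
    if (
        obj is None
        or isinstance(obj, (int, float, str))
        or isinstance(obj, types.FunctionType)
        or isinstance(obj, (np.random.RandomState, np.int32, np.int64, np.uint32, np.uint64,
                            np.void, np.float64, np.bool_))
        # Identity check instead of `in`, which would compare arrays element-wise.
        or any(obj is dtype for dtype in (np.float64, np.bool_))
    ):
        return  # plain values need no further inspection
    elif isinstance(obj, (list, tuple, np.ndarray, set)):
        pass
    elif obj.__class__.__module__.startswith('sklearn.'):
        pass
    elif obj.__class__.__module__.startswith('autosklearn.pipeline.implementations.'):
        pass
    else:
        raise TypeError((type(obj), obj))

    # Recurse into attributes and container entries.
    if hasattr(obj, '__dict__'):
        for key in vars(obj):
            verify_only_sklearn_objects(vars(obj)[key])
    elif isinstance(obj, (list, tuple, np.ndarray, set)):
        for entry in obj:
            verify_only_sklearn_objects(entry)
    elif obj.__class__.__module__.startswith('sklearn.'):
        # These are private sklearn objects
        pass
    else:
        raise TypeError((type(obj), obj))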

# TODO what about the stuff from that's done prior to fitting?
# TODO add necessary imports! - also add the full class names
# TODO what about the random states? Set them as integers in auto-sklearn to be reproducible?
# TODO Improve the printing to be more readable
# TODO add a few tests that the export is done correctly
extracted_model = extract_sklearn_object(askl_ensemble)

Most importantly, I think every component should by itself know how to convert itself to a scikit-learn object instead of having all this information in a central function as shown here.
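A minimal sketch of that design, assuming each wrapper exposes a to_sklearn() method: the classes below are hypothetical stand-ins written for this illustration, not Auto-sklearn's actual components, and the method name is only a placeholder.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class EstimatorWrapper:
    """Hypothetical wrapper component that holds one scikit-learn object."""

    def __init__(self, estimator):
        self.estimator = estimator

    def to_sklearn(self):
        # A leaf component simply hands back the object it wraps.
        return self.estimator


class ChoiceWrapper:
    """Hypothetical component that delegates to one chosen sub-component."""

    def __init__(self, choice):
        self.choice = choice

    def to_sklearn(self):
        return self.choice.to_sklearn()


class PipelineWrapper:
    """Hypothetical pipeline whose steps are wrapper components."""

    def __init__(self, steps):
        self.steps = steps

    def to_sklearn(self):
        # Each step converts itself; the pipeline only reassembles the results.
        return Pipeline([(name, step.to_sklearn()) for name, step in self.steps])


wrapped = PipelineWrapper([
    ("rescaling", ChoiceWrapper(EstimatorWrapper(StandardScaler()))),
    ("classifier", ChoiceWrapper(EstimatorWrapper(LogisticRegression()))),
])
exported = wrapped.to_sklearn()
```

Compared with the central isinstance chain above, adding a new component type here only requires giving that class its own to_sklearn() method.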

mfeurer

Hey @GeorgePearse did you already get started? If not, I could have a look on Wednesday morning.

mfeurer

Hi @mfeurer sorry for the radio silence. You go for it, didn't really get anywhere. Looking forward to seeing the implementation though.

GeorgePearse

No worries, there's now a draft in #1375.

mfeurer

Hi @mfeurer, this feature is quite useful for us as we'd like to ultimately use kserve to serve the autosklearn models. I took a look at the draft: will only the best model be considered, or will there be a way to export the other models found during the trial?
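For context, the sketch below shows why a scikit-learn-only export matters for serving: once the model contains nothing but scikit-learn objects, it can be saved and reloaded with joblib in an environment where Auto-sklearn is not installed. The pipeline here is a stand-in for an exported model, and the file name model.joblib is an assumption about what a serving runtime expects.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=100, n_features=4, random_state=0)

# Stand-in for a model exported from Auto-sklearn as pure scikit-learn objects.
model = make_pipeline(StandardScaler(), LogisticRegression()).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.joblib")  # assumed file name
    joblib.dump(model, path)
    restored = joblib.load(path)

# The reloaded model should reproduce the original predictions.
same_predictions = bool((restored.predict(X) == model.predict(X)).all())
```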

mereldawu

Yes and no. This will add functionality to the AutoSklearnClassifier class that exports only the models that are part of the ensemble. It will also add an export function to each individual model. As @eddiebergman pointed out in #1376, he is working on a function to easily access all models stored on disk, so exporting the other models will be possible too.

mfeurer

Hi @mfeurer, thank you for working on this issue! This will allow us to leverage the power of autosklearn in production pipelines that are built around sklearn pipelines. Looking forward to using this feature.

roch-gla

Hi @mfeurer, how can we use the function to_sklearn() in the latest version? I can't find this function inside AutoSklearnClassifier now.

xieleo5

Hey, I would like to contribute to this issue. Please assign this to me.

kunjshukla