Convert to scikit-learn code

Open palapalamao opened this issue 7 years ago • 20 comments

[(0.666667, SimpleRegressionPipeline({'imputation:strategy': 'mean', 'one_hot_encoding:use_minimum_fraction': 'True', 'preprocessor:choice': 'no_preprocessing', 'regressor:choice': 'adaboost', 'rescaling:choice': 'minmax', 'one_hot_encoding:minimum_fraction': 0.010000000000000004, 'regressor:adaboost:learning_rate': 0.9890631979261445, 'regressor:adaboost:loss': 'linear', 'regressor:adaboost:max_depth': 10, 'regressor:adaboost:n_estimators': 127}, dataset_properties={ 'task': 4, 'sparse': False, 'multilabel': False, 'multiclass': False, 'target_type': 'regression', 'signed': False})),
(0.333333, SimpleRegressionPipeline({'imputation:strategy': 'mean', 'one_hot_encoding:use_minimum_fraction': 'True', 'preprocessor:choice': 'random_trees_embedding', 'regressor:choice': 'liblinear_svr', 'rescaling:choice': 'standardize', 'one_hot_encoding:minimum_fraction': 0.00011808426850838513, 'preprocessor:random_trees_embedding:max_depth': 3, 'preprocessor:random_trees_embedding:max_leaf_nodes': 'None', 'preprocessor:random_trees_embedding:min_samples_leaf': 3, 'preprocessor:random_trees_embedding:min_samples_split': 3, 'preprocessor:random_trees_embedding:min_weight_fraction_leaf': 1.0, 'preprocessor:random_trees_embedding:n_estimators': 68, 'regressor:liblinear_svr:C': 1.4174149191248073, 'regressor:liblinear_svr:dual': 'False', 'regressor:liblinear_svr:epsilon': 0.0328370684051209, 'regressor:liblinear_svr:fit_intercept': 'True', 'regressor:liblinear_svr:intercept_scaling': 1, 'regressor:liblinear_svr:loss': 'squared_epsilon_insensitive', 'regressor:liblinear_svr:tol': 0.0012221149693867595}, dataset_properties={ 'task': 4, 'sparse': False, 'multilabel': False, 'multiclass': False, 'target_type': 'regression', 'signed': False})),
]

R2 score: 0.87227602958

How do I convert the model I ran into scikit-learn code? Could you give me some example code?

palapalamao avatar Nov 14 '17 04:11 palapalamao

Have a look at https://github.com/automl/auto-sklearn/issues/30. It would actually be great if the returned ensemble would be a pure scikit-learn model. Not sure how to achieve this, though.

mfeurer avatar Nov 15 '17 17:11 mfeurer

@mfeurer Is this still relevant? Can I get more information about what is required here?

activaigor avatar Jan 31 '19 17:01 activaigor

I think it would still be great to have this feature. Basically, the final model/ensemble needs to be converted to pure scikit-learn code. Similar to show_models(), this would print a representation of the models found by Auto-sklearn, but one that could be pasted into Python to instantiate standalone scikit-learn objects.
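
For illustration only, a hand-written translation of the first pipeline from the original post (the one with weight 0.666667) might look roughly like the sketch below. It is not output produced by Auto-sklearn. It assumes that all features are numerical, so the one-hot encoding step is left out, and that 'regressor:adaboost:max_depth' refers to the depth of the base decision tree.

```python
from sklearn.ensemble import AdaBoostRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeRegressor

# Assumption: max_depth from the config belongs to the base tree, not to AdaBoost.
base_tree = DecisionTreeRegressor(max_depth=10)

pipeline = Pipeline(steps=[
    # 'imputation:strategy': 'mean'
    ("imputation", SimpleImputer(strategy="mean")),
    # 'rescaling:choice': 'minmax'
    ("rescaling", MinMaxScaler()),
    # 'regressor:choice': 'adaboost'; the base estimator is passed positionally
    # because its keyword name differs between scikit-learn versions.
    ("regressor", AdaBoostRegressor(
        base_tree,
        n_estimators=127,
        learning_rate=0.9890631979261445,
        loss="linear",
    )),
])
```

Such a pipeline is unfitted, so it would still need a call to pipeline.fit(X_train, y_train). The second pipeline and the ensemble weights would have to be handled separately.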

How familiar are you with Auto-sklearn?

mfeurer avatar Feb 06 '19 16:02 mfeurer

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs for the next 7 days. Thank you for your contributions.

github-actions[bot] avatar May 05 '21 01:05 github-actions[bot]

@mfeurer I'd love to work on this if it's still considered beneficial.

GeorgePearse avatar Dec 18 '21 19:12 GeorgePearse

Hi @GeorgePearse,

Having something like TPOT's export is something I think @mfeurer and I discussed a while ago, but feel free to correct me if I'm wrong there, @mfeurer.

There used to be more activity around this feature request from what I remember but I'm sure people would find this feature useful in scenarios where they would like to strip away auto-sklearn for the end model used in production cycles, or to simply play around.

We have a PR (#1321) by another user at the moment that gives access to the underlying models by updating the show_models() function to return a dict mapping from keys to the different pipeline steps and models. This could be used as a basis for accessing all the different components of the end optimization result, where the models would have to be extracted from our wrapper components and then have their hyperparameters filled in correctly.

I'm not entirely sure what the best process is for converting this into a pure Python scikit-learn script, but having access to the ConfigSpace Config that generated the models would be hugely beneficial, as that is how we instantiate them. These configs can also be printed out as a dict, which could make setting up model creation quite easy. I figure this is the most difficult step and the one people would most want automated: instantiating the models with the hyperparameters we found to be best.
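
As a rough sketch of that step, the flat 'step:component:param' keys seen in the original post could be grouped per component before any models are created. The helper name group_config is hypothetical and not part of Auto-sklearn or ConfigSpace. The sketch also assumes that the strings 'True', 'False' and 'None' stand for the corresponding Python literals.

```python
from collections import defaultdict

# Assumption: these strings in a config dict represent Python literals.
_LITERALS = {"True": True, "False": False, "None": None}


def group_config(config):
    """Split a flat config dict into component choices and per-component params.

    Hypothetical helper for illustration; not an Auto-sklearn function.
    """
    choices = {}                # step name -> chosen component name
    params = defaultdict(dict)  # (step, component) -> hyperparameters
    for key, value in config.items():
        if isinstance(value, str) and value in _LITERALS:
            value = _LITERALS[value]
        parts = key.split(":")
        if len(parts) == 2 and parts[1] == "choice":
            choices[parts[0]] = value
        elif len(parts) == 3:
            step, component, name = parts
            params[(step, component)][name] = value
        else:
            # Step-level setting without a component, e.g. 'imputation:strategy'.
            params[(parts[0], None)][":".join(parts[1:])] = value
    return choices, dict(params)


# Subset of the first configuration from the original post.
config = {
    "imputation:strategy": "mean",
    "preprocessor:choice": "no_preprocessing",
    "regressor:choice": "adaboost",
    "rescaling:choice": "minmax",
    "regressor:adaboost:learning_rate": 0.9890631979261445,
    "regressor:adaboost:loss": "linear",
    "regressor:adaboost:max_depth": 10,
    "regressor:adaboost:n_estimators": 127,
}
choices, params = group_config(config)
```

The grouped values are not necessarily valid scikit-learn constructor arguments as they stand; each component would still need its own mapping from Auto-sklearn hyperparameter names to scikit-learn ones.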

I will get back to you if I can think of any other helpful pointers, but I would be happy to help out and discuss how to get this feature in :)

eddiebergman avatar Dec 18 '21 20:12 eddiebergman

Cheers @eddiebergman it's an interesting problem, just looking through TPOT's implementation now and will start digging into the internals of this repo in a second.

GeorgePearse avatar Dec 18 '21 20:12 GeorgePearse

As a warning I'm unlikely to give this a real crack until the 26th Dec 2021 and beyond. If anyone can give it a go before then by all means go for it.

GeorgePearse avatar Dec 19 '21 20:12 GeorgePearse

No problem, you probably won't get much feedback until mid January in that case, feel free to work on it before then if you'd like but there's no rush, thanks for the contribution offer :)

eddiebergman avatar Dec 20 '21 11:12 eddiebergman

I once had a look into this feature, but coded it as a standalone function. Instead it should be built into Auto-sklearn, for example, one should be able to do classifier.export_to_sklearn() and get a scikit-learn-only model. Nevertheless, here's the code for reference:

import os
import pickle
import types

import numpy as np
import sklearn.compose
import sklearn.datasets
import sklearn.ensemble
import sklearn.model_selection
import sklearn.pipeline
import sklearn.preprocessing

import autosklearn.estimators
import autosklearn.pipeline.base
import autosklearn.pipeline.components.base
import autosklearn.pipeline.components.data_preprocessing.data_preprocessing
import autosklearn.pipeline.components.data_preprocessing.balancing.balancing
import autosklearn.pipeline.components.data_preprocessing.data_preprocessing_numerical
import autosklearn.pipeline.components.data_preprocessing.data_preprocessing_categorical


bunch = sklearn.datasets.fetch_openml(data_id=40981, as_frame=True)
y = bunch['target'].to_numpy()
X = bunch['data'].to_numpy(float)  # builtin float; the np.float alias was removed in NumPy 1.24

X_train, X_test, y_train, y_test = \
     sklearn.model_selection.train_test_split(X, y, random_state=1)
feat_type = ['Categorical' if x.name == 'category' else 'Numerical' for x in bunch['data'].dtypes]


pickle_name = 'model.pkl'
if not os.path.exists(pickle_name):
    cls = autosklearn.estimators.AutoSklearnClassifier(time_left_for_this_task=60)
    cls.fit(X_train, y_train, feat_type=feat_type)
    with open(pickle_name, 'wb') as fh:
        pickle.dump(cls, fh)
else:
    with open(pickle_name, 'rb') as fh:
        cls = pickle.load(fh)


askl_ensemble = sklearn.ensemble.VotingClassifier(estimators=None, voting='soft')
weights = []
models = []
for weight, identifier in zip(list(cls.automl_.ensemble_.weights_),
                              list(cls.automl_.ensemble_.identifiers_)):
    if weight == 0.0:
        continue
    weights.append(weight)
    try:
        models.append(cls.automl_.models_[identifier])
    except KeyError:
        print(cls.automl_.ensemble_)
        print(cls.automl_.ensemble_.identifiers_)
        print(cls.automl_.models_)
        raise

askl_ensemble.estimators = models
askl_ensemble.estimators_ = models
askl_ensemble.weights = weights
askl_ensemble.le_ = sklearn.preprocessing.LabelEncoder().fit(y_train)
askl_ensemble.classes_ = askl_ensemble.le_.classes_

#print(askl_ensemble.predict(X_test))
#print(cls.predict(X_test))

#print(askl_ensemble.__repr__(N_CHAR_MAX=100000))


def extract_sklearn_object(obj):
    if isinstance(obj, sklearn.ensemble.VotingClassifier):
        estimators = [extract_sklearn_object(estimator) for estimator in obj.estimators_]
        obj.estimators = estimators
        obj.estimators_ = estimators
        return obj
    elif isinstance(obj, autosklearn.pipeline.base.BasePipeline):
        steps = []
        for name, step in obj.steps:
            steps.append((name, extract_sklearn_object(step)))
        return sklearn.pipeline.Pipeline(
            steps=steps,
            memory=obj.memory,
            verbose=obj.verbose,
        )
    elif isinstance(obj, autosklearn.pipeline.components.data_preprocessing.data_preprocessing.DataPreprocessor):
        # TODO Make the auto-sklearn object an actual column transformer or make it a learnable
        #  attribute column_transformer_
        column_transformer = obj.column_transformer
        transformers = []
        for name, trans, column in column_transformer.transformers_:
            transformers.append((name, extract_sklearn_object(trans), column))
        column_transformer = sklearn.compose.ColumnTransformer(
            transformers=transformers,
            remainder=column_transformer.remainder,
            sparse_threshold=column_transformer.sparse_threshold,
            n_jobs=column_transformer.n_jobs,
            transformer_weights=column_transformer.transformer_weights,
            verbose=column_transformer.verbose,
        )
        return column_transformer
    elif isinstance(obj, autosklearn.pipeline.components.base.AutoSklearnChoice):
        # TODO make choice a fit-recognizing attribute: obj.choice_
        return extract_sklearn_object(obj.choice)
    elif isinstance(obj, autosklearn.pipeline.components.data_preprocessing.balancing.balancing.Balancing):
        # TODO implement the actual behavior of weighting!!!
        return 'passthrough'
    elif isinstance(obj, autosklearn.pipeline.components.base.AutoSklearnPreprocessingAlgorithm):
        # TODO make preprocessor preprocessor_
        return obj.preprocessor
    elif isinstance(obj, autosklearn.pipeline.components.base.AutoSklearnClassificationAlgorithm):
        return obj.estimator
    else:
        raise TypeError(type(obj))


def verify_only_sklearn_objects(obj):
    if (
        obj is None
        or isinstance(obj, (int, float, str))
        or isinstance(obj, types.FunctionType)
        or isinstance(obj, (np.random.RandomState, np.int32, np.int64, np.uint32, np.uint64,
                            np.void, np.float64, np.bool_))
        or any(obj is t for t in (np.float64, np.bool_))
    ):
        return
    elif isinstance(obj, (list, tuple, np.ndarray, set)):
        pass
    elif obj.__class__.__module__.startswith('sklearn.'):
        pass
    elif obj.__class__.__module__.startswith('autosklearn.pipeline.implementations.'):
        pass
    else:
        raise TypeError((type(obj), obj))

    if hasattr(obj, '__dict__'):
        for key in vars(obj):
            verify_only_sklearn_objects(vars(obj)[key])
    elif isinstance(obj, (list, tuple, np.ndarray, set)):
        for entry in obj:
            verify_only_sklearn_objects(entry)
    elif obj.__class__.__module__.startswith('sklearn.'):
        # These are private sklearn objects
        pass
    else:
        raise TypeError((type(obj), obj))


# TODO what about the stuff from validation.py that's done prior to fitting?
# TODO add necessary imports! - also add the full class names
# TODO what about the random states? Set them as integers in auto-sklearn to be reproducible?
# TODO Improve the printing to be more readable
# TODO add a few tests that the export is done correctly
extracted_model = extract_sklearn_object(askl_ensemble)
verify_only_sklearn_objects(extracted_model)
print(extracted_model.__repr__(N_CHAR_MAX=1000000))

Most importantly, I think every component should by itself know how to convert itself to a scikit-learn object instead of having all this information in a central function as shown here.
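
A minimal sketch of that idea is shown below. All class names here are hypothetical stand-ins, not the real Auto-sklearn component classes, and the to_sklearn() method name is an assumption made for illustration. Each wrapper only knows how to export itself, and containers delegate to their children.

```python
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler


class Component:
    """Hypothetical stand-in for an Auto-sklearn wrapper component."""

    def to_sklearn(self):
        raise NotImplementedError


class RescalingComponent(Component):
    def __init__(self):
        self.preprocessor = MinMaxScaler()

    def to_sklearn(self):
        # A leaf component returns the scikit-learn object it wraps.
        return self.preprocessor


class RegressorComponent(Component):
    def __init__(self, alpha):
        self.estimator = Ridge(alpha=alpha)

    def to_sklearn(self):
        return self.estimator


class ChoiceComponent(Component):
    def __init__(self, choice):
        self.choice = choice

    def to_sklearn(self):
        # A choice delegates to whichever component was selected.
        return self.choice.to_sklearn()


class PipelineComponent(Component):
    def __init__(self, steps):
        self.steps = steps

    def to_sklearn(self):
        # A container exports each step and wraps them in a plain Pipeline.
        return Pipeline([(name, step.to_sklearn()) for name, step in self.steps])


wrapped = PipelineComponent([
    ("rescaling", ChoiceComponent(RescalingComponent())),
    ("regressor", ChoiceComponent(RegressorComponent(alpha=1.0))),
])
exported = wrapped.to_sklearn()
```

With this layout the central isinstance chain from the script above would no longer be needed, since adding a new component only requires implementing its own export method.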

mfeurer avatar Dec 21 '21 08:12 mfeurer

Hey @GeorgePearse did you already get started? If not, I could have a look on Wednesday morning.

mfeurer avatar Jan 17 '22 14:01 mfeurer

Hi @mfeurer sorry for the radio silence. You go for it, didn't really get anywhere. Looking forward to seeing the implementation though.

GeorgePearse avatar Jan 17 '22 18:01 GeorgePearse

No worries, there's now a draft in #1375.

mfeurer avatar Jan 20 '22 13:01 mfeurer

Hi @mfeurer, this feature is quite useful for us as we'd like to ultimately use kserve to serve the autosklearn models. I took a look at the draft, will only the best model be considered? Or will there be a way to export the other models found during the trial?

mereldawu avatar Jan 23 '22 17:01 mereldawu

Yes and no. This will add a functionality to the class AutoSklearnClassifier that will only export the models that are part of the ensemble. But also, this will add an export function to each individual model. As @eddiebergman pointed out in #1376 he is working on a function to easily access all models stored on disk, so it will be possible.

mfeurer avatar Jan 24 '22 08:01 mfeurer

Hi @mfeurer, thank you for working on this issue! This will allow us to leverage the power of autosklearn in production pipelines that are implemented for sklearn pipelines. Looking forward to using this feature.

roch-gla avatar Apr 21 '22 13:04 roch-gla

Hi @mfeurer, how can we use the function to_sklearn() in the latest version? I can't find this function inside AutoSklearnClassifier now.

xieleo5 avatar Jan 17 '23 20:01 xieleo5

Hey, I would like to contribute to this issue. Please assign this to me.

kunjshukla avatar Jul 03 '23 14:07 kunjshukla