auto-sklearn
Convert to scikit-learn code
[(0.666667, SimpleRegressionPipeline({'imputation:strategy': 'mean', 'one_hot_encoding:use_minimum_fraction': 'True', 'preprocessor:choice': 'no_preprocessing', 'regressor:choice': 'adaboost', 'rescaling:choice': 'minmax', 'one_hot_encoding:minimum_fraction': 0.010000000000000004, 'regressor:adaboost:learning_rate': 0.9890631979261445, 'regressor:adaboost:loss': 'linear', 'regressor:adaboost:max_depth': 10, 'regressor:adaboost:n_estimators': 127}, dataset_properties={'task': 4, 'sparse': False, 'multilabel': False, 'multiclass': False, 'target_type': 'regression', 'signed': False})),
 (0.333333, SimpleRegressionPipeline({'imputation:strategy': 'mean', 'one_hot_encoding:use_minimum_fraction': 'True', 'preprocessor:choice': 'random_trees_embedding', 'regressor:choice': 'liblinear_svr', 'rescaling:choice': 'standardize', 'one_hot_encoding:minimum_fraction': 0.00011808426850838513, 'preprocessor:random_trees_embedding:max_depth': 3, 'preprocessor:random_trees_embedding:max_leaf_nodes': 'None', 'preprocessor:random_trees_embedding:min_samples_leaf': 3, 'preprocessor:random_trees_embedding:min_samples_split': 3, 'preprocessor:random_trees_embedding:min_weight_fraction_leaf': 1.0, 'preprocessor:random_trees_embedding:n_estimators': 68, 'regressor:liblinear_svr:C': 1.4174149191248073, 'regressor:liblinear_svr:dual': 'False', 'regressor:liblinear_svr:epsilon': 0.0328370684051209, 'regressor:liblinear_svr:fit_intercept': 'True', 'regressor:liblinear_svr:intercept_scaling': 1, 'regressor:liblinear_svr:loss': 'squared_epsilon_insensitive', 'regressor:liblinear_svr:tol': 0.0012221149693867595}, dataset_properties={'task': 4, 'sparse': False, 'multilabel': False, 'multiclass': False, 'target_type': 'regression', 'signed': False})),
]

R2 score: 0.87227602958

How can I convert the model I trained into scikit-learn code? Could you give me some example code?
Have a look at https://github.com/automl/auto-sklearn/issues/30. It would actually be great if the returned ensemble were a pure scikit-learn model. Not sure how to achieve this, though.
@mfeurer is this still relevant? Can I get more information about what is required here?
I think it would still be great to have this feature. Basically, the final model/ensemble needs to be converted to pure scikit-learn code. Similar to show_models(), this would print a representation of the models found by Auto-sklearn, but one that could be pasted into Python to instantiate standalone scikit-learn code.
How familiar are you with Auto-sklearn?
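To make the goal concrete, here is a sketch of the kind of standalone snippet such an export might print, built from the adaboost configuration in the question above. The step names and the overall export format are assumptions for illustration, not Auto-sklearn's actual output:

# Hypothetical export output: a standalone scikit-learn pipeline that
# can be pasted and run without auto-sklearn. Hyperparameter values are
# taken from the adaboost configuration in the question above.
from sklearn.ensemble import AdaBoostRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeRegressor

pipeline = Pipeline(steps=[
    ('imputation', SimpleImputer(strategy='mean')),
    ('rescaling', MinMaxScaler()),
    # 'base_estimator' was renamed to 'estimator' in scikit-learn >= 1.2.
    ('regressor', AdaBoostRegressor(
        base_estimator=DecisionTreeRegressor(max_depth=10),
        learning_rate=0.9890631979261445,
        loss='linear',
        n_estimators=127,
    )),
])
# pipeline.fit(X_train, y_train); pipeline.predict(X_test)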
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs for the next 7 days. Thank you for your contributions.
@mfeurer I'd love to work on this if it's still considered beneficial.
Hi @GeorgePearse,
Having something like TPOT's export is something I think @mfeurer and I discussed a while ago, but feel free to correct me if I'm wrong there, @mfeurer.
There used to be more activity around this feature request from what I remember, but I'm sure people would find it useful in scenarios where they want to strip auto-sklearn away from the end model used in production, or simply to play around.
We have a PR (#1321) by another user at the moment that gives access to the underlying models by updating the show_models() function to return a dict mapping from keys to the different pipeline steps and the models. This could be used as a basis for accessing all the different components of the final optimization result, where the models would have to be extracted from our wrapper components and then have their hyperparameters filled in correctly.
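For illustration, a sketch of how that dict might be used once the PR lands. The entry keys ('ensemble_weight', 'sklearn_classifier') are assumptions based on the structure described above:

# Sketch only: assumes a fitted AutoSklearnClassifier `automl` and that
# show_models() returns a dict keyed by model id whose entries expose
# the ensemble weight and the unwrapped scikit-learn estimator.
for model_id, entry in automl.show_models().items():
    weight = entry['ensemble_weight']        # assumed key
    estimator = entry['sklearn_classifier']  # assumed key
    print(model_id, weight, type(estimator).__name__)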
I'm not entirely sure what the best process is to convert this into a pure Python scikit-learn script, but having access to the Config of ConfigSpace that generated the models would be hugely beneficial, as that is how we instantiate them. These configs can also be printed out as a dict, which could make setting up model creation quite easy. I figure this is the most difficult step and the one people would want automated the most: instantiating the models with the hyperparameters we found to be best.
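As a rough sketch of that last step, here is how one might go from such a config dict to a scikit-learn estimator, using the liblinear_svr configuration from the question at the top of this thread. The prefix stripping and the string-to-boolean conversion are assumptions about how one would post-process the dict:

import sklearn.svm

# Sketch: turn a ConfigSpace Configuration, printed as a flat dict of
# prefixed hyperparameters (config.get_dictionary()), into an estimator.
# The values come from the liblinear_svr model in the question above.
config = {
    'regressor:choice': 'liblinear_svr',
    'regressor:liblinear_svr:C': 1.4174149191248073,
    'regressor:liblinear_svr:dual': 'False',
    'regressor:liblinear_svr:epsilon': 0.0328370684051209,
    'regressor:liblinear_svr:fit_intercept': 'True',
    'regressor:liblinear_svr:intercept_scaling': 1,
    'regressor:liblinear_svr:loss': 'squared_epsilon_insensitive',
    'regressor:liblinear_svr:tol': 0.0012221149693867595,
}

# Strip the component prefix to recover plain keyword arguments.
prefix = 'regressor:liblinear_svr:'
params = {k[len(prefix):]: v for k, v in config.items() if k.startswith(prefix)}
# The config stores booleans as the strings 'True'/'False'; convert them.
for key in ('dual', 'fit_intercept'):
    params[key] = params[key] == 'True'
regressor = sklearn.svm.LinearSVR(**params)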
I will get back to you if I can think of any other helpful pointers, but I would be happy to help out and discuss getting this feature in :)
Cheers @eddiebergman, it's an interesting problem. Just looking through TPOT's implementation now, and I will start digging into the internals of this repo in a second.
As a warning, I'm unlikely to give this a real crack until 26th Dec 2021 or beyond. If anyone can give it a go before then, by all means go for it.
No problem, you probably won't get much feedback until mid-January in that case. Feel free to work on it before then if you'd like, but there's no rush. Thanks for the contribution offer :)
I once had a look into this feature, but coded it as a standalone function. Instead, it should be built into Auto-sklearn; for example, one should be able to call classifier.export_to_sklearn() and get a scikit-learn-only model. Nevertheless, here's the code for reference:
import os
import pickle
import types

import numpy as np
import sklearn.compose  # needed for ColumnTransformer below
import sklearn.datasets
import sklearn.ensemble
import sklearn.model_selection
import sklearn.pipeline
import sklearn.preprocessing

import autosklearn.estimators
import autosklearn.pipeline.base
import autosklearn.pipeline.components.base
import autosklearn.pipeline.components.data_preprocessing.data_preprocessing
import autosklearn.pipeline.components.data_preprocessing.balancing.balancing
import autosklearn.pipeline.components.data_preprocessing.data_preprocessing_numerical
import autosklearn.pipeline.components.data_preprocessing.data_preprocessing_categorical
# Fetch a dataset from OpenML and split it.
bunch = sklearn.datasets.fetch_openml(data_id=40981, as_frame=True)
y = bunch['target'].to_numpy()
X = bunch['data'].to_numpy(np.float64)  # np.float is removed in recent NumPy
X_train, X_test, y_train, y_test = \
    sklearn.model_selection.train_test_split(X, y, random_state=1)
feat_type = ['Categorical' if x.name == 'category' else 'Numerical' for x in bunch['data'].dtypes]

# Fit Auto-sklearn once and cache the result so reruns are cheap.
pickle_name = 'model.pkl'
if not os.path.exists(pickle_name):
    cls = autosklearn.estimators.AutoSklearnClassifier(time_left_for_this_task=60)
    cls.fit(X_train, y_train, feat_type=feat_type)
    with open(pickle_name, 'wb') as fh:
        pickle.dump(cls, fh)
else:
    with open(pickle_name, 'rb') as fh:
        cls = pickle.load(fh)
# Rebuild the ensemble as a scikit-learn VotingClassifier, skipping
# members with zero weight.
askl_ensemble = sklearn.ensemble.VotingClassifier(estimators=None, voting='soft')
weights = []
models = []
for weight, identifier in zip(list(cls.automl_.ensemble_.weights_),
                              list(cls.automl_.ensemble_.identifiers_)):
    if weight == 0.0:
        continue
    weights.append(weight)
    try:
        models.append(cls.automl_.models_[identifier])
    except KeyError:
        print(cls.automl_.ensemble_)
        print(cls.automl_.ensemble_.identifiers_)
        print(cls.automl_.models_)
        raise

# Inject the already-fitted models; the VotingClassifier is never fit() here.
askl_ensemble.estimators = models
askl_ensemble.estimators_ = models
askl_ensemble.weights = weights
askl_ensemble.le_ = sklearn.preprocessing.LabelEncoder().fit(y_train)
askl_ensemble.classes_ = askl_ensemble.le_.classes_
# print(askl_ensemble.predict(X_test))
# print(cls.predict(X_test))
# print(askl_ensemble.__repr__(N_CHAR_MAX=100000))
def extract_sklearn_object(obj):
    """Recursively replace auto-sklearn wrappers by the scikit-learn objects they hold."""
    if isinstance(obj, sklearn.ensemble.VotingClassifier):
        estimators = [extract_sklearn_object(estimator) for estimator in obj.estimators_]
        obj.estimators = estimators
        obj.estimators_ = estimators
        return obj
    elif isinstance(obj, autosklearn.pipeline.base.BasePipeline):
        steps = []
        for name, step in obj.steps:
            steps.append((name, extract_sklearn_object(step)))
        return sklearn.pipeline.Pipeline(
            steps=steps,
            memory=obj.memory,
            verbose=obj.verbose,
        )
    elif isinstance(obj, autosklearn.pipeline.components.data_preprocessing.data_preprocessing.DataPreprocessor):
        # TODO Make the auto-sklearn object an actual column transformer or make it a learnable
        # attribute column_transformer_
        column_transformer = obj.column_transformer
        transformers = []
        for name, trans, column in column_transformer.transformers_:
            transformers.append((name, extract_sklearn_object(trans), column))
        column_transformer = sklearn.compose.ColumnTransformer(
            transformers=transformers,
            remainder=column_transformer.remainder,
            sparse_threshold=column_transformer.sparse_threshold,
            n_jobs=column_transformer.n_jobs,
            transformer_weights=column_transformer.transformer_weights,
            verbose=column_transformer.verbose,
        )
        return column_transformer
    elif isinstance(obj, autosklearn.pipeline.components.base.AutoSklearnChoice):
        # TODO make choice a fit-recognizing attribute: obj.choice_
        return extract_sklearn_object(obj.choice)
    elif isinstance(obj, autosklearn.pipeline.components.data_preprocessing.balancing.balancing.Balancing):
        # TODO implement the actual behavior of weighting!!!
        # Balancing only influences fitting (via class/sample weights), so a
        # fitted pipeline can replace it with a no-op step; refitting the
        # export will not reproduce the weighting, though.
        return 'passthrough'
    elif isinstance(obj, autosklearn.pipeline.components.base.AutoSklearnPreprocessingAlgorithm):
        # TODO make preprocessor preprocessor_
        return obj.preprocessor
    elif isinstance(obj, autosklearn.pipeline.components.base.AutoSklearnClassificationAlgorithm):
        return obj.estimator
    else:
        raise TypeError(type(obj))
def verify_only_sklearn_objects(obj):
    """Recursively check that nothing from auto-sklearn remains in the model."""
    if (
        obj is None
        or isinstance(obj, (int, float, str))
        or isinstance(obj, types.FunctionType)
        or isinstance(obj, (np.random.RandomState, np.int32, np.int64, np.uint32, np.uint64,
                            np.void, np.float64, np.bool_))
        or obj in (np.float64, np.bool_)
    ):
        return
    elif isinstance(obj, (list, tuple, np.ndarray, set)):
        pass
    elif obj.__class__.__module__.startswith('sklearn.'):
        pass
    elif obj.__class__.__module__.startswith('autosklearn.pipeline.implementations.'):
        pass
    else:
        raise TypeError((type(obj), obj))

    # Recurse into attributes and containers.
    if hasattr(obj, '__dict__'):
        for key in vars(obj):
            verify_only_sklearn_objects(vars(obj)[key])
    elif isinstance(obj, (list, tuple, np.ndarray, set)):
        for entry in obj:
            verify_only_sklearn_objects(entry)
    elif obj.__class__.__module__.startswith('sklearn.'):
        # These are private sklearn objects
        pass
    else:
        raise TypeError((type(obj), obj))
# TODO what about the stuff from validation.py that's done prior to fitting?
# TODO add necessary imports! - also add the full class names
# TODO what about the random states? Set them as integers in auto-sklearn to be reproducible?
# TODO Improve the printing to be more readable
# TODO add a few tests that the export is done correctly
extracted_model = extract_sklearn_object(askl_ensemble)
verify_only_sklearn_objects(extracted_model)
print(extracted_model.__repr__(N_CHAR_MAX=1000000))
Most importantly, I think every component should know by itself how to convert itself to a scikit-learn object, instead of having all this information in a central function as shown here.
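For instance, here is a minimal sketch of that decentralized design, assuming a hypothetical to_sklearn() method on each wrapper class (the method name and class stubs are illustrative only):

import sklearn.pipeline

# Hypothetical pattern: every wrapper converts itself, and composite
# objects just delegate to their children.
class AutoSklearnClassificationAlgorithm:
    def to_sklearn(self):
        return self.estimator  # hand back the wrapped scikit-learn estimator

class AutoSklearnChoice:
    def to_sklearn(self):
        return self.choice.to_sklearn()  # delegate to the selected component

class BasePipeline:
    def to_sklearn(self):
        return sklearn.pipeline.Pipeline(
            steps=[(name, step.to_sklearn()) for name, step in self.steps],
        )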
Hey @GeorgePearse did you already get started? If not, I could have a look on Wednesday morning.
Hi @mfeurer, sorry for the radio silence. You go for it; I didn't really get anywhere. Looking forward to seeing the implementation though.
No worries, there's now a draft in #1375.
Hi @mfeurer, this feature is quite useful for us, as we'd ultimately like to use KServe to serve the auto-sklearn models. I took a look at the draft; will only the best model be considered? Or will there be a way to export the other models found during the trials?
Yes and no. This will add functionality to the AutoSklearnClassifier class that exports only the models that are part of the ensemble. But it will also add an export function to each individual model. And as @eddiebergman pointed out in #1376, he is working on a function to easily access all models stored on disk, so it will be possible.
Hi @mfeurer, thank you for working on this issue! This will make it possible to leverage the power of auto-sklearn in production pipelines that are built around scikit-learn pipelines. Looking forward to using this feature.
Hi @mfeurer, how can we use the to_sklearn() function in the latest version? I can't find this function inside AutoSklearnClassifier now.
Hey, I would like to contribute to this issue. Please assign this to me.