Add model export to sklearn
Fixes #388 by adding a function `to_sklearn`. This is a first draft of the interface, and I'd be very happy about feedback.
TODOs
- [ ] unit tests
- [ ] test regression
- [ ] test cross-validation
- [ ] test the label encoder in the ensemble
- [ ] test printing in jupyter notebooks
- [ ] test that fit() can be called on the extracted sklearn object and on an sklearn object produced by calling the generated code
- [ ] test what happens with a custom component (especially if `to_sklearn` is not implemented)
- [x] update single best ensemble
- [ ] update examples
- [ ] infer additional column transformer arguments via inspection
- [ ] infer additional pipeline arguments via inspection
- [ ] check mypy ignores added
- [ ] check open TODOs in the example below
- [ ] use the scikit-learn estimator checks (see the sketch after this list)
- [ ] document this and the idea of what to export
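For the estimator-checks item above, a minimal sketch of what such a test could look like, assuming `to_sklearn()` returns a plain fitted sklearn estimator as proposed here; whether exported ensembles actually pass the full check suite is exactly what that TODO is meant to find out:

```python
from sklearn.utils.estimator_checks import check_estimator

import sklearn.datasets
import sklearn.model_selection
import autosklearn.estimators

X, y = sklearn.datasets.load_iris(return_X_y=True)
X_train, _, y_train, _ = sklearn.model_selection.train_test_split(X, y, random_state=1)

automl = autosklearn.estimators.AutoSklearnClassifier(time_left_for_this_task=60)
automl.fit(X_train, y_train)

# check_estimator() clones the estimator and re-fits it on small synthetic
# datasets, so this also exercises the "fit() can be called on the
# extracted sklearn object" item from the list above.
check_estimator(automl.to_sklearn())
```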
Example
```python
import os
import pickle
import types

import numpy as np
import sklearn.datasets
import sklearn.model_selection

import autosklearn.estimators

X, y = sklearn.datasets.fetch_openml(data_id=40981, as_frame=True, return_X_y=True)
X_train, X_test, y_train, y_test = \
    sklearn.model_selection.train_test_split(X, y, random_state=1)

# Cache the fitted model so re-running the script skips the search.
pickle_name = 'model.pkl'
if not os.path.exists(pickle_name):
    cls = autosklearn.estimators.AutoSklearnClassifier(time_left_for_this_task=60)
    cls.fit(X_train, y_train)
    with open(pickle_name, 'wb') as fh:
        pickle.dump(cls, fh)
else:
    with open(pickle_name, 'rb') as fh:
        cls = pickle.load(fh)
cls.predict(X_test)


def verify_only_sklearn_objects(obj):
    # Recursively walk the object graph and raise if anything is found that
    # is not a builtin, numpy, sklearn, or
    # autosklearn.pipeline.implementations object.
    # print(obj, type(obj), flush=True)
    if (
        obj is None
        or isinstance(obj, (int, float, str))
        or isinstance(obj, types.FunctionType)
        or isinstance(obj, (np.random.RandomState, np.int32, np.int64,
                            np.uint8, np.uint32, np.uint64,
                            np.void, np.float64, np.bool_))
        or obj in (np.float64, np.bool_)
    ):
        return
    elif isinstance(obj, (list, tuple, np.ndarray, set)):
        pass
    elif obj.__class__.__module__.startswith('sklearn.'):
        pass
    elif obj.__class__.__module__.startswith('autosklearn.pipeline.implementations.'):
        pass
    else:
        raise TypeError((type(obj), obj))

    if hasattr(obj, '__dict__'):
        for key in vars(obj):
            verify_only_sklearn_objects(vars(obj)[key])
    elif isinstance(obj, (list, tuple, np.ndarray, set)):
        for entry in obj:
            verify_only_sklearn_objects(entry)
    elif obj.__class__.__module__.startswith('sklearn.'):
        # These are private sklearn objects
        pass
    else:
        raise TypeError((type(obj), obj))


# TODO what about the stuff from validation.py that's done prior to fitting?
# TODO add necessary imports! - also add the full class names
# TODO what about the random states? Set them as integers in auto-sklearn to be reproducible?
# TODO Improve the printing to be more readable
# TODO add a few tests that the export is done correctly
extracted_model = cls.to_sklearn()
verify_only_sklearn_objects(extracted_model)
print(extracted_model.__repr__(N_CHAR_MAX=100000))
```
Output
VotingClassifier(estimators=[Pipeline(steps=[('data_preprocessor',
ColumnTransformer(sparse_threshold=0.0,
transformers=[('categorical_transformer',
Pipeline(steps=[('imputation',
SimpleImputer(copy=False,
strategy='constant')),
('encoding',
OrdinalEncoder(handle_unknown='use_encoded_value',
unknown_value=-1)),
('category_shift',
CategoryShift()),
('category_coalescence',
'passthrough'),
('categorical_encoding',
OneHotEncoder(handle_unknown='ignore',
sparse=False))]),
['A1',
'A4',
'A5',
'A6',
'A8',
'A9',
'A11',
'A12']),
('numerical_transformer',
Pipeline(steps=[('imputation',
SimpleImputer(copy=False,
strategy='most_frequent')),
('variance_threshold',
VarianceThreshold()),
('rescaling',
StandardScaler(copy=False))]),
['A2',
'A3',
'A7',
'A10',
'A13',
'A14'])])),
('balancing', None),
('feature_preprocessor',
SelectFromModel(estimator=ExtraTreesClassifier(class_weight='balanced',
max_features=12,
min_samples_split=16,
n_jobs=1,
random_state=1),
prefit=True,
threshold='mean')),
('classifier',
MLPClassifier(alpha=0.003989533567739603,
beta_1=0.999,
beta_2=0.9,
early_stopping=True,
hidden_layer_sizes=(264,
264),
learning_rate_init=0.0009934511776384044,
max_iter=32,
n_iter_no_change=32,
random_state=1,
verbose=0,
warm_start=True))]),
Pipeline(steps=[('data_preprocessor',
ColumnTransformer(sparse_threshold=0.0,
transformers=[('categorical_transformer',
Pipeline(steps=[('imputation',
SimpleImputer(copy=False,
strategy='constant')),
('encoding',
OrdinalEncoder(handle_unknown='use_encoded_value',
unknown_value=-1)),
('category_shift',
CategoryShift()),
('category_coalescence',
MinorityCoalescer(minimum_fraction=0.01)),
('categorical_encoding',
OneHotEncoder(handle_unknown='ignore',
sparse=False))]),
['A1',
'A4',
'A5',
'A6',
'A8',
'A9',
'A11',
'A12']),
('numerical_transformer',
Pipeline(steps=[('imputation',
SimpleImputer(copy=False)),
('variance_threshold',
VarianceThreshold()),
('rescaling',
StandardScaler(copy=False))]),
['A2',
'A3',
'A7',
'A10',
'A13',
'A14'])])),
('balancing', None),
('feature_preprocessor',
'passthrough'),
('classifier',
RandomForestClassifier(max_features=6,
n_estimators=512,
n_jobs=1,
random_state=1,
warm_start=True))]),
Pipeline(steps=[('data_preprocessor',
ColumnTransformer(sparse_threshold=0.0,
transformers=[('categorical_transformer',
Pipeline(steps=[('imputation',
SimpleImputer(copy=False,
strategy='constant')),
('encoding',
OrdinalEncoder(handle_unknown='use_encoded_value',
unknown_value=-1)),
('category_shift',
CategoryShift()),
('category_coalescence',
MinorityCoalescer(minimum_fraction=0.05897357701860171)),
('categorical_encoding',
OneHotEncoder(handle_unknown='ignore',
sparse=False))]),
['A1',
'A4',
'A5',
'A6',
'A8',
'A9',
'A11',
'A12']),
('numerical_transformer',
Pipeline(steps=[('imputation',
SimpleImputer(copy=False)),
('variance_threshold',
VarianceThreshold()),
('rescaling',
Normalizer(copy=False))]),
['A2',
'A3',
'A7',
'A10',
'A13',
'A14'])])),
('balancing', None),
('feature_preprocessor',
PolynomialFeatures(degree=3,
include_bias=False)),
('classifier',
AdaBoostClassifier(algorithm='SAMME',
base_estimator=DecisionTreeClassifier(max_depth=2),
learning_rate=0.13167493237005792,
n_estimators=56,
random_state=1))]),
Pipeline(steps=[('data_preprocessor',
ColumnTransformer(sparse_threshold=0.0,
transformers=[('categorical_transformer',
Pipeline(steps=[('imputation',
SimpleImputer(copy=False,
strategy='constant')),
('encoding',
OrdinalEncoder(handle_unknown='use_encoded_value',
unknown_value=-1)),
('category_shift',
CategoryShift()),
('category_coalescence',
'passthrough'),
('categorical_encoding',
'passthrough')]),
['A1',
'A4',
'A5',
'A6',
'A8',
'A9',
'A11',
'A12']),
('numerical_transformer',
Pipeline(steps=[('imputation',
SimpleImputer(copy=False,
strategy='median')),
('variance_threshold',
VarianceThreshold()),
('rescaling',
QuantileTransformer(copy=False,
n_quantiles=937))]),
['A2',
'A3',
'A7',
'A10',
'A13',
'A14'])])),
('balancing', None),
('feature_preprocessor',
RandomTreesEmbedding(n_estimators=10,
n_jobs=1,
random_state=1)),
('classifier',
RandomForestClassifier(criterion='entropy',
max_features=16,
n_estimators=512,
n_jobs=1,
random_state=1,
warm_start=True))]),
Pipeline(steps=[('data_preprocessor',
ColumnTransformer(sparse_threshold=0.0,
transformers=[('categorical_transformer',
Pipeline(steps=[('imputation',
SimpleImputer(copy=False,
strategy='constant')),
('encoding',
OrdinalEncoder(handle_unknown='use_encoded_value',
unknown_value=-1)),
('category_shift',
CategoryShift()),
('category_coalescence',
MinorityCoalescer(minimum_fraction=0.11533421526707399)),
('categorical_encoding',
OneHotEncoder(handle_unknown='ignore',
sparse=False))]),
['A1',
'A4',
'A5',
'A6',
'A8',
'A9',
'A11',
'A12']),
('numerical_transformer',
Pipeline(steps=[('imputation',
SimpleImputer(copy=False,
strategy='median')),
('variance_threshold',
VarianceThreshold()),
('rescaling',
MinMaxScaler(copy=False))]),
['A2',
'A3',
'A7',
'A10',
'A13',
'A14'])])),
('balancing', None),
('feature_preprocessor',
SelectFromModel(estimator=ExtraTreesClassifier(class_weight='balanced',
max_features=15,
min_samples_leaf=9,
n_jobs=1,
random_state=1),
prefit=True,
threshold='mean')),
('classifier',
RandomForestClassifier(criterion='entropy',
max_features=1,
n_estimators=512,
n_jobs=1,
random_state=1,
warm_start=True))]),
Pipeline(steps=[('data_preprocessor',
ColumnTransformer(sparse_threshold=0.0,
transformers=[('categorical_transformer',
Pipeline(steps=[('imputation',
SimpleImputer(copy=False,
strategy='constant')),
('encoding',
OrdinalEncoder(handle_unknown='use_encoded_value',
unknown_value=-1)),
('category_shift',
CategoryShift()),
('category_coalescence',
'passthrough'),
('categorical_encoding',
'passthrough')]),
['A1',
'A4',
'A5',
'A6',
'A8',
'A9',
'A11',
'A12']),
('numerical_transformer',
Pipeline(steps=[('imputation',
SimpleImputer(copy=False,
strategy='median')),
('variance_threshold',
VarianceThreshold()),
('rescaling',
MinMaxScaler(copy=False))]),
['A2',
'A3',
'A7',
'A10',
'A13',
'A14'])])),
('balancing', None),
('feature_preprocessor',
SelectFromModel(estimator=LinearSVC(C=13.550960330919455,
dual=False,
intercept_scaling=1.0,
penalty='l1',
random_state=1,
tol=1.2958033930435781e-05),
prefit=True,
threshold='mean')),
('classifier',
HistGradientBoostingClassifier(early_stopping=True,
l2_regularization=0.005326508887463406,
learning_rate=0.060800813211425456,
max_iter=512,
max_leaf_nodes=6,
min_samples_leaf=5,
n_iter_no_change=5,
random_state=1,
validation_fraction=None,
warm_start=True))]),
Pipeline(steps=[('data_preprocessor',
ColumnTransformer(sparse_threshold=0.0,
transformers=[('categorical_transformer',
Pipeline(steps=[('imputation',
SimpleImputer(copy=False,
strategy='constant')),
('encoding',
OrdinalEncoder(handle_unknown='use_encoded_value',
unknown_value=-1)),
('category_shift',
CategoryShift()),
('category_coalescence',
MinorityCoalescer(minimum_fraction=0.41826215858914706)),
('categorical_encoding',
'passthrough')]),
['A1',
'A4',
'A5',
'A6',
'A8',
'A9',
'A11',
'A12']),
('numerical_transformer',
Pipeline(steps=[('imputation',
SimpleImputer(copy=False,
strategy='median')),
('variance_threshold',
VarianceThreshold()),
('rescaling',
RobustScaler(copy=False,
quantile_range=(0.25595970768123566,
0.7305615609807856)))]),
['A2',
'A3',
'A7',
'A10',
'A13',
'A14'])])),
('balancing', None),
('feature_preprocessor',
PolynomialFeatures(interaction_only=True)),
('classifier',
ExtraTreesClassifier(criterion='entropy',
max_features=102,
min_samples_leaf=2,
min_samples_split=20,
n_estimators=512,
n_jobs=1,
random_state=1,
warm_start=True))]),
Pipeline(steps=[('data_preprocessor',
ColumnTransformer(sparse_threshold=0.0,
transformers=[('categorical_transformer',
Pipeline(steps=[('imputation',
SimpleImputer(copy=False,
strategy='constant')),
('encoding',
OrdinalEncoder(handle_unknown='use_encoded_value',
unknown_value=-1)),
('category_shift',
CategoryShift()),
('category_coalescence',
MinorityCoalescer(minimum_fraction=0.017116661677715188)),
('categorical_encoding',
'passthrough')]),
['A1',
'A4',
'A5',
'A6',
'A8',
'A9',
'A11',
'A12']),
('numerical_transformer',
Pipeline(steps=[('imputation',
SimpleImputer(copy=False)),
('variance_threshold',
VarianceThreshold()),
('rescaling',
StandardScaler(copy=False))]),
['A2',
'A3',
'A7',
'A10',
'A13',
'A14'])])),
('balancing', None),
('feature_preprocessor',
SelectFromModel(estimator=ExtraTreesClassifier(criterion='entropy',
max_features=6,
min_samples_leaf=4,
min_samples_split=17,
n_jobs=1,
random_state=1),
prefit=True,
threshold='mean')),
('classifier',
MLPClassifier(activation='tanh',
alpha=2.5550223982458062e-06,
beta_1=0.999,
beta_2=0.9,
hidden_layer_sizes=(54,
54,
54),
learning_rate_init=0.00027271287919467994,
max_iter=128,
n_iter_no_change=32,
random_state=1,
validation_fraction=0.0,
verbose=0,
warm_start=True))]),
Pipeline(steps=[('data_preprocessor',
ColumnTransformer(sparse_threshold=0.0,
transformers=[('categorical_transformer',
Pipeline(steps=[('imputation',
SimpleImputer(copy=False,
strategy='constant')),
('encoding',
OrdinalEncoder(handle_unknown='use_encoded_value',
unknown_value=-1)),
('category_shift',
CategoryShift()),
('category_coalescence',
'passthrough'),
('categorical_encoding',
OneHotEncoder(handle_unknown='ignore',
sparse=False))]),
['A1',
'A4',
'A5',
'A6',
'A8',
'A9',
'A11',
'A12']),
('numerical_transformer',
Pipeline(steps=[('imputation',
SimpleImputer(copy=False,
strategy='most_frequent')),
('variance_threshold',
VarianceThreshold()),
('rescaling',
StandardScaler(copy=False))]),
['A2',
'A3',
'A7',
'A10',
'A13',
'A14'])])),
('balancing', None),
('feature_preprocessor',
FeatureAgglomeration(linkage='complete',
n_clusters=42,
pooling_func=<function amax at 0x7f358248df70>)),
('classifier',
MLPClassifier(activation='tanh',
alpha=0.00021148999718383549,
beta_1=0.999,
beta_2=0.9,
hidden_layer_sizes=(113,
113,
113),
learning_rate_init=0.0007452270241186694,
max_iter=256,
n_iter_no_change=32,
random_state=1,
validation_fraction=0.0,
verbose=0,
warm_start=True))]),
Pipeline(steps=[('data_preprocessor',
ColumnTransformer(sparse_threshold=0.0,
transformers=[('categorical_transformer',
Pipeline(steps=[('imputation',
SimpleImputer(copy=False,
strategy='constant')),
('encoding',
OrdinalEncoder(handle_unknown='use_encoded_value',
unknown_value=-1)),
('category_shift',
CategoryShift()),
('category_coalescence',
MinorityCoalescer(minimum_fraction=0.002102242030216922)),
('categorical_encoding',
OneHotEncoder(handle_unknown='ignore',
sparse=False))]),
['A1',
'A4',
'A5',
'A6',
'A8',
'A9',
'A11',
'A12']),
('numerical_transformer',
Pipeline(steps=[('imputation',
SimpleImputer(copy=False)),
('variance_threshold',
VarianceThreshold()),
('rescaling',
RobustScaler(copy=False,
quantile_range=(0.280953821785477,
0.7697572103377026)))]),
['A2',
'A3',
'A7',
'A10',
'A13',
'A14'])])),
('balancing', None),
('feature_preprocessor',
SelectFromModel(estimator=ExtraTreesClassifier(criterion='entropy',
max_features=4,
min_samples_leaf=4,
min_samples_split=7,
n_jobs=1,
random_state=1),
prefit=True,
threshold='mean')),
('classifier',
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=3),
learning_rate=0.046269426995092074,
n_estimators=406,
random_state=1))])],
weights=[0.08, 0.14, 0.16, 0.26, 0.04, 0.06, 0.02, 0.04, 0.02,
0.18])
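One concrete test for the "export is done correctly" TODO above: the exported ensemble should predict the same labels as the original `AutoSklearnClassifier`. A minimal sketch, reusing `cls`, `X_test`, and `extracted_model` from the example, and assuming the label encoding and ensemble weights round-trip through `to_sklearn()`:

```python
import numpy as np

# Hard labels should match exactly if the ensemble's label encoder
# round-trips correctly (one of the open TODO items above).
np.testing.assert_array_equal(
    cls.predict(X_test),
    extracted_model.predict(X_test),
)

# Class probabilities may differ by floating-point noise between the two
# code paths, so compare with a tolerance rather than exactly.
np.testing.assert_allclose(
    cls.predict_proba(X_test),
    extracted_model.predict_proba(X_test),
    atol=1e-6,
)
```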
Codecov Report
Merging #1375 (e269b62) into development (a9fbd5c) will decrease coverage by 0.37%. The diff coverage is 46.25%.
```diff
@@             Coverage Diff              @@
##           development    #1375    +/-  ##
===============================================
- Coverage        88.07%   87.69%   -0.38%
===============================================
  Files              140      140
  Lines            10993    11048      +55
===============================================
+ Hits              9682     9689       +7
- Misses            1311     1359      +48
```
| Impacted Files | Coverage Δ | |
|---|---|---|
| autosklearn/ensembles/ensemble_selection.py | 63.31% <22.72%> (-6.29%) | :arrow_down: |
| autosklearn/ensembles/singlebest_ensemble.py | 82.35% <33.33%> (-13.95%) | :arrow_down: |
| ...line/components/data_preprocessing/feature_type.py | 85.71% <40.00%> (-2.41%) | :arrow_down: |
| autosklearn/pipeline/base.py | 86.99% <50.00%> (-0.68%) | :arrow_down: |
| ...mponents/data_preprocessing/balancing/balancing.py | 85.50% <50.00%> (-1.06%) | :arrow_down: |
| autosklearn/automl.py | 87.76% <57.14%> (-0.29%) | :arrow_down: |
| autosklearn/pipeline/components/base.py | 77.27% <64.28%> (-1.52%) | :arrow_down: |
| autosklearn/estimators.py | 93.39% <66.66%> (-0.39%) | :arrow_down: |
| autosklearn/ensembles/abstract_ensemble.py | 88.88% <87.50%> (ø) | |
| ...ponents/feature_preprocessing/select_percentile.py | 84.61% <0.00%> (-7.70%) | :arrow_down: |
| ... and 4 more | | |
Continue to review full report at Codecov.
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data.
Powered by Codecov. Last update a9fbd5c...e269b62. Read the comment docs.
Hi, when will this be available? I use auto-sklearn in production with KServe, but auto-sklearn is not supported there.