auto-sklearn
[Question] How to get the pre-processed data used by `auto-sklearn` to train a model?
I would like to get the pre-processed data that was used to train a model.
How did this question come about?
The preprocessed data could be used to, for example, calculate its summary statistics and then compare with the un-transformed data or with the data preprocessed with different methods.
Would a small code snippet help?
This question applies to a standard use of the `AutoSklearnClassifier` class, based on the example given in the docs.
Here's a snippet anyways:
import sklearn.datasets
import sklearn.metrics
import sklearn.model_selection  # needed for train_test_split below
import autosklearn.classification

X, y = sklearn.datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    X, y, random_state=1
)

automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=120,
    per_run_time_limit=30,
    tmp_folder="/tmp/autosklearn_classification_example_tmp",
)
automl.fit(X_train, y_train, dataset_name="breast_cancer")

# Get the configuration for the first run in the run history
run_key = list(automl.automl_.runhistory_.data.keys())[0]
run_value = automl.automl_.runhistory_.data[run_key]
config = automl.automl_.runhistory_.ids_config[run_key.config_id]
print(config)
Configuration(values={
'balancing:strategy': 'weighting',
'classifier:__choice__': 'gradient_boosting',
'classifier:gradient_boosting:early_stop': 'off',
'classifier:gradient_boosting:l2_regularization': 0.5536468700597662,
'classifier:gradient_boosting:learning_rate': 0.023910336277631047,
'classifier:gradient_boosting:loss': 'auto',
'classifier:gradient_boosting:max_bins': 255,
'classifier:gradient_boosting:max_depth': 'None',
'classifier:gradient_boosting:max_leaf_nodes': 12,
'classifier:gradient_boosting:min_samples_leaf': 4,
'classifier:gradient_boosting:scoring': 'loss',
'classifier:gradient_boosting:tol': 1e-07,
'data_preprocessor:__choice__': 'feature_type',
'data_preprocessor:feature_type:numerical_transformer:imputation:strategy': 'most_frequent',
'data_preprocessor:feature_type:numerical_transformer:rescaling:__choice__': 'standardize',
'feature_preprocessor:__choice__': 'pca',
})
What have you already looked at?
I have already looked at
- The Documentation, Examples and Issues, but I could not find any direct solution there.
- I looked into the `tmp_folder`, `smac3-output` and `.auto-sklearn` folders, but could not find any files containing preprocessed data or relevant information that I could use to get the preprocessed data.
- I tried filtering the configuration and providing it to the `AutoSklearnPreprocessingAlgorithm` class, for which I repeatedly got `Not implemented` errors.
- I tried creating a custom sklearn `Pipeline` using the pre-processing functions from autosklearn, e.g. rescaling from the `data_preprocessing` module, but I found that this approach was not directly compatible with the configuration requirements of `auto-sklearn`.
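To make the last approach above concrete, here is a minimal sketch of a hand-built pipeline that mirrors the printed configuration using plain scikit-learn components only. It is an approximation, not what `auto-sklearn` runs internally: the step names are made up for illustration, `PCA()` is left at its defaults because the printed configuration does not show its hyperparameters, and any extra steps that `auto-sklearn` applies are not reproduced.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Hand-written stand-in for the preprocessing part of the configuration:
#   imputation strategy 'most_frequent', rescaling 'standardize',
#   feature preprocessor 'pca' (default hyperparameters: an assumption).
preprocessing = Pipeline(
    steps=[
        ("imputation", SimpleImputer(strategy="most_frequent")),
        ("rescaling", StandardScaler()),
        ("feature_preprocessor", PCA()),
    ]
)

# Fit on the training split only, then reuse the fitted steps on the test split.
X_train_preprocessed = preprocessing.fit_transform(X_train)
X_test_preprocessed = preprocessing.transform(X_test)

# Compare summary statistics before and after preprocessing.
print("raw mean (first 3 features):         ", X_train.mean(axis=0)[:3])
print("preprocessed mean (first 3 columns): ", X_train_preprocessed.mean(axis=0)[:3])
```

The drawback is the one described above: the pipeline has to be written by hand for each configuration, so it cannot be driven directly by the `Configuration` object of a fitted model.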
Suggestion
A couple of functions could be implemented to (1) filter the configuration for a fitted model to keep only the keys related to the pre-processing steps, and then (2) run the corresponding steps to get the preprocessed data. For example, the code could look like this:
# Note: dummy code
import autosklearn.preprocessing
## filtered configuration
config_preprocessing=autosklearn.preprocessing.get_config_preprocessing(config)
print(config_preprocessing)
# Configuration(values={
# 'data_preprocessor:__choice__': 'feature_type',
# 'data_preprocessor:feature_type:numerical_transformer:imputation:strategy': 'most_frequent',
# 'data_preprocessor:feature_type:numerical_transformer:rescaling:__choice__': 'standardize',
# 'feature_preprocessor:__choice__': 'pca',
#})
## get the preprocessed data
X_preprocessed = autosklearn.preprocessing.fit_transform(
    X=X,
    configuration=config_preprocessing,
)
This is just a suggestion. If there is any other way of obtaining the pre-processed data, please let me know.
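Step (1) of the suggestion can be sketched without any new library function, assuming the configuration values are available as a plain dictionary (for a ConfigSpace `Configuration` I believe `get_dictionary()` provides this in the versions used by `auto-sklearn` 0.15, but treat that as an assumption). The helper name `filter_preprocessing_keys` and the prefix tuple are hypothetical, chosen only for this illustration.

```python
# Key prefixes that mark preprocessing-related hyperparameters in the
# configuration printed above (assumed from that output, not from the docs).
PREPROCESSING_PREFIXES = ("data_preprocessor:", "feature_preprocessor:")


def filter_preprocessing_keys(config_values):
    """Return only the entries whose key starts with a preprocessing prefix."""
    return {
        key: value
        for key, value in config_values.items()
        if key.startswith(PREPROCESSING_PREFIXES)
    }


# A subset of the configuration shown earlier, written out as a plain dict.
config_values = {
    "balancing:strategy": "weighting",
    "classifier:__choice__": "gradient_boosting",
    "classifier:gradient_boosting:learning_rate": 0.023910336277631047,
    "data_preprocessor:__choice__": "feature_type",
    "data_preprocessor:feature_type:numerical_transformer:imputation:strategy": "most_frequent",
    "data_preprocessor:feature_type:numerical_transformer:rescaling:__choice__": "standardize",
    "feature_preprocessor:__choice__": "pca",
}

config_preprocessing = filter_preprocessing_keys(config_values)
for key, value in config_preprocessing.items():
    print(f"{key}: {value}")
```

This only covers the filtering; step (2), running the corresponding preprocessing steps from the filtered configuration, is the part for which I could not find a supported entry point.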
System Details (if relevant)
- Version of `auto-sklearn`: 0.15.0
- Running this on Linux.