
[Question] How to get the pre-processed data used by `auto-sklearn` to train a model?

Open rraadd88 opened this issue 8 months ago • 0 comments

I would like to get the pre-processed data that was used to train a model.

How did this question come about?

The preprocessed data could be used, for example, to calculate summary statistics and compare them with those of the untransformed data or of data preprocessed with different methods.

Would a small code snippet help?

This question concerns a standard use of the AutoSklearnClassifier class, based on the example given in the docs. Here's a snippet anyway:

import sklearn.datasets
import sklearn.metrics
import sklearn.model_selection  # needed for train_test_split below
import autosklearn.classification

X, y = sklearn.datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    X, y, random_state=1
)
automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=120,
    per_run_time_limit=30,
    tmp_folder="/tmp/autosklearn_classification_example_tmp",
)
automl.fit(X_train, y_train, dataset_name="breast_cancer")
# Get the configuration of the first run recorded in the run history
run_key = list(automl.automl_.runhistory_.data.keys())[0]
run_value = automl.automl_.runhistory_.data[run_key]
config = automl.automl_.runhistory_.ids_config[run_key.config_id]
print(config)
Configuration(values={
  'balancing:strategy': 'weighting',
  'classifier:__choice__': 'gradient_boosting',
  'classifier:gradient_boosting:early_stop': 'off',
  'classifier:gradient_boosting:l2_regularization': 0.5536468700597662,
  'classifier:gradient_boosting:learning_rate': 0.023910336277631047,
  'classifier:gradient_boosting:loss': 'auto',
  'classifier:gradient_boosting:max_bins': 255,
  'classifier:gradient_boosting:max_depth': 'None',
  'classifier:gradient_boosting:max_leaf_nodes': 12,
  'classifier:gradient_boosting:min_samples_leaf': 4,
  'classifier:gradient_boosting:scoring': 'loss',
  'classifier:gradient_boosting:tol': 1e-07,
  'data_preprocessor:__choice__': 'feature_type',
  'data_preprocessor:feature_type:numerical_transformer:imputation:strategy': 'most_frequent',
  'data_preprocessor:feature_type:numerical_transformer:rescaling:__choice__': 'standardize',
  'feature_preprocessor:__choice__': 'pca',
})
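
The preprocessing-related entries in a configuration like the one above can be picked out by key prefix. The following is a minimal sketch that works on a plain dictionary copied from the printed output; `filter_preprocessing_keys` is a hypothetical helper, not part of auto-sklearn, and it assumes that preprocessing keys always start with `data_preprocessor:` or `feature_preprocessor:`.

```python
# Hypothetical helper, not part of auto-sklearn: keep only the configuration
# entries whose keys belong to a preprocessing step.
PREPROCESSING_PREFIXES = ("data_preprocessor:", "feature_preprocessor:")


def filter_preprocessing_keys(config_values):
    """Return a new dict with only the preprocessing-related entries."""
    return {
        key: value
        for key, value in config_values.items()
        if key.startswith(PREPROCESSING_PREFIXES)
    }


# A subset of the values printed above, copied into a plain dictionary.
config_values = {
    "balancing:strategy": "weighting",
    "classifier:__choice__": "gradient_boosting",
    "classifier:gradient_boosting:learning_rate": 0.023910336277631047,
    "data_preprocessor:__choice__": "feature_type",
    "data_preprocessor:feature_type:numerical_transformer:imputation:strategy": "most_frequent",
    "data_preprocessor:feature_type:numerical_transformer:rescaling:__choice__": "standardize",
    "feature_preprocessor:__choice__": "pca",
}

preprocessing_values = filter_preprocessing_keys(config_values)
for key, value in preprocessing_values.items():
    print(f"{key} = {value}")
```

With a real Configuration object, the values would first need to be converted to a dictionary; how to do that depends on the installed ConfigSpace version, so it is left out here.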

What have you already looked at?

I have already looked at

  1. The Documentation, Examples and Issues, but I could not find any direct solution there.
  2. I looked into the smac3-output and .auto-sklearn folders inside the tmp_folder, but could not find any files containing preprocessed data or relevant information that I could use to get the preprocessed data.
  3. I tried filtering the configuration and passing it to the AutoSklearnPreprocessingAlgorithm class, for which I repeatedly got `NotImplementedError`.
  4. I tried creating a custom sklearn Pipeline using the pre-processing components from autosklearn, e.g. rescaling from the data_preprocessing module, but I found that this approach was not directly compatible with the configuration requirements of auto-sklearn.

Suggestion

A couple of functions could be implemented to (1) filter the configuration for a fitted model to keep only the keys related to the pre-processing steps, and then (2) run the corresponding steps to get the preprocessed data. For example, the code could look like this:

# Note: dummy code
import autosklearn.preprocessing

## filtered configuration
config_preprocessing=autosklearn.preprocessing.get_config_preprocessing(config)
print(config_preprocessing)
# Configuration(values={
#  'data_preprocessor:__choice__': 'feature_type',
#  'data_preprocessor:feature_type:numerical_transformer:imputation:strategy': 'most_frequent',
#  'data_preprocessor:feature_type:numerical_transformer:rescaling:__choice__': 'standardize',
#  'feature_preprocessor:__choice__': 'pca',
#})

## get the preprocessed data
X_preprocessed=autosklearn.preprocessing.fit_transform(
  X=X,
  configuration=config_preprocessing,
)

This is just a suggestion. If there is any other way of obtaining the pre-processed data, please let me know.
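
As a rough workaround outside auto-sklearn, the three preprocessing choices named in the configuration (most-frequent imputation, standardization, PCA) can be approximated with plain scikit-learn transformers. This is only a sketch under assumptions: it does not reuse auto-sklearn's internal pipeline, the step names are made up for illustration, and PCA runs with scikit-learn defaults because the printed configuration lists no PCA hyperparameters. The result may therefore differ from what auto-sklearn actually fed to the model.

```python
import numpy as np
import sklearn.datasets
import sklearn.model_selection
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = sklearn.datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    X, y, random_state=1
)

# Approximation of the configured steps using scikit-learn's own transformers.
# Step names are illustrative; PCA uses defaults (assumption, see above).
preprocessing = Pipeline(
    steps=[
        ("imputation", SimpleImputer(strategy="most_frequent")),
        ("rescaling", StandardScaler()),
        ("pca", PCA()),
    ]
)

# Fit on the training split only, then apply the same transform to the test split.
X_train_pre = preprocessing.fit_transform(X_train)
X_test_pre = preprocessing.transform(X_test)

# Compare summary statistics of the first three columns before and after.
print("raw mean:          ", np.round(X_train.mean(axis=0)[:3], 3))
print("preprocessed mean: ", np.round(X_train_pre.mean(axis=0)[:3], 3))
print("raw std:           ", np.round(X_train.std(axis=0)[:3], 3))
print("preprocessed std:  ", np.round(X_train_pre.std(axis=0)[:3], 3))
```

Fitting the transformers on the training split alone keeps the test split from influencing the imputation values, scaling parameters, and principal components.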

System Details (if relevant)

  • Version of auto-sklearn: 0.15.0
  • Running this on Linux.

rraadd88, Oct 07 '23 22:10