auto-sklearn
[Question] Understanding the models
I need help with the following questions:
**Names of the columns**
- I want to find the names of the columns that are finalized after preprocessing and before automl.fit(X, y) starts. For example, if my dataset has 100 columns, how many were removed during the preprocessing steps, and how many remained?
**Names of the columns that required imputation**
- Which columns required imputation, and how can I find their names?
Any help would be appreciated.
Hi @timzewing,
I'm not sure there's an easy way to get the final set of columns that your model will receive. Since I know from the other issue that you are using a custom model, you could just print the input out in that model's fit function.
In general, all categoricals will be imputed. We do one hot encoding generally for any column labelled as categorical or object.
Best, Eddie
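For the imputation question, one option that does not depend on auto-sklearn internals is to inspect the original DataFrame before calling fit. The sketch below is plain pandas and only an approximation of what the pipeline will see: it lists columns that contain missing values and columns with an object or category dtype. The toy frame and its column names are hypothetical stand-ins for the real X.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real X.
X = pd.DataFrame({
    "age": [25.0, np.nan, 40.0],
    "city": pd.Series(["Berlin", "Paris", None], dtype="object"),
    "income": [1.0, 2.0, 3.0],
})

# Columns with at least one missing value: candidates for imputation.
cols_with_missing = X.columns[X.isna().any()].tolist()

# Columns with an object or category dtype: candidates for categorical handling.
categorical_cols = X.select_dtypes(include=["object", "category"]).columns.tolist()

print(cols_with_missing)
print(categorical_cols)
```

This only describes the raw input; it says nothing about which columns a given pipeline in the ensemble later drops or transforms.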
Hi @atifinqline,
Apologies, I mixed this up with a different issue in which a user provided a custom model. My bad.
In general it's quite hard to categorize what happens for different models as we produce full pipelines which may have different feature preprocessors.
You can use the show_models() function to understand what the components of each pipeline in the final ensemble are. From there you can access a pipeline and manually pass data through each step as required. You will, however, not get column names, as we convert to numpy quite early on for performance and legacy reasons; that numpy array is the final input to the sklearn model and the component steps.
One thing we don't include in all of those is the InputValidator, which is fit on your original data. You can access this post-fit with estimator.automl_.InputValidator.
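```python
from autosklearn.classification import AutoSklearnClassifier

clf = AutoSklearnClassifier(...)
clf.fit(X, y)

# Dictionary describing every pipeline in the final ensemble
ensemble_dict = clf.show_models()
# Validator that was fit on the original input data
input_validator = clf.automl_.InputValidator

print(ensemble_dict)

# Pick one pipeline from the ensemble
a_pipeline = ensemble_dict[...]

# Validated data, as it is passed on to the pipeline
Xt = input_validator.transform(X)

# Individual steps of the chosen pipeline
data_preprocessor = a_pipeline["data_preprocessor"]
balancer = a_pipeline["balancing"]
feature_preprocessor = a_pipeline["feature_preprocessor"]
classifier = a_pipeline["classifier"]
sklearn_classifier = a_pipeline["sklearn_classifier"]
```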
These will be slightly different if you are using cv resampling, in which case you will need to refer to the documentation and experiment with it a bit.
Best, Eddie
@eddiebergman Do you plan to add functionality in the future that lets us see the names of the columns the model was trained on?
Hi @timzewing,
That would be a nice feature. It would mean reworking some of the information that gets passed around the pipeline, because as soon as we drop to a pure numpy array, this information gets lost. It makes me think we essentially need a pure pandas pipeline right up until the moment it hits the sklearn model.
I will be back to working on auto-sklearn in October and will keep this in mind; I would like to configure the pipeline components accordingly.
Best, Eddie
We could probably achieve this with the function get_feature_names_out, but what would we do if we actually construct extra features?