PDPbox
Generating plots with sklearn Pipeline objects
Thanks for creating such a tool for partial dependence plots in Python. I did find an issue, though. In my project, the trained model is wrapped in a pipeline. Incoming data has a handful of features, and the categorical ones are one-hot encoded by preprocessors inside the pipeline object. PDPbox works fine when I pass the pipeline and a numerical feature that is available in the test dataframe. However, things get interesting when I try to plot a one-hot encoded categorical feature...
- Cannot pass the original dataframe and the list of one-hot encoded feature names: the feature names are not found in the dataframe.
- Cannot pass the transformed data (obtained by first extracting the preprocessor from the sklearn pipeline and applying it to the data) and the list of one-hot encoded feature names: the package only accepts a pandas DataFrame (error message: "ValueError: only accept pandas DataFrame").
- Cannot pass the original dataframe and the original name of the feature: as the feature is one-hot encoded in the pipeline, plots cannot be generated correctly.
Is there a way to better support sklearn Pipeline objects? Ideally, users should be able to pass a pipeline and one-hot encoded feature names as arguments.
Hello, I'm facing the same problem. Is there any workaround for this issue?
Hi, I just started using PDPbox while following the Kaggle courses, and I stumbled upon the same limitation.
My workaround is to not use the pipeline as is, but rather apply the model on the preprocessed data.
I manually create the pandas DataFrame from the preprocessor-transformed data.
The not-so-trivial part is to recover the names of the features (especially the ones created by the OneHotEncoder).
For this, I use an edited version of the function get_column_names_from_ColumnTransformer from this thread: https://github.com/scikit-learn/scikit-learn/issues/12525#issuecomment-640900712
In the end, my code looks like this:
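# imports (numerical_cols, categorical_cols, my_model, X_train, y_train
# and X_valid are defined earlier in my notebook)
import pandas as pd
from pdpbox import pdp
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# building my preprocessor
numerical_transformer = SimpleImputer(strategy='constant')
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='error', drop='if_binary'))
])
my_preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

# fyi, the structure of my pipeline and training the model
my_pipeline = Pipeline(steps=[
    ('preprocessor', my_preprocessor),
    ('model', my_model)
])
my_pipeline.fit(X_train, y_train)

# preprocessing: my_pipeline.fit() above already fitted my_preprocessor
# in place (no `memory` caching is used), so transform() is enough here
X_valid_transformed = my_preprocessor.transform(X_valid)

# building a valid DataFrame for pdpbox
# (get_column_names_from_ColumnTransformer is my edited copy of the helper
# from the scikit-learn thread linked above, not a scikit-learn function)
feature_names = get_column_names_from_ColumnTransformer(my_preprocessor)
X_valid_transformed_pd = pd.DataFrame(X_valid_transformed, columns=feature_names)

# some pdpbox action: pass the bare model, not the pipeline,
# because the data is already preprocessed
pdp_goals = pdp.pdp_isolate(
    model=my_model,
    dataset=X_valid_transformed_pd,
    model_features=feature_names,
    feature='Fare'
)
pdp.pdp_plot(pdp_goals, 'Fare')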
I hope this helps!
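For the categorical case raised in the original question, one option that does not depend on PDPbox is to compute the partial dependence by hand on the raw column and let the full pipeline do the encoding. The following is a minimal sketch of that idea, not PDPbox's implementation: categorical_partial_dependence is a hypothetical helper, the toy data and toy_pipeline are invented for illustration, and the predict_proba branch assumes a binary classifier.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder


def categorical_partial_dependence(pipeline, X, column):
    """Average prediction when `column` is forced to each observed category.

    Hypothetical helper: works on the raw (untransformed) DataFrame and
    relies on the pipeline to one-hot encode the forced values.
    """
    result = {}
    for category in sorted(X[column].dropna().unique()):
        X_forced = X.copy()
        X_forced[column] = category  # every row gets the same category
        if hasattr(pipeline, "predict_proba"):
            # Assumes a binary classifier: probability of the positive class.
            preds = pipeline.predict_proba(X_forced)[:, 1]
        else:
            preds = pipeline.predict(X_forced)
        result[category] = float(np.mean(preds))
    return pd.Series(result, name="average_prediction")


# Hypothetical toy data and pipeline, only to make the sketch runnable.
toy_X = pd.DataFrame({
    "Fare": [7.25, 71.28, 8.46, 8.05, 51.86, 21.08],
    "Sex": ["male", "female", "male", "male", "female", "female"],
})
toy_y = [0, 1, 0, 0, 1, 1]

toy_pipeline = Pipeline(steps=[
    ("preprocessor", ColumnTransformer(
        transformers=[("cat", OneHotEncoder(), ["Sex"])],
        remainder="passthrough",
    )),
    ("model", LogisticRegression(max_iter=1000)),
])
toy_pipeline.fit(toy_X, toy_y)

sex_pdp = categorical_partial_dependence(toy_pipeline, toy_X, "Sex")
print(sex_pdp)
```

The resulting Series has one averaged prediction per category and can be drawn as a bar chart with sex_pdp.plot.bar() if matplotlib is installed.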