explainerdashboard
Regarding v0.3.8.1: pipeline support
I have two issues with the new pipeline support. Here is a screenshot of my dashboard:

- The first issue: why are my one-hot encoded columns not shown as one value? Rather than "enc_obj_PropertyType_EFH", "enc_obj_PropertyType_Whg" and "enc_obj_PropertyType_DHH" I would like to have a single Shapley value for the explanation of the feature PropertyType.
- The second issue: why are all feature names like "Property Type", "Unemployment", ... preceded by what I guess is a pipeline identifier, like "features" or "enc_obj" (which is actually the name I chose for my one-hot encoder)?
Best regards
Hi @nilslacroix,
This should be the output of the pipeline's get_feature_names_out() method. I'm simply relying on this method existing and giving meaningful feature names. Not all sklearn estimators have this method yet, although I think they are working on it.
You can group the onehot-encoded features together with the cats parameter, e.g. cats=['enc_obj_PropertyType'] should work in this case...
It would be nice to autodetect when feature names are due to a onehot encoding so that I can group them automatically. It would take some work, but in principle it should be possible, I guess...
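A minimal sketch of what that could look like when constructing the explainer (the pipeline and data names are placeholders taken from the discussion above):

from explainerdashboard import RegressionExplainer, ExplainerDashboard

# group the onehot-encoded PropertyType columns back into a single feature
explainer = RegressionExplainer(
    pipeline, X_test, y_test,
    cats=['enc_obj_PropertyType'],
)
ExplainerDashboard(explainer).run()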
Thanks for the fast reply! It seems that pipeline.get_feature_names_out() always puts a double underscore before the actual feature name, so that would be a way to cut the string to get appropriate column names. For onehot encoding and binary encoding that would indeed be a nice feature. Additionally, would it be possible to call pipeline[:-1].get_feature_names_out() to exclude the estimator, or would its output get mingled with the column names? I think this is more about the transformers, right?
Is it maybe possible to rename columns? Or should I just use cats for this as well and do something like cats_notencoded = {"Dailyneeds": ["pow_Dailyneeds"]}?
There is also a problem with using cats={} for encoded features. Assume you use a pipeline in scikit-learn, which usually looks like this:
pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("scaler", scaler),
    ("clf", LinearSVR()),
])
Naturally any scaling is done after the preprocessing. This means that the one-hot encoded columns will likely be changed to values that are not 0 or 1. This produces an error in ExplainerDashboard, since it expects these columns to contain unscaled 0 and 1 values exclusively.
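For what it's worth, one common way to keep the encoded columns as strict 0/1 values is to scale only the numeric columns inside the ColumnTransformer, so the onehot output never passes through the scaler. A rough sketch with illustrative column names:

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import LinearSVR

# scale only the numeric columns; the onehot columns stay 0/1
preprocessor = ColumnTransformer(transformers=[
    ("num", StandardScaler(), ["Unemployment", "LivingSpace"]),
    ("enc_obj", OneHotEncoder(handle_unknown="ignore"), ["PropertyType"]),
])
pipeline = Pipeline(steps=[("preprocessor", preprocessor), ("clf", LinearSVR())])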
Also regarding the target: for computational reasons my target variable was prepared with a log(x+1) transformation, so naturally, before analysing it, I intended to do the back-transformation np.expm1(x). Is the correct way to do this when I declare my explainer?
For example with:
explainer = RegressionExplainer(pipeline2, X_test, np.expm1(y_test), shap="guess", unit="€", n_jobs=multiprocessing.cpu_count() - 1)
Hmmm, I'm not sure the column-renaming functionality belongs in explainerdashboard. I guess you could either just extract the transformer and model yourself (e.g. transformer, model = pipeline[:-1], pipeline[-1]) and then do the renaming before you pass them to the explainer.
Or you could monkeypatch the get_feature_names_out() method.
Or maybe come up with a FeatureColumnNameTransformer that you include as the last step in the pipeline?
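A rough, untested sketch of what such a FeatureColumnNameTransformer could look like (the class and the renaming rule are purely illustrative, not part of explainerdashboard):

from sklearn.base import BaseEstimator, TransformerMixin

class FeatureColumnNameTransformer(BaseEstimator, TransformerMixin):
    """Pass-through step whose only job is to clean up the reported feature names."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # the data itself passes through unchanged
        return X

    def get_feature_names_out(self, input_features=None):
        # strip the "step__" prefix that the earlier steps prepend to their output names
        return [name.split("__")[-1] for name in input_features]

It could then be slotted in as the last transformer step before the model, e.g. Pipeline([..., ("rename", FeatureColumnNameTransformer()), ("clf", LinearSVR())]).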
As for the np.expm1(y_test) trick, I don't think this would work? I'm assuming that X and y come from the same distribution as the data the model was fit on. So if you transform y_test before passing it to the explainer, all the metrics will be wrong.
As for the names, well, it depends on whether you want out-of-the-box functionality with pipelines or not, but surely this can be handled manually.
Regarding the target: no, you are right, this won't work. This is kind of problematic because a lot of regression problems give better results if you log-transform a skewed target distribution. I don't know if there is a workaround for this.
Scaling is a bit bothersome too; maybe you could remove the assertion and replace it with a warning instead?
Error log was:
File ~\miniconda3\envs\Master_ML\lib\site-packages\explainerdashboard\explainer_methods.py:192, in parse_cats(X, cats, sep)
188 assert not set(onehot_dict.keys()) & set(all_cols), \
189 (f"These new cats columns are already in X.columns: {list(set(onehot_dict.keys()) & set(all_cols))}! "
190 "Please select a different name for your new cats columns!")
191 for col, count in col_counter.most_common():
--> 192 assert set(X[col].astype(int).unique()).issubset({0,1}), \
193 f"{col} is not a onehot encoded column (i.e. has values other than 0, 1)!"
194 onehot_cols = list(onehot_dict.keys())
195 for col in [col for col in all_cols if col not in col_counter.keys()]:
AssertionError: enc_plz__Postcode_4 is not a onehot encoded column (i.e. has values other than 0, 1)!
Hmm, thinking about this, this might be a bigger problem in general. If you want interpretable values in the dashboard, you cannot really scale, because then the values will be meaningless. So you would need to take the scaler object, apply it to all the annotations in the plots, but still keep the computational part working on the "real" scaled values.
@nilslacroix can you help me with this problem please?
I am trying to load external data into the dashboard using explainer.set_X_row_func() and explainer.set_y_func(). Does anyone know how to do this? Below is the relevant documentation on how to get around it:
Storing data externally: you can, for example, store only a subset of 10,000 rows in the explainer itself (enough to generate importance and dependence plots), and store the rest of your millions of rows of input data in an external file or database:
- With explainer.set_X_row_func() you can set a function that takes an index as argument and returns a single-row dataframe with model-compatible input data for that index. This function can include a query to a database or a file read.
- With explainer.set_y_func() you can set a function that takes an index as argument and returns the observed outcome y for that index.
- With explainer.set_index_list_func() you can set a function that returns a list of available indexes that can be queried. It only gets called upon start of the dashboard.
- If you have a very large number of indexes and the user is able to look them up elsewhere, you can also replace the index dropdowns with a simple free text field with index_dropdown=False.
- Only valid indexes (i.e. in the get_index_list() list) get propagated to other components by default, but this can be overridden with index_check=False.
- Instead of an index_list_func you can also set an explainer.set_index_check_func(func), which should return a bool indicating whether the index exists or not.
https://github.com/oegedijk/explainerdashboard#minimizing-memory-usage
But unfortunately, I couldn't figure it out. I intend to create an upload tab and then upload data there, but I don't know how. Can you help, please?
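For reference, a minimal sketch of how those hooks could be wired up, assuming the full dataset lives in a hypothetical CSV file indexed by the same index values as the explainer:

import pandas as pd

# hypothetical external store: all rows, model-compatible columns plus the target
full_X = pd.read_csv("full_data.csv", index_col=0)
full_y = full_X.pop("target")

explainer.set_X_row_func(lambda index: full_X.loc[[index]])   # single-row DataFrame for the model
explainer.set_y_func(lambda index: full_y.loc[index])         # observed outcome for that index
explainer.set_index_list_func(lambda: list(full_X.index))     # indexes shown in the dropdown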
Yeah, regarding the scaling issue: you could think of providing a pair of transform/inverse functions and then applying them in all the right places in the dashboard. It's a similar issue with classifiers, where you sometimes want to transform from logit space to probability space and back again.
Anyway, I think it's a bit out of scope (and a lot of work) for the dashboard to support it, but of course I would happily accept a PR!
Actually, if you can use explainerdashboard with pipelines from scikit-learn, it should be possible to make it work, since Pipeline provides inverse_transform() for all features (https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline.inverse_transform). I would love to do a PR, but to be honest I am not really qualified and too inexperienced in Python/GitHub. I would probably just introduce a whole lot of errors :P
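A minimal illustration of that inverse_transform() round trip, assuming every transformer step in the pipeline actually implements it (the column names and values below are made up):

import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVR

X = pd.DataFrame({"Unemployment": [3.1, 4.2, 5.0], "LivingSpace": [55.0, 80.0, 120.0]})
y = np.array([900.0, 1200.0, 1800.0])

pipe = Pipeline([("scaler", StandardScaler()), ("clf", LinearSVR())]).fit(X, y)

X_scaled = pipe[:-1].transform(X)                   # what the model actually sees
X_readable = pipe[:-1].inverse_transform(X_scaled)  # back to interpretable units for annotations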
Ah, that's cool, I didn't actually know about that feature. In this case it would specifically be about a transformation for y only though, right?
Hi everyone,
I wanted to share a useful suggestion for those using explainerdashboard with scikit-learn pipelines. Specifically, I've found that it's possible to extract the output of the preprocessor stage without any issues (as you already discussed here).
To do this, you can use code similar to the following:
import pandas as pd
from explainerdashboard import RegressionExplainer

# transform X_test with the fitted preprocessor and recover the generated feature names
feature_names_transformed = list(my_preprocessor.get_feature_names_out())
x_test_transformed = my_preprocessor.transform(x_test)
# strip the transformer prefixes ("num__", "cat__") from the feature names
feature_names = [f.split("__")[1] for f in feature_names_transformed]
df_x_test_transformed = pd.DataFrame(data=x_test_transformed, columns=feature_names, index=x_test.index)
explainer = RegressionExplainer(ml_model, df_x_test_transformed, cats=my_categorical_features)
However, it's important to avoid sparse matrices in the preprocessor, particularly when dealing with categorical features. Using sparse matrices can cause explainerdashboard to produce incorrect predictions, which can be a frustrating and time-consuming issue to debug.
Example of wrong usage (densifying after the fact):
if not isinstance(x_transformed, np.ndarray):
    x_transformed = x_transformed.toarray()
To ensure that explainerdashboard works correctly, make sure to use the sparse_threshold=0 parameter when defining your ColumnTransformer, like this:
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numerical_transformer, self.numerical_features),
        ("cat", categorical_transformer, self.categorical_features),
    ],
    sparse_threshold=0,  # force a dense matrix
)
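To double-check that the fitted preprocessor really returns a dense array, a quick assertion like the following can help (x_train here stands for whatever data you fit on):

import numpy as np

x_check = preprocessor.fit_transform(x_train)
assert isinstance(x_check, np.ndarray), "preprocessor still returns a sparse matrix"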
It's important to note that other preprocessing stages and scenarios may result in different issues, but I hope that sharing my experience with this specific issue can save others time and frustration.
Thank you @oegedijk for this magnificent library. Your hard work and dedication have made a huge impact on explainable ML and ethics in AI. Keep up the amazing work!