
Dealing with Categorical Features.

Open Dola47 opened this issue 4 years ago • 4 comments

Dear All,

Question 1: I noticed that you apply get_dummies to the categorical features, then split them into x_train and x_test.

Afterward, you pass x_train to the LIME explainer and x_test to LIME's explain_instance.

According to the author of LIME, the explainer and explain_instance should receive a label-encoded version of the categorical features, while the predict function should handle the one-hot encoded ones. I see that you are not doing that; you just call the model's predict_proba without taking the one-hot encoded features into account.

Question 2: Moreover, you mentioned that EBM takes care of categorical features automatically. Does it then consider any column in our dataset with string values as a categorical feature?

In my case, my dataset has a lot of columns with strings, but I do not want them to be treated as categorical features or even to go into the train/fit process of the model.

Can you clarify this point for me a little bit?

Thanks.

Dola47 avatar Mar 25 '20 11:03 Dola47

Hi @Dola47,

Thanks for the questions! Let me answer them one at a time:

I see that you are not doing that; you just call the model's predict_proba without taking the one-hot encoded features into account.

In our example notebook, both X_train and X_test are already one-hot encoded -- see this code at the top of the notebook:

import pandas as pd
from sklearn.model_selection import train_test_split

# One-hot encode the categorical columns before splitting
X_enc = pd.get_dummies(X, prefix_sep='.')

seed = 1
X_train, X_test, y_train, y_test = train_test_split(X_enc, y, test_size=0.20, random_state=seed)

At this point, the classifier is trained on the one-hot encoded data X_train, so its predict_proba call will also expect one-hot encoded data. In this way, we are taking the one-hot encoded features into account -- the blackbox classifier expects input that has already been one-hot encoded.
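For concreteness, a sketch of that part of the notebook's setup (the blackbox there is a PCA + random forest pipeline; the hyperparameters below are illustrative rather than the notebook's exact values):

from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

# The blackbox is trained directly on the one-hot encoded matrix
blackbox_model = Pipeline([('pca', PCA()), ('rf', RandomForestClassifier(n_estimators=100))])
blackbox_model.fit(X_train, y_train)

# predict_proba therefore expects one-hot encoded rows as well
probs = blackbox_model.predict_proba(X_test[:5])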

For LIME, because the model we constructed takes in one-hot encoded features, our setup will return explanations that map to what the model takes in directly, which can be useful in certain situations (e.g. debugging). Another approach, as you suggested, is to wrap the one-hot encoding logic inside the predict function of the model and get explanations that map back to the original data (i.e. one coefficient per categorical feature, instead of several).
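A minimal sketch of that alternative, assuming the blackbox was trained on pd.get_dummies output (the helper make_predict_fn and its arguments are hypothetical, not part of interpret's API):

import pandas as pd

def make_predict_fn(model, original_columns, encoded_columns):
    """Return a predict_proba that accepts rows in the original (un-encoded) feature space."""
    def predict_fn(X_original):
        df = pd.DataFrame(X_original, columns=original_columns)
        # Re-apply the same dummy encoding used at training time, then align
        # columns so any dummy columns missing from this batch become zeros
        df_enc = pd.get_dummies(df, prefix_sep='.').reindex(columns=encoded_columns, fill_value=0)
        return model.predict_proba(df_enc)
    return predict_fn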

Question 2:

does [EBM] then consider any column in our dataset with string values as a categorical feature?

Yes, this is correct. You can check what the model has inferred by looking at the .feature_names and .feature_types properties of the model, and modify them as you see fit.
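As a quick sketch, you can also declare the types up front rather than letting EBM infer them (the feature names and types below are made up for illustration):

from interpret.glassbox import ExplainableBoostingClassifier

# Explicitly declare how each column should be treated
ebm = ExplainableBoostingClassifier(
    feature_names=['age', 'income', 'city'],
    feature_types=['continuous', 'continuous', 'categorical'],
)
ebm.fit(X_train, y_train)

print(ebm.feature_names)   # what the model actually used
print(ebm.feature_types)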

However, because you don't even want those columns to go into the train/fit process of the model, we'd recommend dropping them before constructing your training set. For example, if your data were in a dataframe df:


from interpret.glassbox import ExplainableBoostingClassifier

# Keep only the columns the model should train on
columns_to_keep = ['feature1', 'feature2', 'feature5']
X_train = df[columns_to_keep]
# ...

ebm = ExplainableBoostingClassifier()
ebm.fit(X_train, y_train)

Hope this helps, and let us know if you have any further questions!

-InterpretML Team

interpret-ml avatar Mar 26 '20 20:03 interpret-ml

Thanks for your answer @interpret-ml.

However, I would like to clarify a few things.

Regarding your answer to the first question, you said:

For LIME, because the model we constructed is taking in one hot encoded features, our setup will return explanations that map to what the model takes in directly.

What exactly do you mean by "model"? (What I understood is that it is the Random Forest classifier combined with the PCA step, i.e. the blackbox model.) Or do you mean the way you implemented LIME in your framework?

I will continue by treating the model as the blackbox model. I have no problem with the model accepting one-hot encoded features.

However, in the step where you call LimeTabular, you pass it X_train, which currently contains one-hot encoded features. I would argue against that, because according to Marco:

We use a One-hot encoder, so that the classifier does not take our categorical features as continuous features. We will use this encoder only for the classifier, not for the explainer - and the reason is that the explainer must make sure that a categorical feature only has one value.

He said also:

predict function first transforms the data into the one-hot representation.

So, what I have in mind is that the step where LimeTabular and lime.explain_local take in one-hot encoded data is not correct; they should take only label-encoded data. The predict_fn, on the other hand, can either take one-hot encoded features directly, or take label-encoded features and internally transform them into one-hot encoded ones.
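For reference, a rough sketch of the setup Marco describes (the names le_data, feature_names, categorical_idx, categorical_names, class_names, encoder, blackbox_model, and le_instance are placeholders, not from your notebook):

from lime.lime_tabular import LimeTabularExplainer

explainer = LimeTabularExplainer(
    le_data,                               # label-encoded training data
    feature_names=feature_names,
    categorical_features=categorical_idx,  # column indices of the categorical features
    categorical_names=categorical_names,   # {column index: list of level names}
    class_names=class_names,
)

def predict_fn(x):
    # One-hot encode inside the predict function, as the tutorial suggests
    return blackbox_model.predict_proba(encoder.transform(x))

exp = explainer.explain_instance(le_instance, predict_fn, num_features=5)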

For the second question:

Does EBM have an internal mechanism that allows processors to be attached to it, so that instead of manually dropping features, I could just define which processors apply to which features, and the model would then be trained only on the selected features and their processed versions?
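To illustrate what I mean, here is a hypothetical sketch using scikit-learn's ColumnTransformer (not something I found in interpret's API; the column names are made up):

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from interpret.glassbox import ExplainableBoostingClassifier

# Select which columns go into the model and how each group is processed;
# every other column is dropped automatically
preprocess = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ['feature1', 'feature2']),
        ('cat', 'passthrough', ['feature5']),
    ],
    remainder='drop',
)

pipe = Pipeline([
    ('prep', preprocess),
    ('ebm', ExplainableBoostingClassifier()),
])
pipe.fit(df, y)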

Dola47 avatar Mar 30 '20 09:03 Dola47

@interpret-ml Any further response?

Dola47 avatar Apr 30 '20 22:04 Dola47

@interpret-ml ???????

Dola47 avatar May 12 '20 16:05 Dola47