Cannot handle categorical features when using PyTorch models

Open junqi-jiang opened this issue 3 years ago • 2 comments

I am using DiCE for PyTorch on heterogeneous datasets, binary classification. I pass in the continuous feature indices and the original dataset for the data interface, and a trained model for the model interface.

The model was trained on the one-hot-encoded version of the dataset. When encoded correctly, the number of features grows from 9 to 27.

If I call d.get_encoded_categorical_feature_indexes(), the indexes it returns cover 27 encoded features, so the data interface gets this right.

But when I generate counterfactuals, I get RuntimeError: size mismatch, m1: [1 x 33], m2: [27 x 10] at ../aten/src/TH/generic/THTensorMath.cpp:41, which suggests that the input instance is being one-hot encoded in some unexpected way that produces 33 feature columns.
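For readers unfamiliar with this error, it reads as a matrix-multiplication shape check: m1 is presumably the encoded input row and m2 the first layer's weights, so the inner dimensions (33 and 27) have to agree. A minimal pure-Python sketch of that rule follows; `matmul_shape` is a hypothetical helper for illustration only, not part of DiCE or PyTorch.

```python
def matmul_shape(m1_shape, m2_shape):
    """Return the result shape of m1 @ m2, or raise if the inner dimensions differ."""
    rows, inner1 = m1_shape
    inner2, cols = m2_shape
    if inner1 != inner2:
        # Same information as the RuntimeError quoted in this issue.
        raise ValueError(
            f"size mismatch, m1: [{rows} x {inner1}], m2: [{inner2} x {cols}]"
        )
    return (rows, cols)


# A correctly encoded query (27 columns) passes the check.
ok_shape = matmul_shape((1, 27), (27, 10))

# The 33-column query from this issue fails it.
try:
    matmul_shape((1, 33), (27, 10))
    mismatch_message = None
except ValueError as err:
    mismatch_message = str(err)
```

Comparing the width of the encoded query against the model's expected input size before calling the model would surface the same problem with a clearer message.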

I traced the problem to "explainer_interfaces/dice_pytorch.py", line 421: query_instance = self.data_interface.get_ohe_min_max_normalized_data(query_instance).iloc[0].values. I suspect something is wrong here.

Just wanted to check: is handling categorical features supported for PyTorch and TensorFlow models? Cheers!

junqi-jiang avatar Nov 17 '21 07:11 junqi-jiang

@junqi-jiang, could you share some code sample which gave you this error?

Regards,

gaugup avatar Nov 17 '21 23:11 gaugup

Sure! I did this:

m = dice_ml.Model(model=model, backend='PYT')
d = dice_ml.Data(dataframe=df, continuous_features=cont_feat_names, outcome_name="Risk")
exp = dice_ml.Dice(d, m)
query = {
    'Age': 45,
    'Sex': 1,
    'Job': 2,
    'Housing': 0,
    'Saving accounts': 0,
    'Checking account': 0,
    'Credit amount': 7882,
    'Duration': 42,
    'Purpose': 4
}
cf = exp.generate_counterfactuals(query, total_CFs=2, desired_class='opposite')
a = cf.visualize_as_dataframe(show_only_changes=True)

And it would print the following:

RuntimeError: size mismatch, m1: [1 x 33], m2: [27 x 20] at ../aten/src/TH/generic/THTensorMath.cpp:41

My model is a 'torch.nn.modules.container.Sequential' object, trained on the one-hot-encoded version of the dataset.

If I run print(d.get_encoded_categorical_feature_indexes()), it prints [[3, 4], [5, 6, 7, 8], [9, 10, 11], [12, 13, 14, 15], [16, 17, 18], [19, 20, 21, 22, 23, 24, 25, 26]], meaning that there are 27 features in total after one-hot encoding.
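The count of 27 can be checked directly from the printed index groups. The sketch below assumes that encoded columns are numbered contiguously from 0 and that the columns missing from the groups (0 to 2) are the continuous features; both are assumptions, and `encoded_width` is a hypothetical helper, not a DiCE function.

```python
# Index groups as printed by d.get_encoded_categorical_feature_indexes() above.
encoded_categorical_indexes = [
    [3, 4],
    [5, 6, 7, 8],
    [9, 10, 11],
    [12, 13, 14, 15],
    [16, 17, 18],
    [19, 20, 21, 22, 23, 24, 25, 26],
]


def encoded_width(index_groups):
    """Total number of encoded columns, assuming contiguous numbering from 0."""
    return max(i for group in index_groups for i in group) + 1


n_categorical_cols = sum(len(group) for group in encoded_categorical_indexes)
n_total = encoded_width(encoded_categorical_indexes)
# Assumption: every column not listed in a group is a continuous feature.
n_continuous = n_total - n_categorical_cols

print(n_categorical_cols, n_continuous, n_total)  # 24 3 27
```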

junqi-jiang avatar Nov 19 '21 05:11 junqi-jiang

We have an update in v0.9 for deep learning models that allows you to specify an encoding for categorical features:

m = dice_ml.Model(model_path=dice_ml.utils.helpers.get_adult_income_modelpath(), backend='TF2', func="ohe-min-max")

Here "ohe-min-max" is a one-hot encoding method for categorical features.
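To illustrate the kind of transformation the name "ohe-min-max" suggests, here is a minimal pure-Python sketch that min-max scales continuous features and one-hot encodes categorical ones. It is not DiCE's implementation: `ohe_min_max_encode`, the feature ranges, and the category levels are all hypothetical values chosen for illustration.

```python
def ohe_min_max_encode(row, continuous_ranges, categories):
    """Encode one record as a flat list of floats.

    continuous_ranges: {feature: (min, max)}, assumed to satisfy max > min.
    categories: {feature: [levels]}, the full level list seen at training time.
    """
    encoded = []
    # Continuous features first, scaled to [0, 1] for values inside the range.
    for name, (lo, hi) in continuous_ranges.items():
        encoded.append((row[name] - lo) / (hi - lo))
    # Then one column per category level, in the fixed training-time order.
    for name, levels in categories.items():
        encoded.extend(1.0 if row[name] == level else 0.0 for level in levels)
    return encoded


# Hypothetical ranges and levels; a real setup would derive them from the training data.
row = {"Age": 45, "Sex": 1}
vec = ohe_min_max_encode(row, {"Age": (19, 75)}, {"Sex": [0, 1]})
```

The encoded width depends only on the level lists supplied, so the model and the explainer must share the same lists for their input sizes to agree.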

amit-sharma avatar Oct 20 '22 05:10 amit-sharma