DiCE
Categorical features have to be strings
When a dataset contains categorical features whose values are not strings, the following call raises an error.
# generate counterfactuals
dice_exp_genetic = exp_genetic.generate_counterfactuals(query_instances,
                                                        total_CFs=4, desired_class=desired_class)
ValueError Traceback (most recent call last)
<ipython-input-28-011c74faad5b> in <module>
1 # generate counterfactuals
2 dice_exp_genetic = exp_genetic.generate_counterfactuals(query_instances,
----> 3 total_CFs=4, desired_class=desired_class)
~/.local/lib/python3.7/site-packages/dice_ml/explainer_interfaces/explainer_base.py in generate_counterfactuals(self, query_instances, total_CFs, desired_class, desired_range, permitted_range, features_to_vary, stopping_threshold, posthoc_sparsity_param, posthoc_sparsity_algorithm, verbose, **kwargs)
100 posthoc_sparsity_algorithm=posthoc_sparsity_algorithm,
101 verbose=verbose,
--> 102 **kwargs)
103 cf_examples_arr.append(res)
104 return CounterfactualExplanations(cf_examples_list=cf_examples_arr)
~/.local/lib/python3.7/site-packages/dice_ml/explainer_interfaces/dice_genetic.py in _generate_counterfactuals(self, query_instance, total_CFs, initialization, desired_range, desired_class, proximity_weight, sparsity_weight, diversity_weight, categorical_penalty, algorithm, features_to_vary, permitted_range, yloss_type, diversity_loss_type, feature_weights, stopping_threshold, posthoc_sparsity_param, posthoc_sparsity_algorithm, maxiterations, thresh, verbose)
269 query_instance_orig = query_instance
270 query_instance = self.data_interface.prepare_query_instance(query_instance=query_instance)
--> 271 query_instance = self.label_encode(query_instance)
272 query_instance = np.array(query_instance.values[0])
273 self.x1 = query_instance
~/.local/lib/python3.7/site-packages/dice_ml/explainer_interfaces/dice_genetic.py in label_encode(self, input_instance)
524 def label_encode(self, input_instance):
525 for column in self.data_interface.categorical_feature_names:
--> 526 input_instance[column] = self.labelencoder[column].transform(input_instance[column])
527 return input_instance
528
/usr/local/lib/python3.7/dist-packages/sklearn/preprocessing/_label.py in transform(self, y)
275 return np.array([])
276
--> 277 _, y = _encode(y, uniques=self.classes_, encode=True)
278 return y
279
/usr/local/lib/python3.7/dist-packages/sklearn/preprocessing/_label.py in _encode(values, uniques, encode, check_unknown)
120 else:
121 return _encode_numpy(values, uniques, encode,
--> 122 check_unknown=check_unknown)
123
124
/usr/local/lib/python3.7/dist-packages/sklearn/preprocessing/_label.py in _encode_numpy(values, uniques, encode, check_unknown)
49 if diff:
50 raise ValueError("y contains previously unseen labels: %s"
---> 51 % str(diff))
52 encoded = np.searchsorted(uniques, values)
53 return uniques, encoded
ValueError: y contains previously unseen labels: [0]
The workaround is to convert every categorical column to strings before passing the data to DiCE, as follows.
# lst is the list of categorical column names
for c in lst:
    df[c] = df[c].astype(str)
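For anyone who wants a self-contained version of that workaround, here is a minimal sketch. It assumes a toy DataFrame with made-up column names, and that every column that is neither continuous nor the outcome should be treated as categorical; none of these names come from the notebook. It only shows the pandas conversion, not the DiCE calls.

```python
import pandas as pd

# Toy data standing in for the real dataset (hypothetical column names).
df = pd.DataFrame({
    "age": [25, 40, 31],
    "gender": [0, 1, 0],               # categorical, but stored as integers
    "education": ["HS", "BSc", "MSc"],  # categorical, already strings
    "income": [0, 1, 1],                # outcome column
})

outcome_name = "income"            # assumption for this sketch
continuous_features = ["age"]      # assumption for this sketch

# Treat every remaining column as categorical.
categorical_features = [
    c for c in df.columns
    if c not in continuous_features and c != outcome_name
]

# Cast each categorical column to str; continuous columns are left untouched.
for c in categorical_features:
    df[c] = df[c].astype(str)

print(df["gender"].tolist())
```

The same conversion presumably has to be applied to the query instances as well, so that their categorical values match the strings seen in the training data.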
The error can be reproduced with the Jupyter notebook https://github.com/interpretml/DiCE/blob/master/docs/source/notebooks/DiCE_model_agnostic_CFs.ipynb by replacing the values of the binary feature gender with integers:
gender = dataset['gender'].to_numpy()
gender[gender=='Male'] = '0'
gender[gender=='Female'] = '1'
dataset['gender'] = gender.astype(int)
dataset.head()
Thanks. The error comes from the use of sklearn's LabelEncoder, which expects strings. Having categoricals as numeric values is possible, but it raises the risk of confusion when a user does not explicitly provide the data type and actually wanted the column to be treated as numerical.
Therefore, it might be safer to pre-process the categorical columns to be non-numeric before passing them to DiCE. That said, categorical variables can sometimes be integers. We will look to support this in a future release.
Hi!!
I got the same error when running exp_genetic.generate_counterfactuals. However, when I use exp_random.generate_counterfactuals, I don't get this error. Can you explain why this error is only raised for exp_genetic.generate_counterfactuals?
Furthermore, I tried to fix the error by following the comments from @amit-sharma and @londumas, but still didn't succeed in running the function. Could you provide a more detailed solution?
Thank you!!