
Cache internal counterfactual calculations?

Open cdeterman opened this issue 3 years ago • 7 comments

I have a large training dataset (>1M rows). I'm not sure whether that affects the time to generate counterfactuals, but when I try to generate counterfactuals from an internal xgboost model, it takes a very long time for just 10 samples. The dummy code below is very similar to what I have seen in the example notebooks.

import dice_ml

d = dice_ml.Data(dataframe=train_df, continuous_features=vars_for_cfs, outcome_name='target')

mdl = load('my/model/path')
m = dice_ml.Model(model=mdl, backend='sklearn')
exp = dice_ml.Dice(d, m)

dice_exp = exp.generate_counterfactuals(input_df, total_CFs=3, desired_class=1, features_to_vary=vars_to_vary)

Can any of this process be cached to make subsequent calls faster?
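As a first troubleshooting step, generation can be timed one query instance at a time to see whether the cost is uniform or dominated by a few rows. A minimal sketch (the stub explainer below stands in for the DiCE `exp` object above; with the real setup you would pass one-row slices such as `input_df.iloc[[i]]`):

```python
import time

class StubExplainer:
    """Stand-in for the DiCE explainer with the same call shape (assumption)."""
    def generate_counterfactuals(self, query_instances, total_CFs=3, **kwargs):
        time.sleep(0.01)  # pretend the counterfactual search takes some time
        return {"total_CFs": total_CFs}

def time_per_instance(explainer, rows, total_CFs=3):
    """Return (row_index, seconds) pairs from generating CFs one row at a time."""
    timings = []
    for i, row in enumerate(rows):
        t0 = time.perf_counter()
        explainer.generate_counterfactuals(row, total_CFs=total_CFs)
        timings.append((i, time.perf_counter() - t0))
    return timings

timings = time_per_instance(StubExplainer(), rows=[None, None, None])
```

If one row dominates the total time, the problem is likely in the search for that particular instance rather than in the overall setup.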

cdeterman avatar Jan 11 '22 03:01 cdeterman

Hi,

The training data is only used to learn properties of the data (such as feature ranges). generate_counterfactuals() is the call that actually generates the counterfactuals, so the size of the training data shouldn't be the cause of the long generation times you're seeing.

Since each query instance is separate, the counterfactuals for each query instance may be different, so a caching strategy may not work. Do you have any particular caching approach in mind to make generation faster?
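For completeness, the only cache that seems safe under this constraint is a memo keyed on the exact query-row values plus the generation parameters, which helps only when identical rows recur. A minimal sketch with a hypothetical stub explainer (stand-in for the real DiCE object):

```python
class StubExplainer:
    """Counts calls so the cache's effect is visible (stand-in for dice_ml.Dice)."""
    def __init__(self):
        self.calls = 0
    def generate_counterfactuals(self, row_values, total_CFs=3):
        self.calls += 1
        return {"row": row_values, "total_CFs": total_CFs}

_cf_cache = {}

def cached_generate(explainer, row_values, total_CFs=3):
    """Memoize on the row's feature values; distinct rows gain nothing."""
    key = (tuple(row_values), total_CFs)
    if key not in _cf_cache:
        _cf_cache[key] = explainer.generate_counterfactuals(row_values, total_CFs=total_CFs)
    return _cf_cache[key]

exp_stub = StubExplainer()
cached_generate(exp_stub, [1.0, 2.0], total_CFs=3)
cached_generate(exp_stub, [1.0, 2.0], total_CFs=3)  # second call served from the cache
```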

Regards,

gaugup avatar Jan 12 '22 10:01 gaugup

Is it normal for the generate call to take more than 24 hours for only 10 instances? It seems like something must be off for a call to take that long.

cdeterman avatar Jan 12 '22 14:01 cdeterman

No, it is not normal; it seems like a bug. Are you able to reproduce this consistently?

Regards,

gaugup avatar Jan 13 '22 06:01 gaugup

I have reproduced it repeatedly with my dataset. Would adding permitted_range perhaps help speed things up? Any idea why it would run so slowly? I have even tried running it with only 2 instances, and it still hangs.

cdeterman avatar Jan 13 '22 15:01 cdeterman

Some additional details: I am using a previously trained xgboost model, and I can call predict_proba with no issues on the same input dataset I am trying to use here. I'm just trying to track down all the possible reasons the counterfactuals call could be hanging.

Also, I tried passing a permitted_range argument and the call still hangs. Any additional insight would be sincerely appreciated.
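One more check that may help: DiCE's model-agnostic methods typically call the model's prediction function many times on perturbed samples during the search, so even a modestly slow predict_proba compounds badly. A sketch for measuring average per-call latency (stub model shown; with the real setup you would pass `mdl` and `input_df`):

```python
import time

class StubModel:
    """Stand-in for the xgboost model; only the predict_proba call shape matters."""
    def predict_proba(self, X):
        return [[0.5, 0.5] for _ in X]

def predict_latency(model, X, n_calls=100):
    """Average seconds per predict_proba call over n_calls repetitions."""
    t0 = time.perf_counter()
    for _ in range(n_calls):
        model.predict_proba(X)
    return (time.perf_counter() - t0) / n_calls

avg_seconds = predict_latency(StubModel(), [[0.1, 0.2]] * 10)
```

If the average per-call latency is high, thousands of such calls inside the search would explain a multi-hour run.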

cdeterman avatar Jan 14 '22 01:01 cdeterman

I wanted to check back on whether there is anything else I could do to troubleshoot this slow processing. Could it perhaps be related to using the 'sklearn' backend with an xgboost model?

cdeterman avatar Jan 17 '22 17:01 cdeterman

I have also tried to reproduce the examples in the notebooks included in this repository. However, when I try to load the 'Adult' dataset I get the following error, so I cannot even confirm that the examples run. Are you able to pull the data? Is this expected?

OSError: https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data not found.

cdeterman avatar Jan 21 '22 19:01 cdeterman