
DP Random Forest Classifier failed when applying predict function

Open JoaoRodrigues9 opened this issue 3 years ago • 6 comments

Describe the bug
When fitting a differentially private Random Forest Classifier, the predict and predict_proba functions fail with the following error: "ValueError: can only convert an array of size 1 to a Python scalar".

I'm using a dataset with both numeric and categorical columns that were already pre-processed; when I apply the same process to the Logistic Regression classifier, it works.

Expected behavior
Given an X_test dataset, I expected predict to return a 1-D array with the model predictions.

Screenshots
Works fine: [screenshot]

Fails with the error "ValueError: can only convert an array of size 1 to a Python scalar": [screenshots]
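Since the screenshots are not reproduced here, the following is a hypothetical minimal sketch of the failing pattern described in this report; the dataset, column names and preprocessing are assumptions, not the original code:

```python
import numpy as np
import pandas as pd
from diffprivlib.models import RandomForestClassifier

# Toy data mixing a numeric and a categorical column; the test row uses a
# categorical value that never appears in the training data.
X_train = pd.DataFrame({"age": [25, 40, 33, 51],
                        "relationship": ["Husband", "Wife", "Own-child", "Husband"]})
y_train = np.array([0, 1, 0, 1])
X_test = pd.DataFrame({"age": [29], "relationship": ["Unmarried"]})

clf3 = RandomForestClassifier()
clf3.fit(X_train, y_train)

# Reported to raise: ValueError: can only convert an array of size 1 to a Python scalar
clf3.predict(X_test)
```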

System information (please complete the following information):

  • Windows
  • Python version 3.9.7
  • diffprivlib version or commit number 0.5.0
  • numpy version 1.21.5 / scikit-learn version 1.0.2

JoaoRodrigues9 avatar Feb 09 '22 12:02 JoaoRodrigues9

Hi Joao, Can you add the parameter cat_feature_threshold=1 at initialisation, and re-run the model? If it is still not working, can you please print out clf3.feature_domains_ and post its output here to help us debug the problem?
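For reference, a sketch of the suggested check (clf3 is the name used above; X_train and y_train are assumptions, since the original code is only in the screenshots):

```python
from diffprivlib.models import RandomForestClassifier

clf3 = RandomForestClassifier(cat_feature_threshold=1)
clf3.fit(X_train, y_train)

# If predict still fails, inspect the per-feature domains inferred during fitting
print(clf3.feature_domains_)
```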

naoise-h avatar Feb 11 '22 15:02 naoise-h

Hello, I am facing the same problem. Any news about this issue? I could make it work by setting cat_feature_threshold=0.

aso000 avatar Mar 28 '22 12:03 aso000

Hi, the issue here is how categorical variables are treated. If a feature is classed as categorical, but the test and training datasets differ in the categorical values they have for that feature, the above error is thrown.

If you specifically need categorical features to be recognised, a work-around is to specify the feature_domains parameter directly at initialisation, based on your knowledge of the dataset. This way you can be sure that all possible categorical values of a particular feature are accounted for.

We are hoping to refactor the model to require numerical-only values, in line with how scikit-learn trains random forests.

naoise-h avatar Mar 30 '22 14:03 naoise-h


I am having a similar issue; however, the ValueError shows up when I try to fit the following Random Forest:

clf = RandomForestClassifier(feature_domains={'1': ['Own-child', 'Husband', 'Wife', 'Not-in-family', 'Other-relative', 'Unmarried']}, cat_feature_threshold=1)

where X_train is a DataFrame with:

X_train.dtypes
Out[39]:
educational-num       int64
relationship       category
capital-gain          int64
dtype: object

and y_train is: [screenshot]

ppyarpe avatar Apr 06 '22 14:04 ppyarpe


Are you getting the same ValueError as above, or a Missing domains for some features in feature_domains error? When you specify feature_domains at initialisation, you need to account for all features, so you also need entries for educational-num and capital-gain. Also, because relationship has 6 categories, you may need to specify cat_feature_threshold=6.
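Putting that together for the three columns shown above, a corrected initialisation might look like the following (a sketch only: the domain values are illustrative, and the [min, max] form for the continuous features is an assumption; a model fitted without feature_domains exposes the expected format via feature_domains_):

```python
from diffprivlib.models import RandomForestClassifier

# One entry per feature, keyed by column index as a string:
#   '0': educational-num (continuous), '1': relationship (categorical), '2': capital-gain (continuous)
feature_domains = {
    '0': [1.0, 16.0],        # assumed [min, max] form for a continuous feature
    '1': ['Own-child', 'Husband', 'Wife', 'Not-in-family', 'Other-relative', 'Unmarried'],
    '2': [0.0, 99999.0],     # assumed [min, max] form for a continuous feature
}

clf = RandomForestClassifier(feature_domains=feature_domains, cat_feature_threshold=6)
clf.fit(X_train, y_train)
```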

naoise-h avatar Apr 07 '22 10:04 naoise-h

Thank you, I had previously tried to define the feature domains for capital-gain and educational-num, but they were set up as lists (and they are continuous variables). Do you know how to define them as an interval?

I also tried to one-hot encode my dataset, but had issues with the predict function, as in the issues above.

Thank you for your help!


ppyarpe avatar Apr 07 '22 13:04 ppyarpe

Hello everyone,

We have been re-engineering the implementation of RandomForestClassifier in #70 which should address the issues you have described here. The updated model requires a numerical array to be trained on, just like scikit-learn.

I would encourage you to give the update a go to see if it has resolved your issues.
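If you want to try it with a mixed DataFrame like the ones above, an encoding step along these lines should give the numerical array the updated model expects (a sketch of one possible encoding, not part of the library's API; parameter names in the refactored model may differ, so check #70 for details):

```python
import pandas as pd
from diffprivlib.models import RandomForestClassifier

# Encode categorical columns as integer codes, fixing the category set from the
# training data so that train and test share the same encoding.
X_train_num, X_test_num = X_train.copy(), X_test.copy()
for col in X_train_num.select_dtypes(include=["object", "category"]).columns:
    categories = pd.Categorical(X_train_num[col]).categories
    X_train_num[col] = pd.Categorical(X_train_num[col], categories=categories).codes
    X_test_num[col] = pd.Categorical(X_test_num[col], categories=categories).codes

clf = RandomForestClassifier()
clf.fit(X_train_num.to_numpy(), y_train)
preds = clf.predict(X_test_num.to_numpy())
```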

naoise-h avatar Aug 30 '22 11:08 naoise-h