differential-privacy-library
DP Random Forest Classifier failed when applying predict function
Describe the bug When using a differentially private Random Forest Classifier, the predict and predict_proba functions fail with the following error: "ValueError: can only convert an array of size 1 to a Python scalar"
I'm using a dataset with both numeric and categorical columns that was already pre-processed; when I apply the same process to the Logistic Regression classifier, it works.
Expected behavior Given an X_test dataset, I expected predict to return a 1-D array with the model predictions.
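For reference, the expected workflow follows the standard scikit-learn estimator API. A minimal sketch using scikit-learn's own RandomForestClassifier (not the diffprivlib one that triggers the error, so this is only an illustration of the intended fit/predict shapes, with toy data standing in for the real dataset):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy numeric data standing in for the pre-processed dataset
X_train = np.array([[0, 1], [1, 0], [1, 1], [0, 0]])
y_train = np.array([0, 1, 1, 0])
X_test = np.array([[1, 1], [0, 0]])

clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(X_train, y_train)

preds = clf.predict(X_test)        # 1-D array: one label per test row
proba = clf.predict_proba(X_test)  # 2-D array: (n_samples, n_classes)
print(preds.shape, proba.shape)    # (2,) (2, 2)
```

The bug reported here is that the diffprivlib model raises the ValueError at the `predict` step instead of returning the 1-D array.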
Screenshots
Works fine
Fails with the error "ValueError: can only convert an array of size 1 to a Python scalar"
System information (please complete the following information):
- Windows
- Python version 3.9.7
- diffprivlib version or commit number 0.5.0
- numpy version 1.21.5 / scikit-learn version 1.0.2
Hi Joao, can you add the parameter cat_feature_threshold=1 at initialisation and re-run the model? If it is still not working, can you please print out clf3.feature_domains_ and post its output here to help us debug the problem?
Hello, I am facing the same problem. Any news on this issue? I managed to make it work by setting cat_feature_threshold=0.
Hi, the issue here is how categorical variables are treated. If a feature is classed as categorical, but the test and training datasets differ in the categorical values they have for that feature, the above error is thrown.
If you specifically need categorical features to be recognised, a work-around is to specify the feature_domains
parameter directly at initialisation, based on your knowledge of the dataset. This way you can be sure that all possible categorical values of a particular feature are accounted for.
We are hoping to refactor the model to require numerical-only values, in line with how scikit-learn trains random forests.
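Following the suggestion above, one way to make sure all possible categorical values are accounted for is to build the feature_domains dict from the union of values seen in both splits. A sketch in plain pandas, assuming the format used elsewhere in this thread (a dict keyed by feature index as a string, mapping to the list of allowed category values):

```python
import pandas as pd

# Toy train/test splits whose categorical values differ,
# which is exactly the situation that triggers the error
train = pd.DataFrame({"relationship": ["Husband", "Wife", "Own-child"]})
test = pd.DataFrame({"relationship": ["Husband", "Unmarried"]})

# Union of values seen in either split, keyed by column position
feature_domains = {}
for i, col in enumerate(train.columns):
    values = sorted(set(train[col]) | set(test[col]))
    feature_domains[str(i)] = values

print(feature_domains)
# {'0': ['Husband', 'Own-child', 'Unmarried', 'Wife']}
```

Passing a dict built this way at initialisation guarantees no test-time category falls outside the declared domain.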
I am having a similar issue; however, the ValueError shows up when I try to fit the following Random Forest:
clf = RandomForestClassifier(feature_domains={'1': ['Own-child', 'Husband', 'Wife', 'Not-in-family', 'Other-relative', 'Unmarried']}, cat_feature_threshold=1)
where X_train is a DataFrame with:
X_train.dtypes
Out[39]:
educational-num       int64
relationship       category
capital-gain          int64
dtype: object
and y_train is
Are you getting the same ValueError as above, or a "Missing domains for some features in feature_domains" error? When you specify feature_domains at initialisation, you need to account for all features, so you also need entries for educational-num and capital-gain. Also, because relationship has 6 categories, you may need to specify cat_feature_threshold=6.
Thank you, I had previously tried to define the feature domains for capital-gain and educational-num, but they were set up as lists (and they are continuous variables). Do you know how to define them as an interval?
I also tried to one-hot encode my dataset, but had issues with the predict function as in the issues above.
Thank you for your help!
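One possible shape for mixed-type domains (purely an assumption; the thread does not confirm how diffprivlib 0.5.x expects continuous intervals to be written) is to give categorical columns a list of categories and numeric columns a [min, max] pair, built directly from the training frame. A hypothetical helper, build_feature_domains, sketching that idea:

```python
import pandas as pd

# Toy frame matching the dtypes shown above
X_train = pd.DataFrame({
    "educational-num": [9, 13, 10],
    "relationship": pd.Categorical(["Husband", "Wife", "Own-child"]),
    "capital-gain": [0, 2174, 0],
})

def build_feature_domains(df):
    """Categorical columns -> list of categories; numeric -> [min, max].

    The [min, max] interval format is an assumption, not confirmed
    by the diffprivlib maintainers in this thread.
    """
    domains = {}
    for i, col in enumerate(df.columns):
        if isinstance(df[col].dtype, pd.CategoricalDtype):
            domains[str(i)] = sorted(df[col].cat.categories)
        else:
            domains[str(i)] = [float(df[col].min()), float(df[col].max())]
    return domains

print(build_feature_domains(X_train))
# {'0': [9.0, 13.0], '1': ['Husband', 'Own-child', 'Wife'], '2': [0.0, 2174.0]}
```

Note this covers every feature, as the maintainer's reply above requires; whether the interval entries are accepted by the 0.5.x model is the unverified part.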
Hello everyone,
We have been re-engineering the implementation of RandomForestClassifier in #70 which should address the issues you have described here. The updated model requires a numerical array to be trained on, just like scikit-learn.
I would encourage you to give the update a go to see if it has resolved your issues.