smartnoise-sdk
differential privacy of numpy.histogramdd
In mwem.py, np.histogramdd is called from _histogram_from_data_attributes.
As it uses the original data to create bins, shouldn't the identification of bin edges be randomized too in order to guarantee Differential Privacy?
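To illustrate the concern in general terms, here is a minimal sketch (not the MWEM code itself). It assumes only NumPy; the helper names and the toy datasets are hypothetical. When `bins` is an integer, `np.histogramdd` takes the bin edges from the minimum and maximum of the data, so changing a single record can move the edges.

```python
import numpy as np

def data_dependent_edges(data, bins=4):
    # With an integer `bins`, the edges span the min and max of the data.
    _, edges = np.histogramdd(data, bins=bins)
    return edges

def fixed_edges_histogram(data, edges):
    # With explicit edges, the binning does not depend on the data.
    hist, _ = np.histogramdd(data, bins=edges)
    return hist

# Two neighboring datasets that differ in one record.
d1 = np.array([[0.0], [1.0], [2.0], [3.0]])
d2 = np.array([[0.0], [1.0], [2.0], [100.0]])

e1 = data_dependent_edges(d1)
e2 = data_dependent_edges(d2)
print(e1[0][-1], e2[0][-1])  # the last edge reveals the maximum of each dataset

# Edges chosen without looking at the data (an assumption for this sketch).
public_edges = [np.linspace(0.0, 4.0, 5)]
h1 = fixed_edges_histogram(d1, public_edges)
h2 = fixed_edges_histogram(d2, public_edges)
```

With the public edges, the record at 100.0 falls outside the range and is dropped, so in practice values would be clipped to the bounds first.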
Thanks for flagging this. Our intention is for MWEM to only ever receive categorical columns, with the categories provided externally. The way the code is currently written, it's not actually trying to discretize continuous values; it's just treating all columns as if they are integer-coded categorical and giving each category its own bin. In other words, the assumption is that the data are categorical, with categories coded as (0, 1, 2, ...). This approach creates risk in at least two ways. First, if the caller has an external category encoding where not all integers map to categories (e.g. the data dictionary specifies categories [1, 2, 4]), the histogram we create will have spurious categories. Second, if the caller doesn't realize the columns are meant to be categorical and passes continuous values, it could result in some leakage and poor utility.
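A minimal sketch of the first risk, assuming only NumPy and made-up data (this is not the MWEM implementation). Giving every integer from 0 up to the maximum its own bin creates bins for codes 0 and 3, which are not in the data dictionary [1, 2, 4]. Mapping the codes through the externally supplied category list avoids that.

```python
import numpy as np

# Integer-coded column and the externally supplied category list.
codes = np.array([1, 2, 4, 4, 1])
categories = [1, 2, 4]

# One bin per integer from 0 to max(codes): bins 0 and 3 are spurious.
naive = np.bincount(codes)

# Map each code to its position in the category list, then count.
index_of = {c: i for i, c in enumerate(categories)}
positions = [index_of[int(c)] for c in codes]
compact = np.bincount(positions, minlength=len(categories))

print(naive.tolist())    # [0, 2, 1, 0, 2]
print(compact.tolist())  # [2, 1, 2]
```

Once noise is added to the histogram, the spurious bins can receive nonzero mass, so synthetic records could land in categories that do not exist.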
We intend to fix both of these in the next month by refactoring the way we pass the data to the synthesizers. You can see the proposal here; please let us know if you have any feedback: https://github.com/opendp/smartnoise-sdk/issues/467
I will keep this issue open until that PR is merged.
MWEM now uses a LabelTransformer for all categorical columns by default, so all categorical columns are stored as integer indices starting at zero. Continuous values are binned externally by the TableTransformer, using privacy budget to infer approximate bounds if bounds are not provided.