differential privacy of numpy.histogramdd

Open gergelyacs opened this issue 2 years ago • 1 comments

In mwem.py, np.histogramdd is called from _histogram_from_data_attributes. Because it derives the bin edges from the original data, shouldn't the choice of bin edges also be randomized in order to guarantee differential privacy?

gergelyacs Jun 20 '22 06:06

Thanks for flagging this. Our intention is for MWEM to only ever receive categorical columns, with the categories provided externally. As currently written, the code does not try to discretize continuous values; it treats every column as integer-coded categorical and gives each category its own bin. In other words, it assumes the data are categorical, with categories like (0, 1, 2, ...). This approach creates risk in at least two ways. First, if the caller has an external category encoding in which not every integer maps to a category (e.g. the data dictionary specifies categories [1, 2, 4]), the histogram we create will contain spurious categories. Second, if the caller doesn't realize the columns are meant to be categorical, the result could be some leakage and poor utility.
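The first risk can be illustrated with a minimal sketch. It assumes one unit-width bin per integer between the observed minimum and maximum, which approximates the behavior described above but is not the code from mwem.py; the data, the `categories` list, and all variable names are hypothetical.

```python
import numpy as np

# Hypothetical integer-coded column; the data dictionary allows only {1, 2, 4}.
data = np.array([[1], [2], [4], [4], [2], [1], [1]])

# Assumption: one unit-width bin per integer from the observed min to the
# observed max. Note that lo and hi are read from the private data.
lo, hi = data.min(), data.max()
edges = [np.arange(lo, hi + 2) - 0.5]  # edges at 0.5, 1.5, 2.5, 3.5, 4.5

counts, _ = np.histogramdd(data, bins=edges)
# counts has four bins, including an empty one for the non-category 3.

# Alternative: take the categories from the (public) data dictionary and
# count against zero-based indices, so the domain never depends on the data.
categories = [1, 2, 4]
index = {c: i for i, c in enumerate(categories)}
codes = np.array([index[v] for v in data[:, 0]])
public_counts = np.bincount(codes, minlength=len(categories))
```

In the first histogram the bin for the value 3 is spurious, and the bin range itself depends on the data. In the second, both the number of bins and their meaning are fixed before the data are read.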

We intend to fix both of these in the next month by refactoring the way we pass data to the synthesizers. You can see the proposal here; please let us know if you have any feedback: https://github.com/opendp/smartnoise-sdk/issues/467

I will keep this issue open until that PR is merged.

joshua-oss Jun 28 '22 03:06

MWEM now uses a LabelTransformer for all categorical columns by default, so all categorical columns are stored as integer indices starting at zero. Continuous values are binned externally by the TableTransformer, which spends privacy budget to infer approximate bounds if bounds are not provided.
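As a rough illustration of the two ideas, here is a sketch in plain NumPy. It does not use the SDK's LabelTransformer or TableTransformer classes, and the helpers `label_encode` and `bin_continuous` are hypothetical names. It assumes the category list and the bounds are supplied by the caller; the case where bounds are inferred using privacy budget is not shown.

```python
import numpy as np

def label_encode(values, categories):
    """Map each value to a zero-based index in a caller-supplied category list."""
    index = {c: i for i, c in enumerate(categories)}
    return np.array([index[v] for v in values])

def bin_continuous(values, lower, upper, n_bins):
    """Bin values into n_bins equal-width bins over caller-supplied bounds.

    Values outside [lower, upper] are clipped, so the bin edges never
    depend on the data itself.
    """
    edges = np.linspace(lower, upper, n_bins + 1)
    clipped = np.clip(values, lower, upper)
    # Digitizing against interior edges only yields indices 0 .. n_bins - 1.
    return np.digitize(clipped, edges[1:-1])

# Hypothetical columns.
labels = label_encode(["b", "a", "c", "a"], categories=["a", "b", "c"])
age_bins = bin_continuous(np.array([17.0, 35.0, 62.0, 120.0]), 0.0, 100.0, 4)
```

Both outputs are integer indices starting at zero, which matches the input MWEM is described as expecting.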

joshua-oss Oct 09 '22 00:10