smartnoise-sdk
Proof that DP-GAN and DP-CTGAN are differentially private?
Hi,
Seems like Opacus was tacked onto a non-private CTGAN implementation, but because of suspicions I raised in a comment on another GitHub repo that adapts this code, I am not totally convinced that the solution is end-to-end differentially private.
Is there some explanation or reasoning anywhere for why privacy is still preserved under the conditional vector mechanism and preprocessing?
Hello, great question. We took several extra steps to make DP-CTGAN private beyond just applying DP-SGD to the discriminator: we use noisy frequencies in the DataSampler, and instead of using a GMM for the preprocessing step, we provide three options for DP preprocessing of continuous columns (no change, DP standard scaler, and DP min-max). If you still want to use a GMM for preprocessing, as the paper does, you should perform that step locally, provide only the preprocessed data to DP-CTGAN, and convert the generated synthetic data back to its original form locally.
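For illustration, a DP standard scaler along these lines could look like the sketch below. This is not the SmartNoise implementation; the function name, the even epsilon split, and the assumption of public bounds are all illustrative.

```python
import numpy as np

def dp_standard_scale(x, lower, upper, epsilon, rng=None):
    """Hypothetical DP standard scaler (not the SmartNoise implementation).

    Assumes public bounds [lower, upper]; values are clipped so the Laplace
    sensitivities below hold. Epsilon is split evenly between a noisy mean
    and a noisy second moment (replace-one adjacency, n treated as public).
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.clip(np.asarray(x, dtype=float), lower, upper)
    n = len(x)

    # Changing one row moves the mean by at most (upper - lower) / n.
    mean = x.mean() + rng.laplace(scale=2 * (upper - lower) / (n * epsilon))

    # x**2 lies in [lo_sq, hi_sq]; changing one row moves the mean of
    # squares by at most (hi_sq - lo_sq) / n.
    lo_sq = 0.0 if lower <= 0.0 <= upper else min(lower**2, upper**2)
    hi_sq = max(lower**2, upper**2)
    mean_sq = (x**2).mean() + rng.laplace(scale=2 * (hi_sq - lo_sq) / (n * epsilon))

    std = np.sqrt(max(mean_sq - mean**2, 1e-12))  # guard against noisy negatives
    return (x - mean) / std
```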
Makes sense, thanks for the clarification! I see that taken into account here indeed: we can compute the privacy usage of measuring the frequencies of the values of discrete columns (which are used to calibrate the random selection of a column and value during training), and then compose it later with the training privacy cost.
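For concreteness, I imagine the frequency measurement looking something like the sketch below, with its epsilon composing sequentially with the DP-SGD epsilon (my own sketch, not the repo's code; the function name is made up):

```python
import numpy as np

def noisy_frequencies(column, epsilon, rng=None):
    """Hypothetical DP histogram of a discrete column (not the repo's code).

    Each row contributes to exactly one count, so adding Laplace noise with
    scale 1/epsilon to every count gives epsilon-DP for the whole histogram
    under add/remove-one adjacency.
    """
    rng = np.random.default_rng() if rng is None else rng
    values, counts = np.unique(np.asarray(column), return_counts=True)
    noisy = counts + rng.laplace(scale=1.0 / epsilon, size=len(counts))
    noisy = np.clip(noisy, 1e-9, None)  # keep the probabilities valid
    return dict(zip(values, noisy / noisy.sum()))

# This epsilon then composes sequentially with whatever DP-SGD spends:
# eps_total = eps_frequencies + eps_dpsgd (basic composition).
```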
However, I see that on this line, we're storing all of the indices of the given dataset (e.g., indices of the rows) for which a column takes on a particular value. Then, on this line, we randomly sample rows from the stored row indices (training-by-sampling). Intuitively, we would expect this to affect the privacy amplification calculus, since Opacus assumes a standard uniform sampling scheme when calculating its privacy spend, but here we are:
- sampling constrained to only a subset of the dataset rather than uniformly from the entire dataset, which makes it impossible to have a batch with examples that differ in every column (so the sample space is smaller)
- overall, sampling non-uniformly (since smaller subsets are weighted more heavily under the log factor than they would be under uniform sampling)
- sampling from a potentially tiny, batch-size-sized subset, for which it is not obvious to me that the same privacy accounting for DP-SGD would hold.
However, when we take a look at the privacy accounting portions of the code here and here, it seems that only the privacy expenditure from measuring the column frequencies and from performing DP-SGD is factored in. Is this the case, or am I missing something? (Either my logic went wrong somewhere, or Opacus is modified under the hood to take this into account.)
EDIT: apologies, as it seems I referenced some older code. That being said, I looked at the newest commit on main and the same comments still stand.
Note that `log_frequency` defaults to False, and we throw a privacy warning if the caller asks for log frequencies.
Also note that batches typically include samples that differ in every column (unless I am misunderstanding your comment). The training loop calls `sample_condvec`, which uniformly randomly selects from the available columns to determine which column will be the conditional column for each row. From there, a categorical value for the chosen conditional column is selected (using the noisy frequencies). Then, `sample_data` samples a row uniformly from the rows that have that value in the randomly-selected categorical column. The distribution of conditional categories in the batch will be uniform, the distribution of values in those conditional categories will be the noisy distribution, and the rows from there will be uniform.
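Schematically, the sampling loop does something like the following (a paraphrase for illustration, not the actual `sample_condvec`/`sample_data` code; the function and variable names are illustrative):

```python
import numpy as np

def sample_batch_indices(data, discrete_cols, noisy_freqs, batch_size, rng):
    """Schematic conditional sampler. `data` is a pandas DataFrame and
    `noisy_freqs[col]` maps each category value to its noisy probability."""
    rows = []
    for _ in range(batch_size):
        # 1. Pick the conditional column uniformly at random.
        col = rng.choice(discrete_cols)
        # 2. Pick a value for that column using the noisy frequencies.
        values = list(noisy_freqs[col])
        probs = np.array([noisy_freqs[col][v] for v in values])
        value = rng.choice(values, p=probs / probs.sum())
        # 3. Pick a row uniformly among the rows holding that value.
        matching = np.flatnonzero(data[col].to_numpy() == value)
        rows.append(rng.choice(matching))
    return np.array(rows)
```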
Ah, I see. I still think that one has to prove that such a conditional vector sampling process is differentially private, correct? In other words, instead of a uniform sampling from the dataset, this is non-uniform sampling where the probabilities depend on the values of the examples themselves (which to me seems a bit concerning). Would you not need to go through the DP-SGD proof and rework it?
When `log_frequency` is false, the sampler is actually sampling uniformly from the dataset. It's just doing so in a roundabout manner. The steps taken are:
- Select a categorical column uniformly at random. Let's suppose we select the `gender` column.
- Select a category value for that column, according to the actual frequencies in the data. Supposing our data is 43% male, then 57% of the time we will select female (assume PUMS, which has two genders). This is equivalent to uniformly selecting a row from the vector of all gender values.
- Now, uniformly randomly select a row where `gender` is equal to the value we randomly chose above.
This is the procedure outlined in the CTGAN paper, which only needs to be done this way if step 2 is non-uniform, for example when using log frequencies. When step 2 uses the actual frequencies, it is exactly uniform sampling: every row in the dataset has the same probability of being sampled for every slot in the batch.
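To spell that out: with $C$ categorical columns, $n$ rows, $x_i[c]$ denoting row $i$'s value in column $c$, and $n_{c,v}$ counting the rows with value $v$ in column $c$, the probability that one slot of the batch selects row $i$ is

$$
\Pr[\text{row } i] = \sum_{c=1}^{C} \frac{1}{C} \cdot \frac{n_{c,\,x_i[c]}}{n} \cdot \frac{1}{n_{c,\,x_i[c]}} = \sum_{c=1}^{C} \frac{1}{C\,n} = \frac{1}{n}.
$$

The frequency term and the uniform-within-group term cancel, so the result is $1/n$ regardless of row $i$'s actual values.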
To see that this is so, suppose you have 1 billion rows of data with 52 total dimension combinations (2 genders, 2 marital statuses, and 13 education levels). You can count the frequencies of all dimension combinations and divide by the number of rows to get a probability distribution over those 52 combinations. Now, instead of uniformly selecting row indices between 0 and 1 billion, you can draw from this 52-outcome probability vector to get rows that match the underlying distribution. You can also select one column according to its marginal probability, then the next conditioned on the previous column, and so on, to get the same result. In all of these procedures, each source row has equal probability of being selected.
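Here is a quick toy check of that equivalence (illustrative code, not from the repo): sampling via column-then-value-then-row with the true frequencies hits every row at roughly the same rate, just as plain uniform sampling would.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "gender": rng.choice(["male", "female"], size=n, p=[0.43, 0.57]),
    "married": rng.choice(["yes", "no"], size=n),
})
cols = {c: df[c].to_numpy() for c in df.columns}

hits = np.zeros(n)
for _ in range(100_000):
    col = rng.choice(list(cols))                         # 1. uniform column
    values, counts = np.unique(cols[col], return_counts=True)
    value = rng.choice(values, p=counts / counts.sum())  # 2. true frequencies
    matching = np.flatnonzero(cols[col] == value)
    hits[rng.choice(matching)] += 1                      # 3. uniform in group

# Each row is hit about 100_000 / 500 = 200 times, as uniform sampling predicts.
print(hits.mean(), hits.std())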
FWIW, it would be possible to support `log_frequency` on a large enough dataset by undersampling, but we didn't see improvements from using `log_frequency`, so it didn't seem worth the effort.
Oh, I see. This makes a lot of sense. So this means that sampling in this situation is identical to a non-conditional sampling process? In other words, the presence of the conditional vector results in a training process identical to what it would be if this code were not included.
I also see from here that the preprocessing is quite simple under DP: just a DP standard scaler for the continuous columns and one-hot encoding for the categorical columns. If I am understanding this correctly, does this mean that the central contributions of CTGAN (mode-specific normalization and the conditional vector idea) have been excluded from its differentially private version?
That's correct, though we still create the random conditional mask which selects a different categorical value to serve as a "label" for each row in the batch. The technique of randomly choosing categorical values to "condition" the other variables is relatively common, as is the technique of using masks to speed up the computation. As far as I know, both techniques predate the CTGAN paper, so the key innovation in that paper is using the random mask to up-sample rare category values via log-frequency. In our implementation the mask and row are both uniformly selected, so it's accurate to say we are excluding a key part of the CTGAN design.
As you've noticed, we also vary from the non-DP implementation of CTGAN by excluding the GMM preprocessor, which seems to have hurt performance more than excluding log frequency did. The synthesizer can still get good results if the analyst has a way to preprocess the continuous values into something that is standard scale. In the case of PUMS, for example, we know the min and max for income, so we can log transform and do a min-max scale without having to spend epsilon, and this gives OK results with DPCTGAN. We can also spend some budget to learn a DPMinMax or DPStandardScaler, and that works OK so long as the dataset is large enough to learn useful summaries for the preprocessing. However, the comparison with non-DP is unfavorable on something small like PUMS 1000, because we are comparing a non-noisy GMM against a preprocessor trying to learn a differentially private variance from a small number of rows with values of large magnitude. The GMM in the original (non-DP) CTGAN implementation "just works" in a lot of cases where DP requires fiddling.
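For the PUMS income case, for instance, the epsilon-free local preprocessing might look like the sketch below (the bounds are assumed to be public knowledge; the specific cap of 500,000 is just an illustration):

```python
import numpy as np

# Public knowledge about the column, not learned from the data,
# so no epsilon is spent here.
INCOME_MIN, INCOME_MAX = 0.0, 500_000.0

def income_to_unit(income):
    """Log-transform and min-max scale income into [0, 1]."""
    x = np.clip(np.asarray(income, dtype=float), INCOME_MIN, INCOME_MAX)
    return np.log1p(x) / np.log1p(INCOME_MAX)

def unit_to_income(u):
    """Invert the transform on synthesized values."""
    u = np.clip(np.asarray(u, dtype=float), 0.0, 1.0)
    return np.expm1(u * np.log1p(INCOME_MAX))
```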