smartnoise-sdk Purpose/effect of sigma, preprocessor_eps and category_epsilon_pct parameters in DP-CTGAN and PATE-CTGAN

Open AnSmithD opened this issue 3 years ago • 1 comments

Hey there,

Can someone please tell me what the sigma, preprocessor_eps and category_epsilon_pct parameters are for in the DP-CTGAN and PATE-CTGAN implementations, and what each of them does? I've been looking for a while but can't find an explanation anywhere.

Many thanks in advance!

Jul 29 '22 08:07 AnSmithD

Tagging @AprilXiaoyanLiu

Aug 17 '22 05:08 joshua-oss

Hello! sigma is the noise multiplier used by Opacus. It interacts with the batch size to control how fast the epsilon gets spent, so you may want to experiment with sigma and batch size together to get the best convergence.

preprocessor_eps specifies how much epsilon gets spent on preprocessing continuous values. If you have only categorical values, you can set this to 0.0. Note that preprocessor_eps is subtracted from the total budget specified in the epsilon parameter, so if you have epsilon == 1.0 and preprocessor_eps == 1.0, there will be no epsilon left over for training.

category_epsilon_pct sets the budget used to estimate the frequencies of categorical variables so that they can be sampled to create conditional vectors (the CT part of CTGAN). This value is a percentage, and is taken from whatever remains of the original epsilon after subtracting the preprocessor epsilon. It defaults to 10%, and is spread evenly across the categorical columns.

So, for example, if you have some continuous columns and want to preprocess them, in a table with 5 categorical columns, and you use epsilon == 3.0 with default values for category_epsilon_pct and preprocessor_eps, you will get the following behavior:

An epsilon of 1.0 will be used to scale the continuous columns, leaving 2.0 of the original 3.0.
An epsilon of 0.2 (10% of the remaining 2.0) will be used to estimate frequency histograms for the 5 categorical columns. This means that an epsilon of 0.04 will be used to count the bins in each column.
The remaining epsilon of 1.8 will be used for training.
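
To make the arithmetic above easy to replay with other values, here is a minimal sketch in plain Python. It does not call smartnoise-sdk at all: the function name split_budget and the dictionary keys are made up for illustration, and the even per-column split is an assumption based on the description in this thread, not a copy of the library's code.

```python
def split_budget(epsilon, preprocessor_eps, category_epsilon_pct, n_categorical):
    """Replay the epsilon split described in this thread.

    Hypothetical helper for illustration only; it is not part of smartnoise-sdk.
    Assumes at least one categorical column and an even per-column split.
    """
    if preprocessor_eps > epsilon:
        raise ValueError("preprocessor_eps cannot exceed the total epsilon")
    if n_categorical < 1:
        raise ValueError("this sketch assumes at least one categorical column")

    # Step 1: preprocessing of continuous columns comes off the top.
    remaining = epsilon - preprocessor_eps

    # Step 2: a percentage of what remains goes to the categorical
    # frequency estimates, shared across the categorical columns.
    category_eps = remaining * category_epsilon_pct
    per_column_eps = category_eps / n_categorical

    # Step 3: whatever is left is available for training.
    training_eps = remaining - category_eps

    return {
        "preprocessing": preprocessor_eps,
        "categories_total": category_eps,
        "categories_per_column": per_column_eps,
        "training": training_eps,
    }


if __name__ == "__main__":
    # Same numbers as the example: epsilon 3.0, preprocessor_eps 1.0,
    # category_epsilon_pct 10%, 5 categorical columns.
    budget = split_budget(3.0, 1.0, 0.10, 5)
    for name, value in budget.items():
        print(f"{name}: {value:.2f}")
```

Setting preprocessor_eps equal to epsilon in this sketch gives a training budget of 0, which is the "no epsilon left over for training" case mentioned above.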

Sep 29 '22 02:09 joshua-oss
