smartnoise-sdk
ValueError: We don't support continuous values on this synthesizer. Please discretize values.
Is there a preferred discretizer? And how do we deal with the data after synthesis (to transform bins back into numerical values while preserving the correlations)?
Note that we will soon reintroduce support for continuous values on the GAN-based synthesizers. However, MWEM will still support only discretized values, and we may have future synthesizers that require discretization.
To group numeric items into bins, `np.histogram` works well. However, you will want to take care to ensure that the bins don't leak private information. In general, this means that the bin boundaries should be determined in a way that does not require looking at the data. For example, if the data curator or domain expert has determined that the `age` column will have values that range only between 0 and 100, you can safely make `n` equal-sized bins by dividing that range by the desired number of buckets.
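A minimal sketch of that approach, assuming a publicly agreed range of 0-100 (the `ages` array and bin count are made up for illustration):

```python
import numpy as np

# Public, data-independent bounds agreed on by the data curator.
AGE_MIN, AGE_MAX = 0, 100
N_BINS = 10

# Bin edges come from the public range only -- never from the data.
edges = np.linspace(AGE_MIN, AGE_MAX, N_BINS + 1)

ages = np.array([23, 45, 67, 12, 89, 34])  # toy data

# Counts per bin, with boundaries fixed ahead of time.
counts, _ = np.histogram(ages, bins=edges)

# To discretize individual records, map each value to its bin index.
bin_ids = np.digitize(ages, edges[1:-1])  # indices in [0, N_BINS - 1]
```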
In the case of something like `age`, you could plausibly use 100 bins, with each integer age being its own bin. The key is that the min and max should be provided independent of the data; i.e. without being derived from the actual data. For fields with a much larger range, you can still safely bin, as long as you have a way of obtaining a data-independent min and max. In cases where you don't have a public min/max available, you could compute a differentially private min and max, and then divide that range into equally sized bins. In this case, you will spend some extra epsilon to safely learn the min and max.
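That last case might look roughly like this; `estimate_dp_bounds` is a hypothetical stand-in (stubbed here so the sketch runs), not a smartnoise-sdk API:

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.exponential(scale=40_000, size=1_000)  # toy income-like data
num_bins = 30

def estimate_dp_bounds(x, epsilon):
    """Hypothetical stand-in for a differentially private bounds
    estimator (e.g. one based on DP quantiles); NOT a smartnoise-sdk
    API. Stubbed with loose fixed bounds so the sketch runs."""
    return 0.0, 500_000.0

# Spend a small slice of the budget to learn approximate bounds.
dp_min, dp_max = estimate_dp_bounds(values, epsilon=0.1)

# Once the bounds are themselves private, equal-width edges derived
# from them are safe to publish and reuse.
edges = np.linspace(dp_min, dp_max, num_bins + 1)
bin_ids = np.digitize(np.clip(values, dp_min, dp_max), edges[1:-1])
```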
There isn't a one-size-fits-all answer to converting bins back into numerical values. Some people treat the binning as a pre-processing step, and just use the binned data output from the synthesizer. If the values in the binned column are uniformly distributed before binning, you can invert them by choosing the midpoint between the bin edges, or by drawing uniformly from between the bin edges. It's relatively uncommon for data to be uniform in this manner, though, so choosing the midpoint will result in some loss of fidelity (for example, if the data are gaussian, or heavy-tailed like a zipf distribution).
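Both inversions are a couple of lines with numpy (the `edges` and `bin_ids` values here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Assume `edges` came from public bounds and `bin_ids` is the
# discretized column that came back from the synthesizer.
edges = np.linspace(0, 100, 11)
bin_ids = np.array([2, 2, 5, 9, 0])

# Option 1: the midpoint of each bin.
midpoints = (edges[:-1] + edges[1:]) / 2
decoded_mid = midpoints[bin_ids]

# Option 2: draw uniformly within each record's bin, which avoids
# collapsing every record in a bin onto a single value.
decoded_uniform = rng.uniform(edges[bin_ids], edges[bin_ids + 1])
```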
You could plausibly improve on the midpoint if you know that the data are roughly exponential or gaussian, but in this case you would want to be careful not to leak private information with the binning/reversing procedure. For example, let's say that you are binning an `income` column, and we know without looking at the data that incomes typically follow a power law distribution. If you choose a "safe" min and max, and divide the incomes into 30 equally-sized bins, the records will be more heavily concentrated in the bins toward the bottom of the income distribution, and more importantly, the records within each bin will be concentrated toward the lower bin edge. So, showing the midpoint of each bin will cause the final results to be biased upwards. If you had some estimate of the exponent of the distribution, you could instead choose a point in each bin, slightly to the left of the midpoint, which would be a good guess for the center of mass of each bin (presumably the exponent is consistent across bins, so the point chosen would follow the same rule in every bin). The key point is that the exponent is itself a summary statistic, and would need to be computed in a differentially private way (or from public knowledge).
Also note that people will sometimes use non-uniform bin sizes when dealing with exponentially distributed data like income. For example, each bin is double the width of the bin that falls below it. As long as this increase factor (e.g. 2X, 1.23X, etc.) is derived from public knowledge or computed in a privacy-preserving way, this can give bins that are more balanced in membership.
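A sketch of such geometrically widening bins; the lower bound, first bin width, and growth factor are all assumed public parameters:

```python
import numpy as np

# Public parameters: lower bound, first bin width, growth factor, count.
low, first_width, factor, n_bins = 0.0, 1_000.0, 2.0, 12

# Each bin is `factor` times as wide as the one below it, so widths
# grow geometrically and the edges are their cumulative sums.
widths = first_width * factor ** np.arange(n_bins)
edges = np.concatenate(([low], low + np.cumsum(widths)))

incomes = np.array([500.0, 4_200.0, 87_000.0, 1_300_000.0])  # toy data
bin_ids = np.digitize(np.clip(incomes, edges[0], edges[-1]), edges[1:-1])
```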
Tagging @AprilXiaoyanLiu re: continuous
This is fixed in #463