
Allow me to control randomization when using the `DCRBaselineProtection` metric

Open npatki opened this issue 10 months ago • 0 comments

Problem Description

The `DCRBaselineProtection` metric measures the privacy of synthetic data by comparing it against random data. The random data is created by uniformly sampling within the real data's space.
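For intuition, here is a rough sketch of what that uniform sampling could look like for numeric columns. The function name is hypothetical and this is not SDMetrics' actual implementation:

```python
import numpy as np
import pandas as pd

def sample_uniform_baseline(real_data: pd.DataFrame, num_rows: int) -> pd.DataFrame:
    """Uniformly sample each numeric column within its observed min/max range."""
    rng = np.random.default_rng()
    return pd.DataFrame({
        column: rng.uniform(real_data[column].min(), real_data[column].max(), size=num_rows)
        for column in real_data.columns
    })
```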

Since randomness is involved in this step, this means that the metric can produce different scores even when it is called on the same data. The variation between these scores should not be too high (TBD), but it would be nice to have a way to control the randomness and produce a deterministic score.

Expected behavior

We can fix a seed internally for the creation of the random data. By default, the seed will not be fixed, meaning that the metric can produce a different score on each call.

However, there can be a hidden, private attribute on the class that, when set, fixes the seed. Because the attribute is hidden and private, nothing is promised to the end user yet. But in the future, this will make it easy to add a parameter for it if determinism is required.

```python
# there is no seed by default
>>> print(DCRBaselineProtection._seed)
None

# you can set the seed
>>> DCRBaselineProtection._seed = 23

# now every time the metric is called, it will produce the same result
>>> score_1 = DCRBaselineProtection.compute(...)
>>> score_2 = DCRBaselineProtection.compute(...)
>>> score_1 == score_2
True

# you can unset the seed to go back to randomness
>>> DCRBaselineProtection._seed = None
```
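Internally, this could be as simple as threading the class attribute into a local random generator whenever the random data is created. A minimal sketch, where the `_get_rng` helper is hypothetical:

```python
import numpy as np

class DCRBaselineProtection:
    _seed = None  # hidden, private attribute; no public promise yet

    @classmethod
    def _get_rng(cls):
        # default_rng(None) seeds from OS entropy (non-deterministic);
        # default_rng(23) always yields the same stream (deterministic)
        return np.random.default_rng(cls._seed)
```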

Additional Context

If a seed is set, it should only affect this metric, not any other metric or report being run.
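One way to get that isolation is to draw from a local `numpy.random.Generator` instead of seeding NumPy's global state. A quick sketch of the distinction, assuming NumPy is the randomness source:

```python
import numpy as np

# Avoid this: seeding the global state would affect every other
# metric or report that uses np.random in the same process.
# np.random.seed(23)

# Prefer this: a local generator scopes the determinism to one metric.
rng = np.random.default_rng(23)
print(rng.uniform(0, 1, size=3))        # reproducible across runs
print(np.random.uniform(0, 1, size=3))  # other code still gets fresh randomness
```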

npatki · Mar 13 '25