
Question about k-anonymity metric

Open amad-person opened this issue 2 months ago • 0 comments

I have a question about interpreting synthcity's k-anonymity metric for a synthetic dataset.

Consider the following example train dataset:

| | Age | Gender | Zip Code | Medical Condition |
|---|---|---|---|---|
| 1 | 25 | F | 10000 | Condition X |
| ... | ... | ... | ... | ... |
| n | 30 | M | 20000 | Condition Y |

Here, the sensitive feature is Medical Condition.

Suppose a synthetic dataset has k = 1 because it contains only one row with the following values:

| | Age | Gender | Zip Code | Medical Condition |
|---|---|---|---|---|
| 1 | 25 | F | 10000 | Condition Y |

Here, the sensitive value in the synthetic dataset (Condition Y) differs from the true value in the train dataset (Condition X). So an adversary who observes the synthetic dataset won't learn the true value of the sensitive feature. In this case, can we say that a low k value for the synthetic dataset doesn't necessarily imply weaker privacy?
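
To make the scenario concrete, here is a minimal sketch in plain Python. It assumes the textbook definition of k-anonymity (the size of the smallest group of rows sharing the same quasi-identifier values), which may differ from how synthcity actually computes its metric. The column names and the helpers `quasi_key`, `k_anonymity`, and `leaked_sensitive_values` are hypothetical, and the train data contains only the two rows shown above.

```python
from collections import Counter

# Assumption: Age, Gender, and Zip Code are the quasi-identifiers,
# and Medical Condition is the sensitive feature.
QUASI_IDENTIFIERS = ("age", "gender", "zip_code")
SENSITIVE = "medical_condition"


def quasi_key(row):
    """Project a row onto its quasi-identifier values."""
    return tuple(row[col] for col in QUASI_IDENTIFIERS)


def k_anonymity(rows):
    """Textbook k: size of the smallest group sharing the same quasi-identifiers."""
    counts = Counter(quasi_key(row) for row in rows)
    return min(counts.values())


def leaked_sensitive_values(train_rows, synthetic_rows):
    """Synthetic rows whose quasi-identifiers and sensitive value both match a train row."""
    train_pairs = {(quasi_key(r), r[SENSITIVE]) for r in train_rows}
    return [r for r in synthetic_rows if (quasi_key(r), r[SENSITIVE]) in train_pairs]


# Only the two train rows shown in the example; the elided rows are left out.
train = [
    {"age": 25, "gender": "F", "zip_code": "10000", "medical_condition": "Condition X"},
    {"age": 30, "gender": "M", "zip_code": "20000", "medical_condition": "Condition Y"},
]
synthetic = [
    {"age": 25, "gender": "F", "zip_code": "10000", "medical_condition": "Condition Y"},
]

k = k_anonymity(synthetic)
leaks = leaked_sensitive_values(train, synthetic)
print(k)      # 1
print(leaks)  # []
```

Under these assumptions, the synthetic dataset has k = 1, yet no synthetic row reproduces a train record's true sensitive value, which is the situation the question is about.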

Are there any recommended guidelines for interpreting synthcity's k-anonymity metric?

Thank you!

amad-person, Apr 26 '24 19:04