synthcity
Question about the k-anonymity metric
I have a question about interpreting synthcity's k-anonymity metric for a synthetic dataset.
Consider the following example train dataset:
| | Age | Gender | Zip Code | Medical Condition |
|---|---|---|---|---|
| 1 | 25 | F | 10000 | Condition X |
| ... | ... | ... | ... | ... |
| n | 30 | M | 20000 | Condition Y |
Here, the sensitive feature is `Medical Condition`.
Suppose a synthetic dataset has `k = 1` because it contains only one row with this combination of Age, Gender, and Zip Code:
| | Age | Gender | Zip Code | Medical Condition |
|---|---|---|---|---|
| 1 | 25 | F | 10000 | Condition Y |
Here, the sensitive value in the synthetic dataset (`Condition Y`) differs from the true value for the matching record in the train dataset (`Condition X`). So on observing the synthetic dataset, an adversary won't learn the true value of the sensitive feature. In this case, can we say that a low k value for the synthetic dataset doesn't necessarily imply weaker privacy?
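To make the scenario concrete, here is a minimal sketch of what I mean. It uses the textbook definition of k (the size of the smallest group of rows sharing the same quasi-identifiers), which may not be how synthcity computes its metric internally. The function `k_anonymity`, the helper `qi_key`, and the column names are hypothetical and only for illustration; the train list contains just the two rows shown above, with the elided rows left out.

```python
from collections import Counter

# Assumption: Age, Gender and Zip Code are the quasi-identifiers.
QUASI_IDENTIFIERS = ("age", "gender", "zip_code")


def qi_key(row, quasi_identifiers=QUASI_IDENTIFIERS):
    """Tuple of quasi-identifier values that identifies a row's group."""
    return tuple(row[q] for q in quasi_identifiers)


def k_anonymity(rows, quasi_identifiers=QUASI_IDENTIFIERS):
    """Textbook k: size of the smallest group sharing the same quasi-identifiers."""
    counts = Counter(qi_key(row, quasi_identifiers) for row in rows)
    return min(counts.values())


# Only the two train rows shown in the question (elided rows omitted).
train = [
    {"age": 25, "gender": "F", "zip_code": "10000", "condition": "Condition X"},
    {"age": 30, "gender": "M", "zip_code": "20000", "condition": "Condition Y"},
]
synthetic = [
    {"age": 25, "gender": "F", "zip_code": "10000", "condition": "Condition Y"},
]

k_synthetic = k_anonymity(synthetic)

# Collect the true sensitive values per quasi-identifier group in the train data.
true_values = {}
for row in train:
    true_values.setdefault(qi_key(row), set()).add(row["condition"])

# Synthetic rows whose sensitive value matches a train row with the same
# quasi-identifiers, i.e. rows that would reveal a true sensitive value.
leaks = [
    row for row in synthetic
    if row["condition"] in true_values.get(qi_key(row), set())
]

print("k of synthetic data:", k_synthetic)
print("synthetic rows revealing a true sensitive value:", len(leaks))
```

Under these assumptions the synthetic dataset has k = 1, yet no synthetic row reproduces the true sensitive value of the train record it matches, which is the situation I am asking about.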
Are there any recommended guidelines for interpreting synthcity's k-anonymity metric?
Thank you!