SDMetrics
KSTestExtended fails when data contains PII fields
Environment Details
- SDMetrics version: 0.3.0
- Python version: Python 3.7
- Operating System: Pop!_OS
Error Description
When evaluating data that contains PII fields, KSTestExtended.compute fails with a KeyError because the anonymized (fake) data does not contain a value that appears in the real data.
Steps to reproduce
Using the SDV tabular demo for PII:
from sdv.demo import load_tabular_demo
from sdv.tabular import GaussianCopula
data_pii = load_tabular_demo('student_placements_pii')
model = GaussianCopula(
    primary_key='student_id',
    anonymize_fields={
        'address': 'address'
    }
)
model.fit(data_pii)
new_data_pii = model.sample(200)
from sdv.metrics.tabular import KSTestExtended
KSTestExtended.compute(data_pii, new_data_pii)
This produces the following error:
~/.virtualenvs/SDV/lib/python3.7/site-packages/rdt/transformers/categorical.py in _get_value(self, category)
111 category = np.nan
112
--> 113 mean, std = self.intervals[category][2:]
114
115 if self.fuzzy:
KeyError: 'USS Fowler\nFPO AA 99303'
The key reported in the KeyError varies depending on the values present in the real dataset.
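The failure mode can be illustrated without SDV. The following is a simplified sketch, not rdt's actual implementation: a lookup table is built from the categories seen in one dataset, and a value that only exists in the other dataset is then looked up in it. The function name `fit_intervals` and the sample addresses are made up for illustration.

```python
from collections import Counter


def fit_intervals(values):
    """Map each category seen at fit time to a (start, end) interval in [0, 1].

    Hypothetical helper: it only mimics the idea of a fitted categorical
    lookup table, it is not the rdt code from the traceback.
    """
    counts = Counter(values)
    total = sum(counts.values())
    intervals = {}
    start = 0.0
    for category, count in counts.most_common():
        end = start + count / total
        intervals[category] = (start, end)
        start = end
    return intervals


# Free-text PII values rarely overlap between two independently generated sets.
fitted_addresses = ['addr A', 'addr B', 'addr A']
other_addresses = ['addr A', 'addr C']

intervals = fit_intervals(fitted_addresses)

missing = None
try:
    encoded = [intervals[value] for value in other_addresses]
except KeyError as error:
    # The unseen category is reported as the key, as in the traceback above.
    missing = error.args[0]
    print(f'Unseen category: {missing!r}')
```

Because an anonymized column is regenerated with fake values, almost every value is "unseen" from the point of view of a table fitted on the other dataset, which would explain why the reported key changes from run to run.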
Temporary workaround
Drop all PII fields from both real_data and synthetic_data before evaluating them with this metric.
For this demo, the following works:
ks_data_pii = data_pii.drop('address', axis=1)
ks_new_data_pii = new_data_pii.drop('address', axis=1)
KSTestExtended.compute(ks_data_pii, ks_new_data_pii)
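When several fields are anonymized, the same workaround could be wrapped in a small helper. This is only a sketch built on pandas: `drop_pii_columns` is a hypothetical name, not part of SDV or SDMetrics, and the toy tables below stand in for the demo data.

```python
import pandas as pd


def drop_pii_columns(real_data, synthetic_data, pii_fields):
    """Return copies of both tables without the given PII columns.

    Hypothetical helper; columns missing from a table are ignored.
    """
    real = real_data.drop(columns=list(pii_fields), errors='ignore')
    synthetic = synthetic_data.drop(columns=list(pii_fields), errors='ignore')
    return real, synthetic


# Toy stand-ins for data_pii and new_data_pii.
real_table = pd.DataFrame({
    'student_id': [1, 2],
    'address': ['addr A', 'addr B'],
    'salary': [100.0, 200.0],
})
synthetic_table = pd.DataFrame({
    'student_id': [10, 11],
    'address': ['addr C', 'addr D'],
    'salary': [110.0, 190.0],
})

clean_real, clean_synthetic = drop_pii_columns(
    real_table, synthetic_table, pii_fields=['address']
)
print(list(clean_real.columns))
```

With the demo above, the keys of the anonymize_fields dictionary could be passed as pii_fields, and the two returned tables handed to KSTestExtended.compute.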