KSTestExtended - Fail when data contains PII fields

Open pvk-developer opened this issue 4 years ago • 0 comments

Environment Details

  • SDMetrics version: 0.3.0
  • Python version: Python 3.7
  • Operating System: Pop!_OS

Error Description

Evaluating data that contains PII fields with KSTestExtended fails with a KeyError, because the fake (synthetic) data does not contain a given record.

Steps to reproduce

Using the SDV tabular demo for PII:

from sdv.demo import load_tabular_demo
from sdv.tabular import GaussianCopula

data_pii = load_tabular_demo('student_placements_pii')
model = GaussianCopula(
    primary_key='student_id',
    anonymize_fields={
        'address': 'address'
    }
)

model.fit(data_pii)
new_data_pii = model.sample(200)

from sdv.metrics.tabular import KSTestExtended
KSTestExtended.compute(data_pii, new_data_pii)

This produces the following error:

~/.virtualenvs/SDV/lib/python3.7/site-packages/rdt/transformers/categorical.py in _get_value(self, category)
    111             category = np.nan
    112 
--> 113         mean, std = self.intervals[category][2:]
    114 
    115         if self.fuzzy:

KeyError: 'USS Fowler\nFPO AA 99303'

The key reported in the KeyError varies depending on the values present in the real dataset.
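
To illustrate the failure mode, here is a minimal sketch of a dictionary-backed categorical lookup. ToyCategoricalTransformer is a hypothetical stand-in and not RDT's actual implementation, and it assumes the transformer is fit on one table and then applied to the other, which I have not confirmed from the SDMetrics source. The non-military addresses are made up for the example.

```python
# Toy stand-in for a categorical transformer; NOT RDT's actual code.
class ToyCategoricalTransformer:
    def fit(self, values):
        categories = sorted(set(values))
        width = 1.0 / len(categories)
        # Map each category seen during fit to a sub-interval of [0, 1].
        self.intervals = {
            category: (i * width, (i + 1) * width)
            for i, category in enumerate(categories)
        }
        return self

    def transform(self, values):
        # Plain dict lookup: raises KeyError for a category not seen in fit.
        return [sum(self.intervals[value]) / 2 for value in values]


# Assumption: one table is used for fit, the other for transform.
fitted_on = ["12 Oak St", "99 Elm Ave"]
applied_to = ["USS Fowler\nFPO AA 99303"]  # value absent from fitted_on

transformer = ToyCategoricalTransformer().fit(fitted_on)
try:
    transformer.transform(applied_to)
    error = None
except KeyError as exc:
    error = exc

print(repr(error))
```

Because anonymized fields are replaced with freshly generated values, the two tables are unlikely to share any category in such a column, so a lookup of this kind would fail on essentially any PII field.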

Temporary workaround

Drop all PII fields from both the real_data and the synthetic_data before evaluating with this metric.

Here is a working solution for this demo:

ks_data_pii = data_pii.drop('address', axis=1)
ks_new_data_pii = new_data_pii.drop('address', axis=1)
KSTestExtended.compute(ks_data_pii, ks_new_data_pii)
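
The same workaround can be generalized to any list of PII columns. This is a sketch only: drop_pii_columns is a hypothetical helper, not part of SDV or SDMetrics, and the small frames below are made-up stand-ins for the demo data.

```python
import pandas as pd


def drop_pii_columns(real_data, synthetic_data, pii_fields):
    """Return copies of both tables without the given PII columns.

    Hypothetical helper for illustration; not part of SDV or SDMetrics.
    """
    columns = list(pii_fields)
    # errors="ignore" skips names that are missing from a table.
    real = real_data.drop(columns=columns, errors="ignore")
    synthetic = synthetic_data.drop(columns=columns, errors="ignore")
    return real, synthetic


# Made-up data standing in for data_pii and new_data_pii.
real = pd.DataFrame({
    "student_id": [1, 2],
    "address": ["12 Oak St", "99 Elm Ave"],
    "salary": [10.0, 20.0],
})
synthetic = pd.DataFrame({
    "student_id": [0, 1],
    "address": ["USS Fowler\nFPO AA 99303", "7 Pine Rd"],
    "salary": [12.0, 18.0],
})

ks_real, ks_synthetic = drop_pii_columns(real, synthetic, ["address"])
print(list(ks_real.columns))
```

In the demo above, the column names could be taken from the keys of the anonymize_fields dictionary passed to the model, and the two returned tables passed to KSTestExtended.compute.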

pvk-developer, Jun 18 '21 10:06