SDMetrics
Sequence length distribution metric
- Sequence length is defined as the number of rows per ID in the entity column for single-table data. For multi-table data, it is the sum of the number of rows across the entity columns of all tables, per ID.
- The distance between the sequence length distributions of the original and simulated data can then be calculated using the Kolmogorov-Smirnov (KS) test.
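For the single-table case, the proposal above can be sketched roughly as follows. This is only an illustration: the helper name `sequence_lengths` and the toy `patient_id` data are made up for the example, and `scipy.stats.ks_2samp` is one reasonable way to run the two-sample KS test, not necessarily what the final metric would use.

```python
import pandas as pd
from scipy.stats import ks_2samp


def sequence_lengths(data, entity_column):
    """Count the rows per entity ID (hypothetical helper, not an SDMetrics function)."""
    return data.groupby(entity_column).size()


# Toy data, invented for illustration only
real = pd.DataFrame({'patient_id': ['a', 'a', 'a', 'b', 'b', 'c']})
synthetic = pd.DataFrame({'patient_id': ['x', 'x', 'y', 'y', 'z', 'z']})

real_lengths = sequence_lengths(real, 'patient_id')            # a: 3, b: 2, c: 1
synthetic_lengths = sequence_lengths(synthetic, 'patient_id')  # x: 2, y: 2, z: 2

# Two-sample KS test on the two sets of sequence lengths
result = ks_2samp(real_lengths, synthetic_lengths)
print(result.statistic)  # largest gap between the two empirical CDFs
```

A KS statistic of 0 means the two length distributions are identical, and values closer to 1 mean they differ more.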
Thanks for filing @ashafquat-mdsol. Couple thoughts:
Scope
The SDV ecosystem right now does not support multi-table sequential data. I think it is best we focus on the single table case.
The SDV does support cases where you have a single entity column or multiple entity columns. If you have multiple columns, then every unique combination of the columns identifies an entity -- for example a pair like (first_name, last_name).
API & Naming
We are moving towards more descriptive names of metrics. I suggest calling this one SequenceLengthSimilarity, if that works for you.
In terms of the API, we’re piloting usage where the base version of the metric only takes in the most minimal possible data. I think this metric only needs to accept the column (or set of columns) that describe the entity. The number of times a value is repeated determines the length of that sequence.
For consistency's sake, maybe these can be passed in as a pandas.DataFrame. It doesn't seem like any other parameters are needed.
# example, if there is a single entity column called 'patient_id'
SequenceLengthSimilarity.compute(
    real_data[['patient_id']],
    synthetic_data[['patient_id']])

# example, if there are multiple entity columns: 'first_name' and 'last_name'
# together, they determine the entity
SequenceLengthSimilarity.compute(
    real_data[['first_name', 'last_name']],
    synthetic_data[['first_name', 'last_name']])
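To make the suggested API a bit more concrete, here is a rough sketch of what such a class could look like. It is an assumption-laden illustration, not the SDMetrics implementation: it does not inherit from any SDMetrics base class, it treats every column passed in as part of the entity key, and returning `1 - KS statistic` (so that higher means more similar) is a choice made here for the example rather than something decided in this thread.

```python
import pandas as pd
from scipy.stats import ks_2samp


class SequenceLengthSimilarity:
    """Illustrative sketch only; not the actual SDMetrics class."""

    @staticmethod
    def compute(real_data, synthetic_data):
        # Every column in the DataFrame is treated as part of the entity key,
        # so each unique combination of values identifies one sequence.
        real_lengths = real_data.groupby(list(real_data.columns)).size()
        synthetic_lengths = synthetic_data.groupby(list(synthetic_data.columns)).size()

        # Assumption: report 1 - KS statistic so that 1.0 means identical
        # length distributions and lower values mean less similar ones.
        statistic = ks_2samp(real_lengths, synthetic_lengths).statistic
        return 1 - statistic


# Toy data with a two-column entity key, invented for illustration
real_data = pd.DataFrame({
    'first_name': ['Ann', 'Ann', 'Ann'],
    'last_name': ['Lee', 'Lee', 'Kim'],
})
synthetic_data = pd.DataFrame({
    'first_name': ['Bob', 'Bob', 'Cy'],
    'last_name': ['Ray', 'Ray', 'Ray'],
})

score = SequenceLengthSimilarity.compute(
    real_data[['first_name', 'last_name']],
    synthetic_data[['first_name', 'last_name']])
print(score)
```

Grouping on the full list of columns handles the single-column and multi-column cases with the same code path, which keeps the call signature as minimal as described above.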
Code Location
For now, perhaps we can put this in the timeseries module, although the team is actively thinking about how to better structure the SDMetrics library. You can inherit from the base class as needed.
Let me know if you have any questions!
Created PR #232 to add this test