Sequence length distribution metric

Open ashafquat-mdsol opened this issue 3 years ago • 2 comments

  • Sequence length is defined as the number of rows per ID in the entity column for single-table data. For multi-table data, it is the sum of the number of rows per ID across the entity columns of all tables.
  • The distance between the sequence length distributions of the original and simulated data can then be calculated using the Kolmogorov-Smirnov (KS) test.
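
As a rough illustration of the single-table definition above, here is a minimal sketch that counts rows per ID and compares the two length distributions with SciPy's two-sample KS test. The helper name `sequence_lengths`, the column name `patient_id`, and the toy data are assumptions made for this example only; they are not part of SDMetrics.

```python
import pandas as pd
from scipy.stats import ks_2samp


def sequence_lengths(data: pd.DataFrame, entity_column: str) -> pd.Series:
    """Hypothetical helper: number of rows per entity ID (the sequence length)."""
    return data.groupby(entity_column).size()


# Toy data, invented for illustration only.
real = pd.DataFrame({"patient_id": ["a", "a", "a", "b", "b", "c"]})
synthetic = pd.DataFrame({"patient_id": ["x", "x", "y", "y", "z", "z"]})

real_lengths = sequence_lengths(real, "patient_id")            # a: 3, b: 2, c: 1
synthetic_lengths = sequence_lengths(synthetic, "patient_id")  # x: 2, y: 2, z: 2

# Two-sample KS statistic: the largest gap between the two empirical CDFs.
ks_statistic = ks_2samp(
    real_lengths.to_numpy(), synthetic_lengths.to_numpy()
).statistic
print(ks_statistic)
```

A statistic of 0 would mean the two length distributions have identical empirical CDFs; larger values mean they are further apart.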

ashafquat-mdsol, Aug 31 '22 16:08

Thanks for filing, @ashafquat-mdsol. A couple of thoughts:

Scope

The SDV ecosystem does not currently support multi-table sequential data, so I think it is best we focus on the single-table case.

The SDV does support cases where you have a single entity column or multiple entity columns. If you have multiple columns, then every unique combination of the columns identifies an entity -- for example a pair like (first_name, last_name).
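
To make the multi-column case concrete, here is a small sketch of how sequence lengths could be counted when a combination of columns identifies the entity. It uses plain pandas grouping; the data is invented for illustration, and this is not necessarily how the SDV handles entity columns internally.

```python
import pandas as pd

# Toy data, invented for illustration only.
data = pd.DataFrame({
    "first_name": ["Ana", "Ana", "Ana", "Ben", "Ben"],
    "last_name": ["Lee", "Lee", "Roy", "Lee", "Lee"],
})

# Every unique (first_name, last_name) pair is treated as one entity.
entity_columns = ["first_name", "last_name"]
lengths = data.groupby(entity_columns).size()
print(lengths)
```

Here ("Ana", "Lee") and ("Ana", "Roy") count as different entities even though they share a first name, so the table contains three sequences rather than two.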

API & Naming

We are moving towards more descriptive names of metrics. I suggest calling this one SequenceLengthSimilarity, if that works for you.

In terms of the API, we’re piloting usage where the base version of a metric takes in only the minimal data it needs. I think this metric only needs to accept the column (or set of columns) that identifies the entity. The number of times a value (or combination of values) repeats determines the length of that entity's sequence.

For consistency's sake, maybe these can be passed in as a pandas.DataFrame. It doesn't seem like any other parameters are needed.

# example, if there is a single entity column called 'patient_id'
SequenceLengthSimilarity.compute(
  real_data[['patient_id']],
  synthetic_data[['patient_id']])

# example, if there are multiple entity columns: 'first_name' and 'last_name'
# together, they determine the entity
SequenceLengthSimilarity.compute(
  real_data[['first_name', 'last_name']],
  synthetic_data[['first_name', 'last_name']])
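
For discussion purposes, here is one way a metric with this call shape could be sketched. The class name `SequenceLengthSimilaritySketch` is hypothetical and this is not the SDMetrics implementation. It assumes that every column passed in is an entity column, and that the score is reported as 1 minus the KS statistic so that higher means more similar; that scoring convention is an assumption of this sketch, not something settled in this thread.

```python
import pandas as pd
from scipy.stats import ks_2samp


class SequenceLengthSimilaritySketch:
    """Hypothetical stand-in for the proposed metric, not the SDMetrics class."""

    @staticmethod
    def compute(real_data: pd.DataFrame, synthetic_data: pd.DataFrame) -> float:
        # Assumption: every column in each DataFrame is an entity column, so
        # each unique row combination identifies one sequence.
        real_lengths = real_data.groupby(list(real_data.columns)).size()
        synthetic_lengths = synthetic_data.groupby(
            list(synthetic_data.columns)
        ).size()

        statistic = ks_2samp(
            real_lengths.to_numpy(), synthetic_lengths.to_numpy()
        ).statistic
        # Assumption: report 1 - KS statistic so that 1.0 means the two
        # length distributions have identical empirical CDFs.
        return 1.0 - float(statistic)


# Toy data, invented for illustration only.
real_data = pd.DataFrame({"patient_id": ["a", "a", "a", "b", "b", "c"]})
synthetic_data = pd.DataFrame({"patient_id": ["x", "x", "y", "y", "z", "z"]})

score = SequenceLengthSimilaritySketch.compute(
    real_data[["patient_id"]], synthetic_data[["patient_id"]]
)
print(score)
```

Because the sketch groups by whatever columns it receives, the same call works unchanged for the single-column and multi-column cases.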

Code Location

For now, perhaps we can put this in the timeseries module, although the team is actively thinking about how to better structure the SDMetrics library. You can inherit from the base class as needed.

Let me know if you have any questions!

npatki, Sep 05 '22 18:09

Created PR #232 to add this test

ashafquat-mdsol, Sep 24 '22 01:09