SDMetrics
Implement MSAS
Problem Description
The metrics currently implemented in SDV do not specifically measure the quality of sequences generated with CPAR.
Expected behavior
MSAS is a metric for sequential data quality, detailed in http://arxiv.org/abs/2207.14406. It should be implemented in SDV.
Thanks for filing @LiFaytheGoblin. We'll keep this open to track as we make progress on it.
Just a note that MSAS refers to our overall algorithm for computing sequential data quality. It works in the following steps:
- Compute a metric for every sequence in the real data to get a distribution X
- Compute the same metric for every sequence in the synthetic data to get a distribution X'
- Use the KSComplement test to compare the distributions X and X'
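To make the three steps concrete, here is a rough sketch in plain Python. It is not the SDMetrics implementation: `msas` and `ks_complement` are hypothetical helper names, sequences are assumed to be plain lists of numbers, and the hand-rolled `ks_complement` (1 minus the two-sample Kolmogorov-Smirnov statistic) only stands in for the library's KSComplement metric.

```python
from bisect import bisect_right
from statistics import mean


def ks_complement(real_values, synthetic_values):
    """Return 1 - (two-sample KS statistic); 1.0 means identical empirical CDFs."""
    real_sorted = sorted(real_values)
    synth_sorted = sorted(synthetic_values)
    max_gap = 0.0
    # The largest gap between two empirical CDFs occurs at an observed value.
    for point in set(real_sorted) | set(synth_sorted):
        cdf_real = bisect_right(real_sorted, point) / len(real_sorted)
        cdf_synth = bisect_right(synth_sorted, point) / len(synth_sorted)
        max_gap = max(max_gap, abs(cdf_real - cdf_synth))
    return 1.0 - max_gap


def msas(real_sequences, synthetic_sequences, statistic=mean):
    """Hypothetical helper: compare per-sequence statistics of two datasets."""
    # Step 1: one summary value per real sequence gives distribution X.
    x_real = [statistic(seq) for seq in real_sequences]
    # Step 2: the same statistic per synthetic sequence gives distribution X'.
    x_synth = [statistic(seq) for seq in synthetic_sequences]
    # Step 3: compare X and X' with the KS complement.
    return ks_complement(x_real, x_synth)


# Toy data, invented for illustration only.
real = [[1.0, 2.0, 3.0], [2.0, 4.0], [5.0, 5.0, 5.0, 5.0]]
synthetic = [[1.5, 2.5], [3.0, 3.0, 3.0], [4.0, 6.0]]

score_mean = msas(real, synthetic, statistic=mean)
score_length = msas(real, synthetic, statistic=len)
```

In this toy data the per-sequence means match exactly while the sequence lengths do not, so the two scores differ. That is the reason for running the same procedure with several statistics.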
Various metrics can be used in steps 1 and 2. In the paper we used: length, mean, median, standard deviation, and the difference between a row n and a later row n+t.
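For reference, the per-sequence statistics listed above could look roughly like this in plain Python. The names `SEQUENCE_STATISTICS` and `lagged_difference` are hypothetical, the use of the population standard deviation is an assumption, and the reading of the "row n versus row n+t" difference as a single subtraction is only one possible interpretation of the paper.

```python
from statistics import mean, median, pstdev


def lagged_difference(sequence, n=0, t=1):
    """Value at row n+t minus value at row n (one possible interpretation)."""
    return sequence[n + t] - sequence[n]


# Each entry maps a single sequence to one number.
SEQUENCE_STATISTICS = {
    "length": len,
    "mean": mean,
    "median": median,
    "std": pstdev,  # assumption: population rather than sample std
    "diff_0_to_1": lagged_difference,
}

# Toy sequence, invented for illustration only.
sequence = [2.0, 4.0, 4.0, 6.0]
summary = {name: fn(sequence) for name, fn in SEQUENCE_STATISTICS.items()}
```

Applying one of these functions to every sequence in a dataset yields the distribution that is then compared between real and synthetic data.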
Are there any particular metrics that are more or less important to your use case?
FYI some metrics that will use MSAS are actively being discussed in #198