When synthesizing time-series data, keep the `sequence_key` consistent with the original data.
Problem Description
I want the sequence_key values in the data simulated by PARSynthesizer to match the original data. Currently, SDV requires the sequence_key to be specified as an ID sdtype, and ID columns generate random values, which does not meet my needs.
Expected behavior
```python
from datetime import datetime
from datetime import timedelta

import numpy as np
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.sequential import PARSynthesizer


def mock_data():
    seq_ids = [900001, 900002, 900003, 900004, 900005, 900006, 900007, 900008, 900009]
    start_time = datetime(2023, 10, 28, 4, 15)
    data = []
    for seq_id in seq_ids:
        for i in range(5):
            date_time = start_time + timedelta(minutes=15 * i)
            value = np.random.uniform(1, 100)
            data.append({'seq_id': seq_id, 'datetime': date_time, 'value': value})
    df = pd.DataFrame(data)
    return df


real_data = mock_data()
print(real_data.head(100))

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real_data)
metadata.update_column(column_name='seq_id', sdtype='id')
metadata.set_sequence_key('seq_id')
metadata.set_sequence_index('datetime')

synthesizer = PARSynthesizer(metadata, verbose=True, epochs=128)
synthesizer.fit(real_data)
synthetic_data = synthesizer.sample(num_sequences=9, sequence_length=5)
print(synthetic_data.head(100))
```
Output:
seq_id datetime value
0 900001 2023-10-28 04:15:00 88.343075
1 900001 2023-10-28 04:30:00 24.453783
2 900001 2023-10-28 04:45:00 66.201311
3 900001 2023-10-28 05:00:00 68.288793
4 900001 2023-10-28 05:15:00 10.555130
5 900002 2023-10-28 04:15:00 80.262662
6 900002 2023-10-28 04:30:00 24.370064
7 900002 2023-10-28 04:45:00 44.250974
8 900002 2023-10-28 05:00:00 64.370600
9 900002 2023-10-28 05:15:00 45.912854
10 900003 2023-10-28 04:15:00 25.695243
11 900003 2023-10-28 04:30:00 36.785977
12 900003 2023-10-28 04:45:00 59.255933
13 900003 2023-10-28 05:00:00 83.631524
14 900003 2023-10-28 05:15:00 34.161183
15 900004 2023-10-28 04:15:00 76.622192
16 900004 2023-10-28 04:30:00 97.635065
17 900004 2023-10-28 04:45:00 50.463373
18 900004 2023-10-28 05:00:00 53.162506
19 900004 2023-10-28 05:15:00 45.024679
20 900005 2023-10-28 04:15:00 55.967175
21 900005 2023-10-28 04:30:00 60.319006
22 900005 2023-10-28 04:45:00 4.969765
23 900005 2023-10-28 05:00:00 86.819081
24 900005 2023-10-28 05:15:00 30.062426
25 900006 2023-10-28 04:15:00 24.383062
26 900006 2023-10-28 04:30:00 73.899661
27 900006 2023-10-28 04:45:00 24.390200
28 900006 2023-10-28 05:00:00 51.452548
29 900006 2023-10-28 05:15:00 15.362983
30 900007 2023-10-28 04:15:00 6.963062
31 900007 2023-10-28 04:30:00 23.488596
32 900007 2023-10-28 04:45:00 71.267673
33 900007 2023-10-28 05:00:00 87.326087
34 900007 2023-10-28 05:15:00 89.441880
35 900008 2023-10-28 04:15:00 5.404834
36 900008 2023-10-28 04:30:00 56.468166
37 900008 2023-10-28 04:45:00 84.345769
38 900008 2023-10-28 05:00:00 62.231551
39 900008 2023-10-28 05:15:00 4.202525
40 900009 2023-10-28 04:15:00 73.981153
41 900009 2023-10-28 04:30:00 96.336560
42 900009 2023-10-28 04:45:00 46.061168
43 900009 2023-10-28 05:00:00 58.442370
44 900009 2023-10-28 05:15:00 74.846215
Loss (-0.589): 100%|██████████| 128/128 [00:00<00:00, 177.78it/s]
100%|██████████| 9/9 [00:00<00:00, 288.40it/s]
seq_id datetime value
0 636572584 2023-10-28 04:15:00 58.651855
1 636572584 2023-10-28 04:30:00 71.247476
2 636572584 2023-10-28 04:45:00 80.382582
3 636572584 2023-10-28 05:00:00 58.768539
4 636572584 2023-10-28 05:15:00 4.202525
5 705351915 2023-10-28 04:15:00 57.333614
6 705351915 2023-10-28 04:30:00 48.996900
7 705351915 2023-10-28 04:45:00 26.126351
8 705351915 2023-10-28 05:00:00 19.670256
9 705351915 2023-10-28 05:15:00 28.732383
10 698301954 2023-10-28 04:15:00 70.868821
11 698301954 2023-10-28 04:30:00 35.946020
12 698301954 2023-10-28 04:45:00 45.832996
13 698301954 2023-10-28 05:00:00 55.259220
14 698301954 2023-10-28 05:15:00 24.919739
15 162314092 2023-10-28 04:15:00 52.038490
16 162314092 2023-10-28 04:30:00 68.241808
17 162314092 2023-10-28 04:45:00 56.938387
18 162314092 2023-10-28 05:00:00 38.176355
19 162314092 2023-10-28 05:15:00 48.828152
20 601353867 2023-10-28 04:15:00 12.429452
21 601353867 2023-10-28 04:30:00 24.435029
22 601353867 2023-10-28 04:45:00 69.493305
23 601353867 2023-10-28 05:00:00 29.973742
24 601353867 2023-10-28 05:15:00 29.952920
25 597864398 2023-10-28 04:15:00 87.737777
26 597864398 2023-10-28 04:30:00 50.875915
27 597864398 2023-10-28 04:45:00 97.635065
28 597864398 2023-10-28 05:00:00 52.373688
29 597864398 2023-10-28 05:15:00 75.213350
30 522040997 2023-10-28 04:15:00 33.702249
31 522040997 2023-10-28 04:30:00 70.472768
32 522040997 2023-10-28 04:45:00 44.026007
33 522040997 2023-10-28 05:00:00 77.789348
34 522040997 2023-10-28 05:15:00 57.519564
35 679899017 2023-10-28 04:15:00 48.479804
36 679899017 2023-10-28 04:30:00 40.817928
37 679899017 2023-10-28 04:45:00 60.656329
38 679899017 2023-10-28 05:00:00 37.039939
39 679899017 2023-10-28 05:15:00 31.768680
40 813610428 2023-10-28 04:15:00 45.273490
41 813610428 2023-10-28 04:30:00 66.579771
42 813610428 2023-10-28 04:45:00 69.789106
43 813610428 2023-10-28 05:00:00 10.598702
44 813610428 2023-10-28 05:15:00 43.095843
This code is a simple demo of PARSynthesizer; the seq_id values in synthetic_data don't match those in real_data.
Expectation: provide a solution that ensures the ID values of my synthetic_data and real_data are consistent, not just in format but completely identical in value.
Hi @jalr4ever I'm curious to learn more -- why is generating the same ID values important for your use case?
One challenge is that you may have 5 sequence_key values in your real data but request 500 sequences from the trained synthesizer, which creates an ambiguous situation for what the remaining 495 sequence_key values should be.
If the rough format of the generated sequence_key values is important to you, you can specify a regular expression string in your sdtype: https://docs.sdv.dev/sdv/reference/metadata-spec/sdtypes#id This workflow has some guardrails built in because of the ambiguity between a small set of sequence_key values in the real data and a potentially large set requested in the synthetic data. (E.g., if you define a regex format that only allows 2-digit numeric values but ask for 1000 sequences, SDV will throw an error.)
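For concreteness, here is a minimal sketch of that regex approach. The pattern `9[0-9]{5}` is my assumption, chosen to match the 900001-style ids in the demo above; `regex_format` is the parameter documented on the SDV sdtypes reference page linked above.

```python
import re

# An 'id' column spec following the SDV metadata sdtypes reference. In the
# demo script this would be applied as:
#   metadata.update_column(column_name='seq_id', sdtype='id',
#                          regex_format='9[0-9]{5}')
# so generated seq_id values keep the original 9xxxxx shape (though not
# the original values themselves).
seq_id_spec = {'sdtype': 'id', 'regex_format': '9[0-9]{5}'}

# Sanity-check the pattern against ids from the demo above and from the
# unconstrained synthetic output.
assert re.fullmatch(seq_id_spec['regex_format'], '900001')
assert not re.fullmatch(seq_id_spec['regex_format'], '636572584')
```

Note the guardrail mentioned above: this pattern admits only 100,000 distinct values, so requesting more sequences than the pattern can cover would fail.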
Hi @jalr4ever, adding to @srinify's comment: SDV's synthetic data is designed to create brand new entities that are not necessarily analogous to any entity in your real data. So each synthetic sequence represents an entirely new entity; it does not map to any one analogous real sequence. If the desire is to have a fully complete, 1-to-1 mapping between real and synthetic sequences (same sequence IDs), then I would suggest anonymizing the real data itself rather than creating synthetic data.
If you could describe your needs, we'd be happy to guide you to a solution. How are you planning to use the synthetic data, and what do the sequences represent?
Hi, @srinify @npatki. Thank you both for your replies. To put it simply, I need to know which sequence in the synthetic data corresponds to which sequence in my original data. My scenario is as follows: I need to provide a report showing a comparative chart of the distribution similarities between sequences in the original data and those in the synthetic data. Therefore, I need an "ID" column that gives a one-to-one correspondence between the original and synthetic data, so that I can compute the distribution for each sequence ID and create plots from it.
Regarding the 5-vs-495 issue @srinify mentioned: I'm actually not interested in synthetic data that exceeds the range of the original sequences. Currently, I groupby() the ID column of the original data and count(), then pass that to SDV to generate a number of sequences matching my original data. So for the 5-vs-495 issue, my understanding is that SDV simply lacks the corresponding boundary control at this time: there is enforce_min_max_values=True for non-sequential data, but nothing like enforce_max_sequence_num=True for sequential data.
Overall, my requirements have two aspects: first, support for boundary control like enforce_max_sequence_num=True; second, support for one-to-one mapping of sequence IDs. Ideally, provide control options so that synthetic data corresponds directly to real data; if that's not possible, then provide a mapping list that maps synthetic sequence IDs to original IDs one-to-one.
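For reporting purposes, one rough workaround is a purely positional pairing between synthetic and real sequences. This is a sketch under strong assumptions (equal sequence counts, and that order of appearance is an acceptable pairing key); the helper name is mine, and SDV itself guarantees no such correspondence:

```python
import pandas as pd

def relabel_sequences(real, synth, key='seq_id'):
    """Map synthetic sequence ids onto real ones by order of appearance.

    This is only a positional pairing for reporting purposes;
    PARSynthesizer does not model any 1-to-1 correspondence.
    """
    real_ids = real[key].drop_duplicates().tolist()
    synth_ids = synth[key].drop_duplicates().tolist()
    if len(real_ids) != len(synth_ids):
        raise ValueError('real and synthetic sequence counts differ')
    # Pair the i-th synthetic sequence with the i-th real sequence.
    mapping = dict(zip(synth_ids, real_ids))
    out = synth.copy()
    out[key] = out[key].map(mapping)
    return out

real = pd.DataFrame({'seq_id': [900001, 900001, 900002, 900002]})
synth = pd.DataFrame({'seq_id': [636572584, 636572584, 705351915, 705351915]})
print(relabel_sequences(real, synth)['seq_id'].tolist())
# -> [900001, 900001, 900002, 900002]
```

The relabeled frame can then be grouped by seq_id alongside the real data for side-by-side distribution plots.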
Hi @jalr4ever, unfortunately the PARSynthesizer is not designed to ever learn or create an exact 1-to-1 analogous mapping.
To illustrate this, see the example table in our docs page. In this example, each Patient ID is a sequence. The synthetic data is designed to represent brand new patients that do not correspond to any 1 original patient and health-related sequences for each one. It is not designed to recycle the same patients that are already in the real data.
I would love to understand a bit more about your use case. Why is it necessary to have the exact same sequence IDs? What does each sequence ID represent in your data, and how are you planning to use the synthetic data after creating it?
If it is a matter of showing a report, we can recommend some different metrics and visualizations that are more attuned to multi-sequence data (where you do not have an analogous 1-to-1 mapping).
Hi @npatki. In fact, we will use this data for machine learning, but how do we assess its reliability? For non-time-series data, there are metrics that capture the "shape" of the data (KSComplement). I would like to print out the corresponding "shape" for each sequence in the temporal data for comparison as well.
Hi @jalr4ever, just out of curiosity: If you are planning to use the data for machine learning, I assume you have a train/validation/test data setup. Is it the case that your validation/test data always has the same sequence IDs as the real data? What about any new data for which you'd want to make a prediction?
As for metrics and visualization:
- Since you already have a machine learning use case in mind, I think the best "metric" here might be to directly measure the ROI, e.g., what is the predictive accuracy before vs. after using synthetic data?
- I would recommend looking into our original PARSynthesizer paper. In section 4.2, we describe a framework called MSAS (Multi-Sequence Aggregate Similarity) aimed at capturing exactly the question you have. Unfortunately, this metric is not yet available in SDMetrics, but we hope to add it soon!
@npatki Hi, thank you for your suggestion. I will take a look at the MSAS metric. Currently, our training is actually per sequence: we train a prediction model for each sequence and split train/test data within each sequence, so the sequence IDs are the same in both splits. That is why we want to know which original sequence each synthetic sequence corresponds to, so we can tell which original sequence a given model represents.
Thank you for your comments @jalr4ever. Very helpful.
In your case, I'm not entirely sure synthetic data is the right approach, as synthetic data is inherently designed to create brand new sequences belonging to entirely new entities. If the desire is to have only the same sequences, I am thinking perhaps anonymization or adding noise to the data would be sufficient (rather than synthetic data)?
May I ask why you are unable to train/test on the real sequences? Is it a matter of privacy, or do you simply not have long enough sequences for the task?
@npatki Yes, our current solution involves anonymization. We implemented this due to privacy concerns when sharing data between departments.
Hi @jalr4ever if you're interested in pure anonymization or perturbations of the existing data, there's a chance that the RDT library may help. It allows you to transform the existing data, and has a few features for anonymization.
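Independent of RDT, the core idea, a stable pseudonym per real id, can be sketched in a few lines of pandas. The helper name and the secret-keyed hashing scheme below are illustrative assumptions, not RDT's API:

```python
import hashlib
import pandas as pd

def pseudonymize(df, key_column, secret='rotate-me'):
    """Replace each id with a stable pseudonym derived from a secret.

    Deterministic: the same real id always maps to the same pseudonym,
    so sequences stay intact, but the original values are not exposed
    (as long as the secret is kept private).
    """
    def pseudonym(value):
        digest = hashlib.sha256(f'{secret}:{value}'.encode()).hexdigest()
        return f'seq_{digest[:8]}'
    out = df.copy()
    out[key_column] = out[key_column].map(pseudonym)
    return out

df = pd.DataFrame({'seq_id': [900001, 900001, 900002],
                   'value': [1.0, 2.0, 3.0]})
anon = pseudonymize(df, 'seq_id')
# Same original id -> same pseudonym; distinct ids stay distinct.
assert anon['seq_id'].iloc[0] == anon['seq_id'].iloc[1]
assert anon['seq_id'].nunique() == 2
```

Because the mapping is deterministic, the department holding the secret can later recover which pseudonym corresponds to which original sequence, which addresses the 1-to-1 reporting need above.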
If your team ever wants to explore creating brand new sequences (e.g., to test a variety of diverse scenarios, or to scale up your data), we'd gladly help you explore synthetic data solutions with PARSynthesizer.
For now, I'm closing off the issue, but please feel free to reply if there is more to discuss and I can always re-open. (Alternatively, file a new issue for a new topic.) Thanks.