SDV
PAR Diagnostic is not 1.0 for datetime context columns
Environment Details
Please indicate the following details about the environment in which you found the bug:
- SDV version: 1.12.0
- Python version: 3.12
- Operating System: Linux
Error Description
As originally described by @Ng-ms in #2004: When there is a datetime context column, the min/max bounds of the synthesized data fall outside the range observed in the real data. This causes the BoundaryAdherence score to be <1.0 for that context column.
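To make the symptom concrete, here is a minimal sketch of what a boundary adherence check measures: the fraction of synthetic values that fall inside the real data's observed [min, max] range. It is a plain-Python illustration, not the SDMetrics implementation; the function name `boundary_adherence` and the sample dates are made up for this example.

```python
from datetime import datetime


def boundary_adherence(real_values, synthetic_values):
    """Fraction of non-missing synthetic values inside the real [min, max] range."""
    real = [v for v in real_values if v is not None]
    synthetic = [v for v in synthetic_values if v is not None]
    if not real or not synthetic:
        raise ValueError("need at least one non-missing value on each side")
    low, high = min(real), max(real)
    in_bounds = sum(low <= v <= high for v in synthetic)
    return in_bounds / len(synthetic)


# Hypothetical values standing in for the pre_date context column.
real_pre_date = [datetime(2020, 1, 5), datetime(2020, 3, 1), datetime(2020, 6, 30)]
synthetic_pre_date = [
    datetime(2020, 2, 1),    # inside the real range
    datetime(2020, 6, 30),   # on the upper bound, counts as in range
    datetime(2020, 7, 2),    # after the real maximum
    datetime(2019, 12, 31),  # before the real minimum
]

score = boundary_adherence(real_pre_date, synthetic_pre_date)
print(score)  # 0.5
```

A score of 1.0 would mean every synthetic pre_date lies within the real range, which is what enforce_min_max_values=True is expected to produce.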
Steps to reproduce
Note that the dataset is not available for privacy reasons. The SDV team will try to replicate this with SDV demo data.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sdv.sequential import PARSynthesizer

# df, numeric_columns, date_columns and metadata come from the private dataset (not shown)
min_max_scaler = MinMaxScaler()
df[numeric_columns] = min_max_scaler.fit_transform(df[numeric_columns])
df[date_columns] = df[date_columns].apply(pd.to_datetime, format='%d/%m/%Y', errors='coerce')
df['pre_date'] = pd.to_datetime(df['pre_date'], unit='ns').astype(int)
metadata.set_sequence_index(column_name='visit_date')
# 'sex' was unquoted in the original report; assumed here to be a column name
synthesizer = PARSynthesizer(metadata, epochs=1000, context_columns=['pre_date', 'sex', 'Cod'], verbose=True, enforce_min_max_values=True, enforce_rounding=True, cuda=True)
synthesizer.fit(df)
synthetic_data = synthesizer.sample(num_sequences=4000, sequence_length=None)
Diagnostic score output:
For this issue, let's focus on the fact that the context column pre_date has a score <1.0. The sequence index visit_date is tracked in a separate issue.
I'm not able to reproduce this issue using our demo datasets (or even using randomly generated data).
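For anyone who wants to attempt a reproduction without the private dataset, the following is a rough sketch of how randomly generated sequential data with a datetime context column could be built. It is not the script the SDV team used. The column names pre_date, sex and visit_date mirror the report, while `patient_id`, `value` and the helper `make_sequences` are hypothetical. The context values are held constant within each sequence, since that is what makes them context columns.

```python
import random
from datetime import datetime, timedelta


def make_sequences(num_sequences=5, seed=0):
    """Build rows of random sequential data with per-sequence context values."""
    rng = random.Random(seed)
    rows = []
    for seq_id in range(num_sequences):
        # Context values: drawn once per sequence and repeated on every row.
        pre_date = datetime(2020, 1, 1) + timedelta(days=rng.randint(0, 365))
        sex = rng.choice(["F", "M"])
        visit = pre_date
        for _ in range(rng.randint(2, 6)):
            # Sequence index: strictly increasing within a sequence.
            visit += timedelta(days=rng.randint(1, 30))
            rows.append({
                "patient_id": seq_id,  # hypothetical sequence key
                "pre_date": pre_date,
                "sex": sex,
                "visit_date": visit,
                "value": rng.random(),
            })
    return rows


rows = make_sequences()
print(len(rows), "rows across", len({r["patient_id"] for r in rows}), "sequences")
```

The resulting rows can be loaded into a DataFrame, paired with matching metadata, and passed to PARSynthesizer in the same way as in the steps above.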
I'll leave this issue open if someone is able to come along and share code to help us reproduce this issue! @ng-ms
Closing this issue off as unable to replicate. If anyone else runs into this, please feel free to reply below with some code to replicate it -- we can always re-open to investigate.