PAR can't fit if Range constraint includes a `sequence_index` column
Environment Details
SDV version: 1.15.0 (Latest)
Error Description
If you fit a PARSynthesizer with a Range constraint that references the sequence_index column, the fit call raises a KeyError.
Steps to reproduce
!pip install sdv==1.15.0
import pandas as pd
import random
from datetime import datetime, timedelta
from sdv.sequential import PARSynthesizer
from sdv.metadata import SingleTableMetadata
event_start_date = datetime(2024, 1, 1)
event_end_date = datetime(2024, 7, 1)
n = 10
# Every row shares the same lower and upper bound dates
start_dates = [datetime(2023, 9, 1).strftime('%Y-%m-%d') for _ in range(n)]
# Random event dates between event_start_date and event_end_date
event_dates = [(event_start_date + timedelta(days=random.randint(0, (event_end_date - event_start_date).days))).strftime('%Y-%m-%d') for _ in range(n)]
end_dates = [datetime(2025, 1, 1).strftime('%Y-%m-%d') for _ in range(n)]
s_key = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
val = [51, 53, 54, 55, 56, 12, 13, 14, 15, 16]
df = pd.DataFrame(
    {
        "FirstDate": start_dates,
        "LatestDate": end_dates,
        "EventDate": event_dates,
        "s_key": s_key,
        "val": val
    }
)
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(data=df)
metadata.update_column(column_name='s_key', sdtype='id')
metadata.set_sequence_index(column_name="EventDate")
metadata.set_sequence_key(column_name="s_key")
synthesizer = PARSynthesizer(metadata, verbose=True, epochs=5)
master_date_constraint = {
    'constraint_class': 'Range',
    'constraint_parameters': {
        'low_column_name': 'FirstDate',
        'middle_column_name': 'EventDate',
        'high_column_name': 'LatestDate',
        'strict_boundaries': False
    }
}
synthesizer.add_constraints(constraints=[master_date_constraint])
synthesizer.fit(df)
Error:
Colab Notebook to Reproduce
Workaround
Suppose you have 3 datetime columns (e.g. FirstDate, EventDate, LatestDate) and want a Range constraint so that synthetic EventDate values fall between the other 2 columns. Instead of using a constraint, you can replace FirstDate and LatestDate with date diff columns measured relative to EventDate and model those directly in the SDV.
Here's some example code that computes date diff columns:
# To replicate my sample data, use first half of the code in the issue body above
# Compute date diff columns, one for the lower bound and one for the upper bound
df['EventDate'] = pd.to_datetime(df['EventDate'])
df['LowerDiff'] = (pd.to_datetime(df['FirstDate']) - pd.to_datetime(df['EventDate'])).dt.days
df['UpperDiff'] = (pd.to_datetime(df['LatestDate']) - pd.to_datetime(df['EventDate'])).dt.days
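# Assumption: metadata2 is fresh metadata detected from the table in which
# the diff columns replace FirstDate and LatestDate
df_model = df.drop(columns=['FirstDate', 'LatestDate'])
metadata2 = SingleTableMetadata()
metadata2.detect_from_dataframe(data=df_model)
# Make sure these columns are tagged as numerical in metadata
metadata2.update_column(column_name='s_key', sdtype='id') # Sequence Key column
metadata2.update_column(column_name='LowerDiff', sdtype='numerical')
metadata2.update_column(column_name='UpperDiff', sdtype='numerical')
metadata2.set_sequence_index(column_name="EventDate")
metadata2.set_sequence_key(column_name="s_key")
synthesizer = PARSynthesizer(metadata2, verbose=True, epochs=5)
synthesizer.fit(df_model)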
synthetic_data = synthesizer.sample(10)
# Rebuild the bound columns; pd.to_datetime covers the case where EventDate is an object / string dtype column
synthetic_data['FirstDate'] = pd.to_datetime(synthetic_data['EventDate']) + pd.to_timedelta(synthetic_data['LowerDiff'], unit='D')
synthetic_data['LatestDate'] = pd.to_datetime(synthetic_data['EventDate']) + pd.to_timedelta(synthetic_data['UpperDiff'], unit='D')
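Because LowerDiff and UpperDiff are modeled as ordinary numerical columns, a sampled row could in principle come back with a positive LowerDiff or a negative UpperDiff, which would put EventDate outside the rebuilt bounds. Below is a minimal sketch of one way to guard against that, assuming you would rather clip such rows than reject them. The `rebuild_bounds` helper is hypothetical and not part of SDV, and the sample rows are hand-written stand-ins for `synthesizer.sample()` output.

```python
import pandas as pd


def rebuild_bounds(synthetic_data):
    """Rebuild FirstDate/LatestDate from the diff columns, clipping wrong signs."""
    out = synthetic_data.copy()
    event = pd.to_datetime(out['EventDate'])
    # LowerDiff = FirstDate - EventDate, so it should never be positive
    lower = out['LowerDiff'].clip(upper=0)
    # UpperDiff = LatestDate - EventDate, so it should never be negative
    upper = out['UpperDiff'].clip(lower=0)
    out['FirstDate'] = event + pd.to_timedelta(lower, unit='D')
    out['LatestDate'] = event + pd.to_timedelta(upper, unit='D')
    return out


# Hand-written rows standing in for sampled data (not real SDV output)
sample = pd.DataFrame({
    'EventDate': ['2024-03-01', '2024-05-15'],
    'LowerDiff': [-182, 4],   # second row has the wrong sign
    'UpperDiff': [306, -2],   # second row has the wrong sign
})
rebuilt = rebuild_bounds(sample)
print(rebuilt[['FirstDate', 'EventDate', 'LatestDate']])
```

Clipping collapses an offending bound onto EventDate itself. If that distorts your data too much, dropping or resampling those rows is an alternative.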
Hi, thanks for the workaround! I use a similar approach: I model the actual value rescaled to a [0, 1] range and store the lower and upper bounds separately.
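For readers who want to try the [0, 1] idea, here is a rough sketch of how it might look with pandas. It is one interpretation of the comment above, not the commenter's actual code. The helper names `to_fraction` and `from_fraction` are hypothetical, and the sketch assumes LatestDate is strictly after FirstDate in every row so the span is never zero.

```python
import pandas as pd


def to_fraction(df):
    """Express EventDate as a position in [0, 1] between FirstDate and LatestDate."""
    first = pd.to_datetime(df['FirstDate'])
    latest = pd.to_datetime(df['LatestDate'])
    event = pd.to_datetime(df['EventDate'])
    span_days = (latest - first).dt.days  # assumed to be > 0
    return (event - first).dt.days / span_days


def from_fraction(fraction, first_date, latest_date):
    """Map a (possibly synthetic) fraction back to a date inside the bounds."""
    first = pd.to_datetime(first_date)
    latest = pd.to_datetime(latest_date)
    span_days = (latest - first).dt.days
    # Clip so a modeled value slightly outside [0, 1] still lands in range
    clipped = fraction.clip(lower=0, upper=1)
    offset = (clipped * span_days).round()
    return first + pd.to_timedelta(offset, unit='D')


example = pd.DataFrame({
    'FirstDate': ['2023-09-01', '2023-09-01'],
    'EventDate': ['2024-03-01', '2024-06-20'],
    'LatestDate': ['2025-01-01', '2025-01-01'],
})
example['EventFraction'] = to_fraction(example)
restored = from_fraction(example['EventFraction'],
                         example['FirstDate'], example['LatestDate'])
print(example[['EventFraction']].assign(Restored=restored))
```

In this setup the fraction column is what would be modeled as a numerical column, while the bounds are kept aside and reapplied after sampling.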