PAR can't fit if Range constraint includes a `sequence_index` column

Environment Details

SDV version: 1.15.0 (Latest)

Error Description

If you try to fit a PARSynthesizer with a Range constraint that involves the sequence_index column, you get a KeyError.

Steps to reproduce

!pip install sdv==1.15.0

import pandas as pd
import random
from datetime import datetime, timedelta
from sdv.sequential import PARSynthesizer
from sdv.metadata import SingleTableMetadata

event_start_date = datetime(2024, 1, 1)
event_end_date = datetime(2024, 7, 1)
n = 10

start_dates = [(datetime(2023,9,1)).strftime('%Y-%m-%d') for _ in range(n)]
event_dates = [(event_start_date + timedelta(days=random.randint(0, (event_end_date - event_start_date).days))).strftime('%Y-%m-%d') for _ in range(n)]
end_dates = [(datetime(2025,1,1)).strftime('%Y-%m-%d') for _ in range(n)]

s_key = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
val = [51, 53, 54, 55, 56, 12, 13, 14, 15, 16]

df = pd.DataFrame(
    {
        "FirstDate": start_dates,
        "LatestDate": end_dates,
        "EventDate": event_dates,
        "s_key": s_key,
        "val": val
    }
)

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(data=df)
metadata.update_column(column_name='s_key', sdtype='id')
metadata.set_sequence_index(column_name="EventDate")
metadata.set_sequence_key(column_name="s_key")

synthesizer = PARSynthesizer(metadata, verbose=True, epochs=5)

master_date_constraint = {
    'constraint_class': 'Range',
    'constraint_parameters': {
        'low_column_name': 'FirstDate',
        'middle_column_name': 'EventDate',
        'high_column_name': 'LatestDate',
        'strict_boundaries': False
    }
}

synthesizer.add_constraints(constraints=[master_date_constraint])

synthesizer.fit(df)

Error: `synthesizer.fit(df)` raises a KeyError.

Colab Notebook to Reproduce

Colab Link

Aug 12 '24 16:08 srinify

Workaround

If you have three datetime columns (e.g. FirstDate, EventDate, LatestDate) that you want to use in a Range constraint (so that synthetic EventDate values fall between the other two columns), you can instead create date diff columns that replace FirstDate and LatestDate, and model those directly in the SDV without using constraints at all.

Here's some example code that computes date diff columns:

# To replicate my sample data, use first half of the code in the issue body above

# Compute date diff columns, one for the lower bound and one for the upper bound
df['EventDate'] = pd.to_datetime(df['EventDate'])
df['LowerDiff'] = (pd.to_datetime(df['FirstDate']) - df['EventDate']).dt.days
df['UpperDiff'] = (pd.to_datetime(df['LatestDate']) - df['EventDate']).dt.days

# The diff columns replace FirstDate and LatestDate, so drop the originals before modeling
df = df.drop(columns=['FirstDate', 'LatestDate'])

# Detect fresh metadata so it knows about the new columns,
# then make sure the diff columns are tagged as numerical
metadata2 = SingleTableMetadata()
metadata2.detect_from_dataframe(data=df)
metadata2.update_column(column_name='s_key', sdtype='id') # Sequence Key column
metadata2.update_column(column_name='LowerDiff', sdtype='numerical')
metadata2.update_column(column_name='UpperDiff', sdtype='numerical')
metadata2.set_sequence_index(column_name="EventDate")
metadata2.set_sequence_key(column_name="s_key")

synthesizer = PARSynthesizer(metadata2, verbose=True, epochs=5)
synthesizer.fit(df)

synthetic_data = synthesizer.sample(10)

# Rebuild FirstDate and LatestDate from the diffs; pd.to_datetime also covers the case where EventDate is an Object / String dtype column
synthetic_data['FirstDate'] = pd.to_datetime(synthetic_data['EventDate']) + pd.to_timedelta(synthetic_data['LowerDiff'], unit='D')

synthetic_data['LatestDate'] = pd.to_datetime(synthetic_data['EventDate']) + pd.to_timedelta(synthetic_data['UpperDiff'], unit='D')
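
Because the diff columns are modeled as ordinary numerical columns, nothing forces a sampled LowerDiff to stay at or below zero, or a sampled UpperDiff to stay at or above zero. The following is a minimal sketch, assuming that clipping wrong-signed diffs is acceptable for your use case. The helper name `rebuild_bounds` and the small stand-in DataFrame are hypothetical and not part of the SDV.

```python
import pandas as pd

def rebuild_bounds(synthetic: pd.DataFrame) -> pd.DataFrame:
    """Rebuild FirstDate / LatestDate from EventDate and the diff columns."""
    out = synthetic.copy()
    event = pd.to_datetime(out['EventDate'])
    # LowerDiff = FirstDate - EventDate, so valid values are <= 0
    lower = out['LowerDiff'].clip(upper=0).round()
    # UpperDiff = LatestDate - EventDate, so valid values are >= 0
    upper = out['UpperDiff'].clip(lower=0).round()
    out['FirstDate'] = event + pd.to_timedelta(lower, unit='D')
    out['LatestDate'] = event + pd.to_timedelta(upper, unit='D')
    return out

# Stand-in for the output of synthesizer.sample(...); values are made up
sample = pd.DataFrame({
    'EventDate': ['2024-02-01', '2024-03-15'],
    'LowerDiff': [-153.0, 4.0],   # second row has the wrong sign on purpose
    'UpperDiff': [335.0, -2.0],   # second row has the wrong sign on purpose
})
rebuilt = rebuild_bounds(sample)

# Check that FirstDate <= EventDate <= LatestDate holds in every row
event = pd.to_datetime(rebuilt['EventDate'])
ok = bool(((rebuilt['FirstDate'] <= event) & (event <= rebuilt['LatestDate'])).all())
```

Clipping collapses an invalid bound onto EventDate itself; dropping or resampling such rows is an alternative if that distorts the data too much.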

Aug 13 '24 20:08 srinify

Hi, thanks for the workaround! I use a similar approach: I model the actual value in a [0, 1] range and store the lower and upper bounds separately.
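
The comment does not include code, so the following is only one possible reading of that approach, sketched under assumptions: the event date is stored as its fractional position between the two bounds, and mapped back after sampling. The function names `to_fraction` and `from_fraction` are hypothetical.

```python
from datetime import date, timedelta

def to_fraction(event: date, low: date, high: date) -> float:
    """Position of event within [low, high], as a value in [0, 1]."""
    span = (high - low).days
    if span == 0:
        return 0.0  # degenerate range: the event must equal the bounds
    return (event - low).days / span

def from_fraction(frac: float, low: date, high: date) -> date:
    """Map a (possibly out-of-range) fraction back to a date within [low, high]."""
    frac = min(max(frac, 0.0), 1.0)  # clamp so the result can never leave the range
    return low + timedelta(days=round(frac * (high - low).days))

low, high = date(2023, 9, 1), date(2025, 1, 1)
event = date(2024, 2, 1)

frac = to_fraction(event, low, high)       # value to model instead of the raw date
restored = from_fraction(frac, low, high)  # date recovered after sampling
```

Since the fraction is clamped before it is mapped back, any sampled value yields a date inside the bounds, which is the property the Range constraint was meant to enforce.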

Aug 14 '24 07:08 MichaelG-Uke
