Error when using a datetime column as a context column with PAR Synthesizer
Environment Details
- SDV version: 1.15
- Python version: 3.12
- Operating System: Linux
Error Description
Using datetime objects in a context column results in the following error:
ValueError: Error: Sampling terminated. No results were saved due to unspecified "output_file_path".
could not convert string to float: '2006-01-01'
Steps to reproduce
!pip install sdv==1.15.0
import pandas as pd
import random
from datetime import datetime, timedelta
from sdv.sequential import PARSynthesizer
from sdv.metadata import SingleTableMetadata
event_start_date = datetime(2024, 1, 1)
event_end_date = datetime(2024, 7, 1)
n = 10
start_dates = [(datetime(2023,9,1)).strftime('%Y-%m-%d') for _ in range(n)]
context_dates = [(event_start_date + timedelta(days=random.randint(0, (event_end_date - event_start_date).days))).strftime('%Y-%m-%d') for _ in range(n)]
s_key = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
val = [51, 53, 54, 55, 56, 12, 13, 14, 15, 16]
df = pd.DataFrame(
    {
        "Date": start_dates,
        "s_key": s_key,
        "val": val
    }
)
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(data=df)
metadata.update_column(column_name='s_key', sdtype='id')
metadata.set_sequence_key(column_name="s_key")
synthesizer = PARSynthesizer(metadata, verbose=True, epochs=5, context_columns=["Date"])
event_context = pd.DataFrame(data={
    "Date": context_dates
})
synthesizer.fit(df)
synthesizer.sample_sequential_columns(context_columns=event_context)
Thanks for raising this, @MichaelG-Uke. I ran into an error during the synthesizer.fit(df) step itself.
Did you run into your error during fit or during sampling?
I reproduced the error internally in this Colab Notebook: https://colab.research.google.com/drive/1SW5WxJgU5Y2ykmP0t793a5OE-LxKsw5H?authuser=1#scrollTo=sHSODwrsjwZ9
@srinify the error you are seeing occurs because the metadata isn't specified correctly. I am able to reproduce the same error as @MichaelG-Uke when I make sure that the 'val' column is set to numerical. Full replication code for the latest version of SDV (1.17.2) is below.
import pandas as pd
import random
from datetime import datetime, timedelta
from sdv.sequential import PARSynthesizer
from sdv.metadata import Metadata
event_start_date = datetime(2024, 1, 1)
event_end_date = datetime(2024, 7, 1)
n = 10
start_dates = [(datetime(2023,9,1)).strftime('%Y-%m-%d') for _ in range(n)]
context_dates = [(event_start_date + timedelta(days=random.randint(0, (event_end_date - event_start_date).days))).strftime('%Y-%m-%d') for _ in range(n)]
s_key = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
val = [51, 53, 54, 55, 56, 12, 13, 14, 15, 16]
df = pd.DataFrame(
    {
        "Date": start_dates,
        "s_key": s_key,
        "val": val
    }
)
metadata = Metadata.load_from_dict({
    'tables': {
        'table': {
            'columns': {
                'Date': {'sdtype': 'datetime', 'datetime_format': '%Y-%m-%d'},
                's_key': {'sdtype': 'id'},
                'val': {'sdtype': 'numerical'}
            },
            'sequence_key': 's_key'
        }
    }
})
synthesizer = PARSynthesizer(metadata, verbose=True, epochs=5, context_columns=["Date"])
synthesizer.fit(df)
event_context = pd.DataFrame(data={"Date": context_dates})
synthesizer.sample_sequential_columns(context_columns=event_context)
ValueError: Error: Sampling terminated. No results were saved due to unspecified "output_file_path".
could not convert string to float: '2024-01-05'
Stack trace: stack_trace.txt
Note that this issue only happens with sample_sequential_columns. The regular sample call works without issues. I will update the issue title to clarify this.
Note: The datetime issue will be fixed by #2347. However, that fix will not fully resolve the problem described in this issue, because only 1 context column is being modeled here. We are currently unable to conditionally sample in this case due to #1096. I will leave a simple workaround in that issue that you can use in the meantime.
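For anyone reading the error message: the second line ("could not convert string to float") is the message Python itself raises when a date string is passed straight to float(). The sketch below only illustrates that symptom with the standard library. It makes no claim about SDV's internals or about how #2347 fixes it; date_string_to_float and try_plain_float are hypothetical helper names, and the format string is assumed to match the datetime_format in the metadata above.

```python
from datetime import datetime, timezone

# Assumed to match the 'datetime_format' of the 'Date' column in the metadata above.
DATETIME_FORMAT = "%Y-%m-%d"


def date_string_to_float(value: str, fmt: str = DATETIME_FORMAT) -> float:
    """Hypothetical helper: parse a date string, return seconds since the Unix epoch (UTC)."""
    parsed = datetime.strptime(value, fmt).replace(tzinfo=timezone.utc)
    return parsed.timestamp()


def try_plain_float(value: str) -> str:
    """Return the message raised by float(value), or an empty string if the cast succeeds."""
    try:
        float(value)
    except ValueError as error:
        return str(error)
    return ""


# Casting the raw string fails with the same wording as the second line of the error above.
message = try_plain_float("2024-01-05")
print(message)

# Parsing with the declared format first yields a numeric value.
as_number = date_string_to_float("2024-01-05")
print(as_number)
```

This is only meant to show where the message text comes from: the context value reaches a float conversion while it is still a string.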