
Error when using a datetime column as a context column with PAR Synthesizer

Open MichaelG-Uke opened this issue 1 year ago • 2 comments

Environment Details

Please indicate the following details about the environment in which you found the bug:

  • SDV version: 1.15
  • Python version: 3.12
  • Operating System: Linux

Error Description

Using datetime objects in a context column results in the following error:

ValueError: Error: Sampling terminated. No results were saved due to unspecified "output_file_path".
could not convert string to float: '2006-01-01'
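
The second line of the message is the standard Python error for casting a non-numeric string to `float`. The sketch below only illustrates that kind of failure, assuming the date strings reach a numeric conversion without being parsed first. It is not SDV's internal code, and the variable names are made up for illustration.

```python
from datetime import datetime, timezone

date_string = "2006-01-01"

# Casting the raw date string to float raises ValueError with the same
# wording as the second line of the reported error.
try:
    float(date_string)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)

# Parsing the string with the expected format first gives a datetime,
# which can then be turned into a number (seconds since the epoch, UTC).
parsed = datetime.strptime(date_string, "%Y-%m-%d").replace(tzinfo=timezone.utc)
as_number = parsed.timestamp()
print(as_number)
```

This suggests the context values are being handled as raw strings somewhere along the sampling path, although the snippet itself cannot confirm where.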

Steps to reproduce

!pip install sdv==1.15.0

import pandas as pd
import random
from datetime import datetime, timedelta
from sdv.sequential import PARSynthesizer
from sdv.metadata import SingleTableMetadata

event_start_date = datetime(2024, 1, 1)
event_end_date = datetime(2024, 7, 1)
n = 10

start_dates = [(datetime(2023,9,1)).strftime('%Y-%m-%d') for _ in range(n)]
context_dates = [(event_start_date + timedelta(days=random.randint(0, (event_end_date - event_start_date).days))).strftime('%Y-%m-%d') for _ in range(n)]

s_key = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
val = [51, 53, 54, 55, 56, 12, 13, 14, 15, 16]

df = pd.DataFrame(
    {
        "Date": start_dates,
        "s_key": s_key,
        "val": val
    }
)

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(data=df)
metadata.update_column(column_name='s_key', sdtype='id')
metadata.set_sequence_key(column_name="s_key")

synthesizer = PARSynthesizer(metadata, verbose=True, epochs=5, context_columns=["Date"])

event_context = pd.DataFrame(data={
    "Date": context_dates
})

synthesizer.fit(df)
synthesizer.sample_sequential_columns(context_columns=event_context)

MichaelG-Uke avatar Aug 15 '24 05:08 MichaelG-Uke

Thanks for raising this, @MichaelG-Uke. I ran into an error during the synthesizer.fit(df) step itself:

[Screenshot: 2024-08-15 at 11:38:04 AM]

Did you run into your error during fit or during sampling?

srinify avatar Aug 15 '24 15:08 srinify

I reproduced the error internally in this Colab Notebook: https://colab.research.google.com/drive/1SW5WxJgU5Y2ykmP0t793a5OE-LxKsw5H?authuser=1#scrollTo=sHSODwrsjwZ9

srinify avatar Aug 27 '24 15:08 srinify

@srinify, the error you are seeing occurs because the metadata isn't specified correctly. I am able to reproduce the exact same error as @MichaelG-Uke when I make sure that the 'val' column is set to numerical. Full replication code is below for the latest version of SDV (1.17.2).

import pandas as pd
import random
from datetime import datetime, timedelta
from sdv.sequential import PARSynthesizer
from sdv.metadata import Metadata

event_start_date = datetime(2024, 1, 1)
event_end_date = datetime(2024, 7, 1)
n = 10

start_dates = [(datetime(2023,9,1)).strftime('%Y-%m-%d') for _ in range(n)]
context_dates = [(event_start_date + timedelta(days=random.randint(0, (event_end_date - event_start_date).days))).strftime('%Y-%m-%d') for _ in range(n)]

s_key = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
val = [51, 53, 54, 55, 56, 12, 13, 14, 15, 16]

df = pd.DataFrame(
    {
        "Date": start_dates,
        "s_key": s_key,
        "val": val
    }
)

metadata = Metadata.load_from_dict({
    'tables': {
        'table': {
            'columns': {
                'Date': { 'sdtype': 'datetime', 'datetime_format': '%Y-%m-%d' },
                's_key': { 'sdtype': 'id' },
                'val': { 'sdtype': 'numerical'}
            },
            'sequence_key': 's_key'
        }
    }
})

synthesizer = PARSynthesizer(metadata, verbose=True, epochs=5, context_columns=["Date"])
synthesizer.fit(df)

event_context = pd.DataFrame(data={"Date": context_dates})
synthesizer.sample_sequential_columns(context_columns=event_context)

ValueError: Error: Sampling terminated. No results were saved due to unspecified "output_file_path".
could not convert string to float: '2024-01-05'

Stack trace: stack_trace.txt

Note that this issue only occurs with sample_sequential_columns. The regular sample call works without issues. I will update the issue title to clarify this.
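
A note for anyone re-running the replication code: the context dates are drawn with random.randint and no seed, so the date quoted in the error message will differ from run to run. Below is a minimal sketch for pinning them, assuming a fixed seed is acceptable for the reproduction. The helper name make_context_dates and the seed value are hypothetical and not part of SDV or the original report.

```python
import random
from datetime import datetime, timedelta


def make_context_dates(n, start, end, seed=0):
    """Draw n dates between start and end (inclusive) as '%Y-%m-%d' strings."""
    # A local generator keeps the global random state untouched.
    rng = random.Random(seed)
    span_days = (end - start).days
    return [
        (start + timedelta(days=rng.randint(0, span_days))).strftime("%Y-%m-%d")
        for _ in range(n)
    ]


# Same date range and count as in the replication code above.
context_dates = make_context_dates(10, datetime(2024, 1, 1), datetime(2024, 7, 1))
print(context_dates)
```

The resulting list can be passed to the event_context DataFrame exactly as before. Only the randomness changes, so the failure itself should be unaffected.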

npatki avatar Dec 02 '24 16:12 npatki

Note: The datetime issue will be fixed by #2347. However, it will not fully resolve the problem stated in this issue, because there is only 1 context column here that is being modeled. We are currently unable to conditionally sample in this case due to #1096. I will leave a simple workaround in that issue that you'd be able to use in the meantime.

npatki avatar Jan 15 '25 20:01 npatki