Add support for generated columns when conditional sampling

Open srinify opened this issue 1 year ago • 0 comments

Problem Description

Every column that the SDV synthesizes falls into one of two buckets:

Modeled Columns: The data in these columns is modeled, e.g. numerical, datetime, boolean or categorical data
Generated Columns: The data in these columns is generated from scratch without modeling, e.g. primary keys, PII values
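
To make the two buckets concrete, here is a minimal sketch of how columns could be split by sdtype. The helper `split_columns`, the constant `MODELED_SDTYPES`, and the example column names other than `id` and `workclass` are hypothetical and not part of the SDV API; the sketch simply assumes that any sdtype outside the four modeled ones listed above is treated as generated.

```python
# Hypothetical helper for illustration only; this is not an SDV function.
# Assumption: the four sdtypes named above are the modeled ones, and
# everything else (ids, PII types such as 'email') is generated.
MODELED_SDTYPES = {'numerical', 'datetime', 'boolean', 'categorical'}


def split_columns(column_sdtypes):
    """Split a {column name: sdtype} mapping into (modeled, generated) lists."""
    modeled, generated = [], []
    for name, sdtype in column_sdtypes.items():
        if sdtype in MODELED_SDTYPES:
            modeled.append(name)
        else:
            generated.append(name)
    return modeled, generated


# Illustrative mapping; 'age' and 'email' are made-up columns.
example_sdtypes = {
    'id': 'id',
    'workclass': 'categorical',
    'age': 'numerical',
    'email': 'email',
}
modeled, generated = split_columns(example_sdtypes)
```

Conditioning currently works only for columns that land in the `modeled` list.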

Currently, you can't conditionally sample on ID, primary key, or other generated columns.

Expected behavior

As a user, I expect to be able to conditionally sample on any column(s) I see fit.

Additional context

I expect the following code to work:

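import pandas as pd
from sdv.single_table import GaussianCopulaSynthesizer
from sdv.datasets.demo import download_demo

# Download the demo dataset together with its metadata
data, metadata = download_demo(
    modality='single_table',
    dataset_name='census_extended'
)

# Fit the synthesizer on the real data
synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(data)

# Desired: condition on 'id' (meant here as the generated column) and 'workclass' together
synthesizer.sample_remaining_columns(data[['id', 'workclass']].head(10))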
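
Until this is supported, one possible workaround is to condition on the modeled columns only and then copy the requested generated values back onto the result. The sketch below is an assumption about how that could look, not SDV's implementation: `sample_with_generated_columns` and `fake_sampler` are hypothetical names, and the sketch assumes the sampling callable returns exactly one row per input row, in the same order.

```python
import pandas as pd


def sample_with_generated_columns(sample_remaining, known_columns, generated_columns):
    """Condition on modeled columns only, then restore the generated values.

    sample_remaining: callable that takes a DataFrame of known modeled columns
        and returns one synthetic row per input row, in the same order
        (an assumption of this sketch).
    known_columns: DataFrame holding the values to condition on.
    generated_columns: names of the columns that are generated, not modeled.
    """
    generated = [c for c in known_columns.columns if c in generated_columns]
    # Pass only the modeled columns to the sampler.
    modeled_known = known_columns.drop(columns=generated)
    sampled = sample_remaining(modeled_known).reset_index(drop=True)
    # Overwrite (or add) the generated columns with the requested values.
    for column in generated:
        sampled[column] = known_columns[column].to_numpy()
    return sampled


def fake_sampler(known):
    """Stand-in for a fitted synthesizer: echoes the input and adds one column."""
    out = known.copy()
    out['age'] = 30  # made-up value for illustration
    return out


known = pd.DataFrame({'id': [101, 102], 'workclass': ['Private', 'State-gov']})
result = sample_with_generated_columns(fake_sampler, known, generated_columns=['id'])
```

With a fitted synthesizer, the callable would presumably be `synthesizer.sample_remaining_columns`. The sketch does not handle the case where every known column is generated, and it leaves uniqueness of the supplied primary key values to the caller.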

Related to this issue: https://github.com/sdv-dev/SDV/issues/1096

May 07 '24 13:05 srinify