Add support for generated columns when conditional sampling
Problem Description
Every column that the SDV synthesizes falls into one of two buckets:
- Modeled Columns: The data in these columns is modeled, e.g. numerical, datetime, boolean or categorical data
- Generated Columns: The data in these columns is generated from scratch without modeling, e.g. primary keys, PII values
Currently, you can't conditionally sample using ID, primary key, or other generated columns.
Expected behavior
As a user, I expect to be able to conditionally sample on any column(s) I see fit.
Additional context
I expect the following code to work:
from sdv.single_table import GaussianCopulaSynthesizer
from sdv.datasets.demo import download_demo

# Load the demo dataset and its metadata
data, metadata = download_demo(
    modality='single_table',
    dataset_name='census_extended'
)

synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(data)

# Known values include 'id' (a generated column) and 'workclass' (a modeled column)
known_columns = data[['id', 'workclass']].head(10)
synthesizer.sample_remaining_columns(known_columns)
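To make the request concrete, here is a rough sketch of the behavior I have in mind. It is not SDV code: `sample_with_generated_columns` and `fake_sampler` are hypothetical names, and plain dictionaries stand in for DataFrame rows. The idea is to condition only on the modeled columns, then copy the user-supplied values for the generated columns into the result.

```python
from typing import Callable, Dict, List, Set

Row = Dict[str, object]


def sample_with_generated_columns(
    known_rows: List[Row],
    generated_columns: Set[str],
    sample_modeled: Callable[[List[Row]], List[Row]],
) -> List[Row]:
    """Hypothetical helper: condition on modeled columns, keep generated ones as given."""
    # Strip generated columns (e.g. 'id') so only modeled columns are used as conditions.
    modeled_conditions = [
        {col: val for col, val in row.items() if col not in generated_columns}
        for row in known_rows
    ]

    # Delegate to whatever does the real conditional sampling (assumed: one row out per row in).
    sampled = sample_modeled(modeled_conditions)
    if len(sampled) != len(known_rows):
        raise ValueError('Sampler must return one row per set of known values.')

    # Overwrite the generated columns with the values the user asked for.
    result = []
    for known, synthetic in zip(known_rows, sampled):
        merged = dict(synthetic)
        merged.update({col: known[col] for col in generated_columns if col in known})
        result.append(merged)
    return result


def fake_sampler(conditions: List[Row]) -> List[Row]:
    """Stand-in for a real synthesizer: echoes the conditions and fills in placeholder values."""
    return [
        {**cond, 'age': 30, 'id': f'generated-{i}'}
        for i, cond in enumerate(conditions)
    ]


known = [
    {'id': 101, 'workclass': 'Private'},
    {'id': 102, 'workclass': 'State-gov'},
]
rows = sample_with_generated_columns(known, {'id'}, fake_sampler)
```

In this sketch each returned row keeps the `id` and `workclass` values that were passed in, while the remaining columns come from the sampler. Whether the real implementation should also validate the supplied values (for example, uniqueness of primary keys) is an open question.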
Related to this issue: https://github.com/sdv-dev/SDV/issues/1096