SDV If there are many `NaNs`, conditional sampling sometimes fails (`GaussianCopulaSynthesizer`)

Environment Details

SDV version: 1.2.1 (latest)

Error Description

Sometimes conditional sampling will fail with the GaussianCopulaSynthesizer, even though this is supposed to be very efficient. This occurs even if I remove all the constraints.

According to the docs, the Gaussian Copula is supposed to sample conditionally by directly updating its mathematical formulas. There shouldn't be any reason for it to reject the samples it creates.

Root Cause

We identified the root cause to be the way we handle NaN values in numerical/datetime columns. If NaNs are possible in the conditioned columns and those columns are numerical/datetime, then there's a chance some sampled rows would get rejected.

1. We forward transform the conditioned columns to fully numerical ones.
2. The Copula algorithm conditionally samples the remaining columns (directly using mathematical formulas). (Note that this step performs a matrix inversion, which can occasionally fail. Our exploration shows that it is not the leading cause of failure.)
3. We reverse transform the output back into the original space.
4. We reject any rows that don't match the original conditions.
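The direct conditional sampling described above can be illustrated with the standard formulas for conditioning a multivariate normal distribution. This is a minimal NumPy sketch, not SDV's internal implementation: the function name `conditional_gaussian` and the toy mean and covariance are assumptions made for illustration, and it presumes the columns have already been mapped into Gaussian space.

```python
import numpy as np

def conditional_gaussian(mean, cov, cond_idx, cond_values):
    """Mean and covariance of the free variables given fixed values
    for the conditioned variables (hypothetical helper)."""
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    cond_idx = np.asarray(cond_idx)
    free_idx = np.setdiff1d(np.arange(len(mean)), cond_idx)

    s11 = cov[np.ix_(free_idx, free_idx)]  # free vs. free
    s12 = cov[np.ix_(free_idx, cond_idx)]  # free vs. conditioned
    s22 = cov[np.ix_(cond_idx, cond_idx)]  # conditioned vs. conditioned

    diff = np.asarray(cond_values, dtype=float) - mean[cond_idx]
    # Solving the linear system plays the role of the matrix inversion;
    # it raises numpy.linalg.LinAlgError if s22 is singular.
    cond_mean = mean[free_idx] + s12 @ np.linalg.solve(s22, diff)
    cond_cov = s11 - s12 @ np.linalg.solve(s22, s12.T)
    return free_idx, cond_mean, cond_cov

# Toy example: two correlated variables, condition on the second one.
mean = [0.0, 0.0]
cov = [[1.0, 0.8], [0.8, 1.0]]
free_idx, cond_mean, cond_cov = conditional_gaussian(mean, cov, [1], [1.0])

rng = np.random.default_rng(0)
samples = rng.multivariate_normal(cond_mean, cond_cov, size=5)
```

Every row drawn this way already satisfies the condition in the transformed space, which is why any rejection has to come from the later steps rather than from the sampling itself.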

The issues are in steps 3 and 4:

Step 3: The reverse transform can randomly add NaNs back in. (See FloatFormatter.) In this case, the conditioned column would no longer match the input.
Step 4: If there is a NaN value in the conditioned columns, then the comparison fails during rejection sampling, so 0 rows would be returned. At the very least, we should be able to fix this step easily.
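Both failure modes can be reproduced with plain pandas. This is a minimal sketch: the helper `reject_mismatches` is hypothetical and only mimics an equality-based rejection step; it is not SDV's actual code.

```python
import numpy as np
import pandas as pd

def reject_mismatches(sampled, condition):
    """Keep only rows whose values equal the requested condition
    (hypothetical stand-in for the rejection step)."""
    mask = pd.Series(True, index=sampled.index)
    for column, value in condition.items():
        mask &= sampled[column] == value
    return sampled[mask]

# Pretend the reverse transform turned one conditioned value back into NaN.
sampled = pd.DataFrame({
    'age': [30.0, np.nan, 30.0],
    'amt': [15.60, 90.00, 18.21],
})

# Step 3 problem: the row whose age became NaN no longer matches and is dropped.
kept = reject_mismatches(sampled, {'age': 30.0})

# Step 4 problem: NaN never compares equal to anything, so nothing survives.
kept_nan = reject_mismatches(sampled, {'age': np.nan})
```

Here `kept` contains two of the three rows, while `kept_nan` is empty even though one sampled row does have a NaN age.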

Note that we do not currently support passing in NaN conditions, though there is an existing feature request for it in #695.

Jul 24 '23 20:07 npatki

Workaround

Until we can accommodate this feature request, a workaround would be to update the preprocessing of your numerical/datetime columns -- especially the ones that you want to conditionally sample on later. For example, assume your real data has many columns and you'd like to conditionally sample specific ages and start dates in the synthetic data.

age	start_date	amt	gender	browser
32	2023-02-03	15.60	F	Chrome
46	NaN	90.00	M	Firefox
NaN	2023-10-30	18.21	M	Safari

Then you'd do the following when training your synthesizer:

from sdv.single_table import GaussianCopulaSynthesizer
from rdt.transformers.numerical import FloatFormatter
from rdt.transformers.datetime import UnixTimestampEncoder

# create a synthesizer as usual with your metadata
synth = GaussianCopulaSynthesizer(metadata)

# update the preprocessing for numerical and datetime columns
# to have missing_value_generation='from_column' (instead of random)
synth.auto_assign_transformers(data)
synth.update_transformers({
  'age': FloatFormatter(missing_value_generation='from_column', learn_rounding_scheme=True, enforce_min_max_values=True),
  'start_date': UnixTimestampEncoder(missing_value_generation='from_column', enforce_min_max_values=True, datetime_format='%Y-%m-%d')
})

# now continue fitting your synthesizer
synth.fit(data)
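To see why `from_column` helps, here is a conceptual sketch of the two ways missing values can be put back during the reverse transform. It does not call RDT; the functions `reverse_random` and `reverse_from_column` are hypothetical simplifications, assuming that the random mode reinserts NaNs at the learned proportion and that the `from_column` mode follows a separate is-null indicator column.

```python
import numpy as np
import pandas as pd

def reverse_random(values, nan_fraction, rng):
    """Reinsert NaNs at random positions, at roughly the learned proportion
    (simplified stand-in for the 'random' behavior)."""
    out = values.astype(float).copy()
    out[rng.random(len(out)) < nan_fraction] = np.nan
    return out

def reverse_from_column(values, is_null):
    """Reinsert NaNs only where the indicator column says so
    (simplified stand-in for the 'from_column' behavior)."""
    out = values.astype(float).copy()
    out[is_null >= 0.5] = np.nan
    return out

rng = np.random.default_rng(0)
conditioned_age = pd.Series([30.0] * 1000)

# Random mode: some conditioned rows turn into NaN and would later be rejected.
random_result = reverse_random(conditioned_age, nan_fraction=0.4, rng=rng)

# Indicator mode: a non-NaN condition implies an indicator of 0 for every row,
# so no conditioned value is overwritten.
indicator = pd.Series(np.zeros(1000))
from_column_result = reverse_from_column(conditioned_age, indicator)
```

In this toy setup `random_result` loses a share of its conditioned values to NaN, while `from_column_result` keeps all of them.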

Now, conditional sampling from these columns should be easier.

from sdv.sampling import Condition

my_condition = Condition(
    num_rows=250,
    column_values={'age': 30, 'start_date': '2023-01-01'}
)

synth.sample_from_conditions([my_condition])

Note that your conditions cannot have NaN values (this feature is not currently supported).
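A small guard can catch NaN conditions before they reach the synthesizer. This is only a sketch: `find_nan_conditions` is a hypothetical helper, not part of SDV, and it works on the plain dictionary you would pass as `column_values`.

```python
import numpy as np
import pandas as pd

def find_nan_conditions(column_values):
    """Return the names of condition columns whose value is missing
    (hypothetical helper, not an SDV function)."""
    return [column for column, value in column_values.items() if pd.isna(value)]

ok_values = {'age': 30, 'start_date': '2023-01-01'}
bad_values = {'age': np.nan, 'start_date': '2023-01-01'}

ok_missing = find_nan_conditions(ok_values)    # no missing values
bad_missing = find_nan_conditions(bad_values)  # 'age' is flagged
```

Checking the dictionary up front gives a clear message about which column is the problem instead of an empty or failed sample.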

Jan 16 '24 20:01 npatki

Does this phenomenon also apply to CTGANSynthesizer?

May 16 '24 02:05 limhasic

Hi @limhasic, this issue is only related to the GaussianCopulaSynthesizer because it is supposed to be very efficient at conditional sampling.

Any other type of synthesizer -- including CTGANSynthesizer -- is not as efficient for conditional sampling to begin with, and there is no promise of being able to deliver on all the conditions. If your project relies on conditional sampling, I'd strongly recommend using GaussianCopulaSynthesizer.

For more info, please check out our Troubleshooting page for conditional sampling. Thanks.

May 21 '24 14:05 npatki
