SDV
If there are many `NaNs`, conditional sampling sometimes fails (`GaussianCopulaSynthesizer`)
Environment Details
- SDV version: 1.2.1 (latest)
Error Description
Sometimes conditional sampling will fail with the GaussianCopulaSynthesizer, even though this is supposed to be very efficient. This occurs even if I remove all the constraints.
According to the docs, the Gaussian Copula is supposed to perform conditional sampling directly by updating its mathematical formulas. There shouldn't be a reason for it to reject the samples it creates.
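For reference, "updating its mathematical formulas" can be read as the standard closed form for conditioning a multivariate normal distribution. This is textbook background, not a description of SDV's exact internals, and the block notation is an assumption introduced here: $x_2$ holds the conditioned columns in the transformed space, $x_1$ holds the remaining columns, and $\mu$ and $\Sigma$ are partitioned the same way.

$$
\mu_{1 \mid 2} = \mu_1 + \Sigma_{12} \Sigma_{22}^{-1} (x_2 - \mu_2), \qquad
\Sigma_{1 \mid 2} = \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21}
$$

Every draw from $\mathcal{N}(\mu_{1 \mid 2}, \Sigma_{1 \mid 2})$ is paired with the requested $x_2$ by construction, which is why no rejection should be needed in principle.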
Root Cause
We identified the root cause to be the way we handle NaN values in numerical/datetime columns. If NaNs are possible in the conditioned columns and those columns are numerical/datetime, then there's a chance some sampled rows would get rejected.
1. We forward transform the conditioned columns to fully numerical ones.
2. The Copula algorithm conditionally samples the remaining columns (directly using mathematical formulas). (Note that this step performs a matrix inversion, which can occasionally fail. Our exploration shows that it is not the leading cause of failure.)
3. We reverse transform the output back into the original space.
4. We reject any rows that don't match the original conditions.
The issues are in steps 3 and 4:
- Step 3: The reverse transform can randomly add NaNs back in. (See FloatFormatter.) In this case, the conditioned column would no longer match the input.
- Step 4: If there is a NaN value in the conditioned columns, then the comparison fails during rejection sampling, so 0 rows would be returned. At the very least, we should be able to fix this step easily.
Note that we do not currently support passing in NaN conditions, though there is an existing feature request for it in #695.
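The two failure modes in steps 3 and 4 can be reproduced in plain Python without SDV. This is a minimal sketch, not SDV's actual code: `reverse_transform_random` and `reject_mismatches` are hypothetical stand-ins, and the 40% NaN fraction is an arbitrary illustrative value.

```python
import math
import random


def reverse_transform_random(values, nan_fraction, rng):
    """Hypothetical stand-in for step 3: put NaNs back at random positions."""
    return [math.nan if rng.random() < nan_fraction else v for v in values]


def reject_mismatches(values, condition):
    """Step 4: keep only the rows whose value equals the requested condition."""
    return [v for v in values if v == condition]


rng = random.Random(0)  # fixed seed so the run is repeatable
requested_age = 30.0

# Pretend step 2 produced 1000 rows that all honor the condition.
sampled = [requested_age] * 1000

# Step 3: some conditioned values are randomly turned back into NaN.
restored = reverse_transform_random(sampled, nan_fraction=0.4, rng=rng)

# Step 4: NaN never equals 30.0, so those rows are rejected.
kept = reject_mismatches(restored, requested_age)
print(f"{len(kept)} of {len(sampled)} rows survive rejection")

# A NaN condition matches nothing at all, because NaN != NaN under IEEE 754.
kept_for_nan = reject_mismatches(restored, math.nan)
print(f"{len(kept_for_nan)} rows match a NaN condition")
```

The more NaNs the real column contains, the larger the rejected share becomes, which matches the observation that failures show up when there are many NaNs.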
Workaround
Until we can accommodate this feature request, a workaround would be to update the preprocessing of your numerical/datetime columns -- especially the ones that you want to conditionally sample on later. For example, assume your real data has many columns and you'd like to conditionally sample specific ages and start dates in the synthetic data.
age | start_date | amt | gender | browser |
---|---|---|---|---|
32 | 2023-02-03 | 15.60 | F | Chrome |
46 | NaN | 90.00 | M | Firefox |
NaN | 2023-10-30 | 18.21 | M | Safari |
Then you'd do the following when training your synthesizer:
from sdv.single_table import GaussianCopulaSynthesizer
from rdt.transformers.numerical import FloatFormatter
from rdt.transformers.datetime import UnixTimestampEncoder

# create a synthesizer as usual with your metadata
synth = GaussianCopulaSynthesizer(metadata)

# update the preprocessing for numerical and datetime columns
# to have missing_value_generation='from_column' (instead of 'random')
synth.auto_assign_transformers(data)
synth.update_transformers({
    'age': FloatFormatter(
        missing_value_generation='from_column',
        learn_rounding_scheme=True,
        enforce_min_max_values=True,
    ),
    'start_date': UnixTimestampEncoder(
        missing_value_generation='from_column',
        enforce_min_max_values=True,
        datetime_format='%Y-%m-%d',
    ),
})

# now continue fitting your synthesizer
synth.fit(data)
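To see why `'from_column'` helps, here is a rough sketch of the idea in plain Python: missingness is stored in a separate indicator column, so the reverse transform restores NaNs only where the indicator says so instead of at random. The function names, the mean fill value, and the 0.5 threshold are illustrative assumptions, not RDT's actual implementation.

```python
import math


def forward_from_column(values):
    """Fill NaNs with the column mean and record missingness in a second column."""
    observed = [v for v in values if not math.isnan(v)]
    fill = sum(observed) / len(observed)
    numeric = [fill if math.isnan(v) else v for v in values]
    is_null = [1.0 if math.isnan(v) else 0.0 for v in values]
    return numeric, is_null


def reverse_from_column(numeric, is_null):
    """Restore NaN only where the indicator column marks the value as missing."""
    return [math.nan if flag >= 0.5 else v for v, flag in zip(numeric, is_null)]


ages = [32.0, 46.0, math.nan]
numeric, is_null = forward_from_column(ages)
restored = reverse_from_column(numeric, is_null)

# Assuming the condition also fixes the indicator to "not missing" (0.0),
# the conditioned value is never overwritten by a NaN.
conditioned = reverse_from_column([30.0], [0.0])
```

With the default random generation, the reverse transform has no such indicator to consult, so a conditioned value can be replaced by NaN regardless of what was requested.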
Now, conditional sampling from these columns should be easier.
from sdv.sampling import Condition

# 'age' is a numerical column, so pass a number rather than a string
my_condition = Condition(
    num_rows=250,
    column_values={'age': 30, 'start_date': '2023-01-01'}
)
synth.sample_from_conditions([my_condition])
Note that your conditions cannot have NaN values (this feature is not currently supported).
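Since NaN conditions are not supported, it may be worth checking the values before building a `Condition`. The helper below is a hypothetical convenience, not part of SDV; it only detects float NaNs (including `numpy.nan`), so `None` or other missing-value markers would need their own check.

```python
import math


def find_nan_conditions(column_values):
    """Return the names of columns whose requested value is a float NaN."""
    return [
        column
        for column, value in column_values.items()
        if isinstance(value, float) and math.isnan(value)
    ]


ok = find_nan_conditions({'age': 30, 'start_date': '2023-01-01'})
bad = find_nan_conditions({'age': float('nan'), 'start_date': '2023-01-01'})
print(ok, bad)
```

An empty list means the dictionary is safe to pass as `column_values`; any listed column should be given a concrete value or dropped from the condition.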
Does this phenomenon also apply to CTGANSynthesizer?
Hi @limhasic, this issue is only related to the GaussianCopulaSynthesizer because it is supposed to be very efficient at conditional sampling.
Any other type of synthesizer -- including CTGANSynthesizer -- is not as efficient for conditional sampling to begin with, and there is no promise of being able to deliver on all the conditions. If your project relies on conditional sampling, I'd strongly recommend using GaussianCopulaSynthesizer.
For more info, please check out our Troubleshooting page for conditional sampling. Thanks.