SDV
Condition on missing values
Problem Description
Let's add the ability to condition on a column's value being missing.
Expected behavior
- Specify a missing value using any of: None, np.nan, '', or np.nat (for datetime values)
- Output missing values will always be np.nan or np.nat
# Using a condition object
>>> import numpy as np
>>> from sdv.sampling import Condition
>>> my_condition = Condition(column_values={'room_type': np.nan, 'has_rewards': True}, num_rows=250)
# assume the synthesizer has already been trained
>>> synthesizer.sample_from_conditions([my_condition])
# Using a partial dataframe
>>> import pandas as pd
>>> import numpy as np
>>> data_frame = pd.DataFrame(data={
...     'room_type': ['BASIC', 'BASIC', 'DELUXE', '', None, 'BASIC', 'SUITE']})
>>> synthesizer.sample_remaining_columns(data_frame)
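The markers listed under Expected behavior could be recognized with a small scalar check. This is only a sketch of one possible approach, not SDV's implementation: the helper name `is_missing` is hypothetical, and it relies on the fact that NaN and NaT values (such as `float('nan')`, `np.datetime64('NaT')`, or `pd.NaT`) compare unequal to themselves.

```python
def is_missing(value):
    """Return True if a scalar condition value should count as missing.

    Hypothetical helper for illustration; not part of the SDV API.
    """
    if value is None:
        return True
    if isinstance(value, str):
        # Only the empty string is treated as a missing marker.
        return value == ''
    try:
        # NaN and NaT are not equal to themselves; ordinary scalars are.
        return bool(value != value)
    except (TypeError, ValueError):
        # Values that cannot be compared this way are treated as present.
        return False


requested = {'room_type': float('nan'), 'has_rewards': True, 'notes': ''}
missing_columns = [col for col, val in requested.items() if is_missing(val)]
```

The check is meant for scalar values only; passing an array or Series would need an element-wise test instead.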
Error State
If it's not possible to sample a missing value (i.e. missing values were not modeled for that column), raise a user-friendly error.
>>> my_condition = Condition(column_values={'age': np.nan})
>>> synthesizer.sample_from_conditions([my_condition])
Error: Unexpected value (np.nan) in column 'age'. Missing values were not modeled and cannot be sampled.
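One way the validation described above might look is sketched here. Everything in it is an assumption made for illustration: `check_missing_conditions` and the `columns_with_modeled_missing` argument are hypothetical names, and how a synthesizer would actually know which columns had missing values modeled is not specified in this issue.

```python
import math


def check_missing_conditions(column_values, columns_with_modeled_missing):
    """Raise a readable error if a missing value is requested for a column
    whose missing values were not modeled.

    Hypothetical helper; both the name and the arguments are assumptions.
    """
    for column, value in column_values.items():
        requested_missing = (
            value is None
            or (isinstance(value, str) and value == '')
            or (isinstance(value, float) and math.isnan(value))
        )
        if requested_missing and column not in columns_with_modeled_missing:
            raise ValueError(
                f"Unexpected value ({value!r}) in column '{column}'. "
                'Missing values were not modeled and cannot be sampled.'
            )


# 'room_type' had missing values during training in this example, 'age' did not.
check_missing_conditions({'room_type': None, 'has_rewards': True}, {'room_type'})
```

Running the check before sampling keeps the failure early and tied to the offending column, rather than surfacing later as a less specific sampling error.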
Workarounds
Unfortunately, there are no algorithmic workarounds that you can apply for this case. However, you can manually manipulate the data to avoid unexpected errors.
import numpy as np
from sdv.sampling import Condition
# remove the np.nan values from the condition entirely (keep other values)
my_condition = Condition(column_values={'has_rewards': True}, num_rows=250)
conditioned_synthetic_data = synthesizer.sample_from_conditions([my_condition])
# add the desired nan values back into the synthetic data
# here we want the entire 'room_type' column to be nan
conditioned_synthetic_data['room_type'] = np.nan
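The manual steps in this workaround can be wrapped in a small helper that separates the values to condition on from the columns to blank out afterwards. This is a sketch under assumptions: `split_condition` is a hypothetical name, and it only handles scalar values.

```python
def split_condition(column_values):
    """Split requested values into (values to condition on, columns to blank).

    Hypothetical helper for illustration; not part of the SDV API.
    """
    keep, blank = {}, []
    for column, value in column_values.items():
        # None, '' and NaN/NaT (which are unequal to themselves) count as missing.
        if value is None or value == '' or value != value:
            blank.append(column)
        else:
            keep[column] = value
    return keep, blank


keep, blank = split_condition({'room_type': float('nan'), 'has_rewards': True})
```

Here `keep` would be passed as `column_values` to `Condition`, and each column in `blank` would then be overwritten with `np.nan` in the sampled data, as in the workaround above.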