Condition on missing values

Open npatki opened this issue 2 years ago • 1 comments

Problem Description

Let's add the ability to condition on a column's value being missing.

Expected behavior

  • Specify a missing value using any of: None, np.nan, '', or pd.NaT (for datetime values)
  • Output missing values will always be np.nan or pd.NaT
# Using a condition object
>>> import numpy as np
>>> from sdv.sampling import Condition
>>> my_condition = Condition(column_values={'room_type': np.nan, 'has_rewards': True}, num_rows=250)

# assume the synthesizer has already been trained
>>> synthesizer.sample_from_conditions([my_condition])

# Using a partial dataframe
>>> import pandas as pd
>>> import numpy as np
>>> data_frame = pd.DataFrame(data={
...     'room_type': ['BASIC', 'BASIC', 'DELUXE', '', None, 'BASIC', 'SUITE']})
>>> synthesizer.sample_remaining_columns(data_frame)
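
Since several markers (None, np.nan, '', NaT) should all count as "missing", one way to handle them is to normalize the input before it reaches the sampler. The following is only a sketch in plain pandas, not SDV code: `is_missing` and `normalize_missing` are hypothetical helper names, and the sketch assumes condition values are scalars.

```python
import numpy as np
import pandas as pd


def is_missing(value):
    """Return True for None, NaN, NaT, or an empty string (scalars only)."""
    if isinstance(value, str):
        return value == ''
    return bool(pd.isna(value))


def normalize_missing(data_frame):
    """Return a copy in which every missing marker is replaced by NaN.

    pandas stores the NaN as NaT when the column has a datetime dtype.
    """
    normalized = data_frame.copy()
    for column in normalized.columns:
        mask = normalized[column].map(is_missing)
        if mask.any():
            normalized.loc[mask, column] = np.nan
    return normalized


partial = pd.DataFrame({'room_type': ['BASIC', '', None, 'SUITE']})
normalized = normalize_missing(partial)
```

After this step, `normalized['room_type'].isna()` flags both the empty string and the None entry, so downstream code only has to check for one representation.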

Error State

If a missing value cannot be sampled (i.e., missing values were not modeled for that column), raise a user-friendly error.

>>> my_condition = Condition(column_values={'age': np.nan})
>>> synthesizer.sample_from_conditions([my_condition])
Error: Unexpected value (np.nan) in column 'age'. Missing values were not modeled and cannot be sampled.
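
A rough sketch of what such a check could look like, written against plain pandas rather than SDV internals. `check_missing_conditions` is a hypothetical name, and the rule "a column models missing values only if the training data contained at least one" is an assumption made for illustration; the real criterion depends on how the synthesizer was fitted.

```python
import numpy as np
import pandas as pd


def check_missing_conditions(column_values, training_data):
    """Raise a ValueError if a condition asks for a missing value in a
    column whose training data never contained one (assumed criterion)."""
    for column, value in column_values.items():
        if isinstance(value, str):
            requested_missing = value == ''
        else:
            requested_missing = bool(pd.isna(value))

        if requested_missing and not training_data[column].isna().any():
            raise ValueError(
                f"Unexpected value ({value!r}) in column '{column}'. "
                'Missing values were not modeled and cannot be sampled.'
            )


# Toy training data: 'room_type' has a missing entry, 'age' does not.
training_data = pd.DataFrame({
    'age': [31, 45],
    'room_type': ['BASIC', None],
})

# Passes silently, because 'room_type' contained missing values.
check_missing_conditions({'room_type': np.nan}, training_data)
```

Calling the same function with `{'age': np.nan}` raises the ValueError, since 'age' has no missing values in the toy training data.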

npatki · Jan 27 '22 17:01

Workarounds

Unfortunately, there is no algorithmic workaround for this case. However, you can post-process the synthetic data manually: leave the missing value out of the condition, sample, and then write the missing values back into the result.

import numpy as np
from sdv.sampling import Condition

# remove the np.nan values from the condition entirely (keep other values)
my_condition = Condition(column_values={'has_rewards': True}, num_rows=250)
conditioned_synthetic_data = synthesizer.sample_from_conditions([my_condition])

# add the desired nan values back into the synthetic data
# here we want the entire 'room_type' column to be nan
conditioned_synthetic_data['room_type'] = np.nan
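
If you have many conditions, the same steps can be wrapped in two small helpers. This is only a sketch: `split_condition` and `blank_out` are hypothetical names, not part of SDV, and the `sampled` frame below is a hand-written stand-in for the synthesizer's output.

```python
import numpy as np
import pandas as pd


def split_condition(column_values):
    """Split a condition dict into the known values and the list of
    columns that were requested as missing."""
    known, missing_columns = {}, []
    for column, value in column_values.items():
        if isinstance(value, str):
            value_is_missing = value == ''
        else:
            value_is_missing = bool(pd.isna(value))
        if value_is_missing:
            missing_columns.append(column)
        else:
            known[column] = value
    return known, missing_columns


def blank_out(synthetic_data, missing_columns):
    """Return a copy with the given columns overwritten by NaN (or NaT
    for datetime columns)."""
    result = synthetic_data.copy()
    for column in missing_columns:
        if pd.api.types.is_datetime64_any_dtype(result[column]):
            result[column] = pd.NaT
        else:
            result[column] = np.nan
    return result


known, missing_columns = split_condition({'room_type': np.nan, 'has_rewards': True})

# Stand-in for the rows sampled with Condition(column_values=known, ...).
sampled = pd.DataFrame({
    'room_type': ['BASIC', 'SUITE'],
    'has_rewards': [True, True],
    'checkin_date': pd.to_datetime(['2022-01-27', '2022-01-28']),
})
patched = blank_out(sampled, missing_columns)
```

Keep in mind that the other columns are sampled without any knowledge of the missing value, so any relationship the model learned between missingness in 'room_type' and the remaining columns will not show up in the patched data.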

npatki · Jan 25 '24 15:01