SDV

Support for specifying a range during conditional sampling

Open srinify opened this issue 11 months ago • 3 comments

Inspired by this issue originally: https://github.com/sdv-dev/SDV/issues/1833

After a quick discussion with Neha, we're opening this feature request. Currently, conditional sampling accepts only exact values (e.g. weight is 50); you can't specify a range of values (e.g. weight from 50 to 200).

srinify avatar Mar 06 '24 16:03 srinify

Workaround

For anyone blocked by this, you can use the code snippet below. It samples many rows without any condition on the column, then filters the result down to the range you need.

# TODO: input the conditions you need
COL_NAME = 'my_column_name'
LOW_RANGE = 18.0 # minimum possible value in range
HIGH_RANGE = 100.0 # maximum possible value in range

# Request more rows than you need. Maybe 1,000 if you need 100 true rows.
synthetic_data = synthesizer.sample(1000)

# Filter out rows to within the range
filtered_synthetic_data = synthetic_data[(synthetic_data[COL_NAME] >= LOW_RANGE) & (synthetic_data[COL_NAME] <= HIGH_RANGE)]
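
If a single oversized sample does not yield enough in-range rows, the same sample-then-filter idea can be repeated in batches until the target count is reached. The sketch below is only an illustration of that loop, not an SDV feature: the helper name `sample_in_range` and its parameters (`batch_size`, `max_batches`) are hypothetical, and it assumes only that `synthesizer.sample(n)` returns a pandas DataFrame, as in the snippet above.

```python
import pandas as pd


def sample_in_range(synthesizer, column, low, high, num_rows,
                    batch_size=1000, max_batches=50):
    """Sample unconditionally in batches, keeping rows where low <= column <= high.

    Hypothetical helper: stops once num_rows matching rows are collected
    or after max_batches batches, whichever comes first.
    """
    kept = []
    total = 0
    for _ in range(max_batches):
        batch = synthesizer.sample(batch_size)
        # Series.between is inclusive on both ends by default.
        in_range = batch[batch[column].between(low, high)]
        kept.append(in_range)
        total += len(in_range)
        if total >= num_rows:
            break
    # May contain fewer than num_rows rows if the batch cap was reached.
    return pd.concat(kept, ignore_index=True).head(num_rows)
```

For example, `sample_in_range(synthesizer, COL_NAME, LOW_RANGE, HIGH_RANGE, num_rows=100)` would return up to 100 filtered rows. The `max_batches` cap keeps the loop from running indefinitely when the range is rare in the synthetic data, so check the length of the result before relying on it.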

npatki avatar Mar 06 '24 18:03 npatki

Thanks @npatki for sharing the workaround code. Can such conditions be defined even before generating the samples? I think it would be better to have something like generate with conditions (different from generating with constraints) to avoid unnecessary computation time in generating and then filtering based on conditions.

adib0073 avatar Mar 28 '24 15:03 adib0073

Hi @adib0073, unfortunately I cannot think of a good workaround that would allow you to do so right now.

However, in the future, when the team adds an actual feature to enable range-based conditional sampling, that is exactly how I envision it working.

npatki avatar Mar 29 '24 20:03 npatki