synth icon indicating copy to clipboard operation
synth copied to clipboard

Feature: Conditional sampling

Open archnode opened this issue 3 years ago • 6 comments

Required Functionality It would be great to be able to specify a subset of the schema using conditions while calling generate.

  • Ability to define one or multiple conditions:
    • Single value: Sets a field value to a fixed value.
    • Range: Defines a range that is a subset of the one defined in the schema.
  • Would be great to be able to subset following field types: bool, number: range, number: id, string: categorical (Nice to have: one_of variants)

Example: When using the bank example, I would like to be able to specify that I only want samples for transactions with currency USD and GBP.

Proposed Solution I could imagine two solutions:

  • Adding a cli parameter and creating a subsetting syntax for each supported field type (e.g. synth generate bank_db --condition transactions.currency=USD,GBP
  • Providing a templating/patching approach (similar to kustomize where you basically validate and merge two JSON schema definitions. The base definition gets partially overridden with the newly defined subset definition.

Use case

  • Sampling certain subsets of a complex set of possible value combinations (We use this as a base for value inference using machine learning).
  • Should also match this use case from SDV: https://github.com/sdv-dev/SDV/issues/316

archnode avatar Aug 30 '21 14:08 archnode

Thank you @archnode, this is a great idea!

The templating/patching approach we could make it work by following kustomize's UX (ish). For example, we could add a --customize flag to synth generate that takes a kind of customization.json for the schema:

{
  "transactions": {
    "content": {
      "currency": {
        "categorical": {
          "USD": 1,
          "GBP": 1
        }
      }
    }
  }
}

Then running

synth generate --customize customization.json examples/bank/bank_db

works as you want and only generates transactions in USD or GBP.

In terms of merge strategy, for subsetting for primitives (string, number, bool), we'll probably be fine with replacing the leaf schema node and recursing otherwise. So for the example above, we'll just replace the "currency" string node with the new categorical value.

There's code for this already that we currently use for merging in values on import into the schema. It should be fine using it for merging in schema instances.

I am not sure yet how to cover one_of with this though, but we can address that later, in a second PR!

brokad avatar Sep 07 '21 15:09 brokad

Looping in @fretz12 who would like to take this one!

christos-h avatar Sep 10 '21 19:09 christos-h

hey @archnode @brokad - this sounds like a really useful feature. I wanted to run some proposed solutions by you to see if they make sense. I can see the CLI option being helpful in adhoc overrides, and the schema merging sounds useful if you're "permanently" overriding parts of the schema. What if we supported both?

CLI:

Basically a direct string replacement-

synth generate bank_db --override transactions.currency="{\"transactions\": {\"content\": {\"currency\": {\"categorical\": {\"USD\": 1,\"GBP\": 1}}}}}"

Schema Merge:

Basically brokad's way:

synth generate --override customization.json examples/bank/bank_db

I chose the keyword override cuz I've dealt with schema overrides before and that seemed to resonate well.

I'd love to hear your thoughts

fretz12 avatar Sep 12 '21 06:09 fretz12

@fretz12, why not both indeed, that sounds great! :smile:

For the CLI proposal, we could use Find<Content> which is implemented for each variant node of the schema tree. This is exposed by Namespace::get_s_node. You'll need the string the user submits to be deserializable into a Content though. So your example would rather look like:

synth generate bank_db \
      --override transactions.content.currency="{\"type\": \"string\", \"categorical\": {\"USD\": 1,\"GBP\": 1}}"

It wouldn't be hard to remove the need to have the "type" in there by assuming that if it's not specified, then the existing node's type prevails.

We could do this in two PRs or just one, I don't think this will get too big.

+1 for override

brokad avatar Sep 13 '21 10:09 brokad

@brokad- thanks for the tip! I'll take that into account

fretz12 avatar Sep 14 '21 06:09 fretz12

Thank you very much for the feedback and the work on this!

Just for completeness: One aspect hat differs for my usecase (and something maybe I could work on in another issue) is that the customization should be more of a verified constraint than an override - I'd like to make sure that e.g. the defined currency is within the bounds of the original definition.

archnode avatar Sep 21 '21 10:09 archnode