synth
synth copied to clipboard
Feature: Conditional sampling
Required Functionality It would be great to be able to specify a subset of the schema using conditions while calling generate.
- Ability to define one or multiple conditions:
- Single value: Sets a field value to a fixed value.
- Range: Defines a range that is a subset of the one defined in the schema.
- Would be great to be able to subset following field types: bool, number: range, number: id, string: categorical (Nice to have: one_of variants)
Example: When using the bank example, I would like to be able to specify that I only want samples for transactions with currency USD and GBP.
Proposed Solution I could imagine two solutions:
- Adding a cli parameter and creating a subsetting syntax for each supported field type (e.g.
synth generate bank_db --condition transactions.currency=USD,GBP
- Providing a templating/patching approach (similar to kustomize where you basically validate and merge two JSON schema definitions. The base definition gets partially overridden with the newly defined subset definition.
Use case
- Sampling certain subsets of a complex set of possible value combinations (We use this as a base for value inference using machine learning).
- Should also match this use case from SDV: https://github.com/sdv-dev/SDV/issues/316
Thank you @archnode, this is a great idea!
The templating/patching approach we could make it work by following kustomize's UX (ish). For example, we could add a --customize
flag to synth generate
that takes a kind of customization.json
for the schema:
{
"transactions": {
"content": {
"currency": {
"categorical": {
"USD": 1,
"GBP": 1
}
}
}
}
}
Then running
synth generate --customize customization.json examples/bank/bank_db
works as you want and only generates transactions in USD or GBP.
In terms of merge strategy, for subsetting for primitives (string, number, bool), we'll probably be fine with replacing the leaf schema node and recursing otherwise. So for the example above, we'll just replace the "currency"
string node with the new categorical value.
There's code for this already that we currently use for merging in values on import into the schema. It should be fine using it for merging in schema instances.
I am not sure yet how to cover one_of
with this though, but we can address that later, in a second PR!
Looping in @fretz12 who would like to take this one!
hey @archnode @brokad - this sounds like a really useful feature. I wanted to run some proposed solutions by you to see if they make sense. I can see the CLI option being helpful in adhoc overrides, and the schema merging sounds useful if you're "permanently" overriding parts of the schema. What if we supported both?
CLI:
Basically a direct string replacement-
synth generate bank_db --override transactions.currency="{\"transactions\": {\"content\": {\"currency\": {\"categorical\": {\"USD\": 1,\"GBP\": 1}}}}}"
Schema Merge:
Basically brokad's way:
synth generate --override customization.json examples/bank/bank_db
I chose the keyword override
cuz I've dealt with schema overrides before and that seemed to resonate well.
I'd love to hear your thoughts
@fretz12, why not both indeed, that sounds great! :smile:
For the CLI proposal, we could use Find<Content>
which is implemented for each variant node of the schema tree. This is exposed by Namespace::get_s_node
. You'll need the string the user submits to be deserializable into a Content
though. So your example would rather look like:
synth generate bank_db \
--override transactions.content.currency="{\"type\": \"string\", \"categorical\": {\"USD\": 1,\"GBP\": 1}}"
It wouldn't be hard to remove the need to have the "type"
in there by assuming that if it's not specified, then the existing node's type prevails.
We could do this in two PRs or just one, I don't think this will get too big.
+1 for override
@brokad- thanks for the tip! I'll take that into account
Thank you very much for the feedback and the work on this!
Just for completeness: One aspect hat differs for my usecase (and something maybe I could work on in another issue) is that the customization should be more of a verified constraint than an override - I'd like to make sure that e.g. the defined currency is within the bounds of the original definition.