malli
Adding frequencies to `:or`-like schemas for generation
The kind of PR I am suggesting would solve two issues at once. Open to debate of course, but we could have something like this:
;; Variant A
[:orn {:gen/freq {::int 2 ::keyword 1}}
 [::int :int]
 [::keyword :keyword]]

;; Variant B
[:orn
 [::int :int 2]
 [::keyword :keyword 1]]
First, the goal is to generate more realistic output by following frequencies. In this example, I assume that realistic output contains ints twice as often as keywords.
Second, it can act as a mechanism for preventing exponential explosion when recursive definitions are involved. Consider this very simple case. The real ones I routinely encounter are much more complex, and the problem gets blown out of proportion very quickly:
(malli.generator/generate
 [:schema
  {:registry {::data [:or ::int [:ref ::vec]]
              ::int  :int
              ::vec  [:vector ::data]}}
  ::vec]
 {:size 3})
See how small the size must be. Try with 10 and see how large the output becomes on average. It is easy to go a bit further, adding more recursive definitions, and then, even a size of 1 becomes problematic. Instead, I could specify frequencies in ::data so that ::vec happens less often and that would be a very effective solution in order to prevent that kind of exponential explosion.
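The effect of weighting on recursion can be seen outside malli with a hand-written test.check generator. This is only a sketch: `data-gen`, its weights, and the explicit `depth` bound are hypothetical names chosen to mirror `::data` and `::vec` above. It is not how malli.generator builds recursive generators.

```clojure
(require '[clojure.test.check.generators :as gen])

(defn data-gen
  "Hypothetical analogue of ::data: either an int or a vector of ::data.
  int-weight and vec-weight bias the choice; depth bounds the recursion."
  [int-weight vec-weight depth]
  (if (zero? depth)
    gen/small-integer
    (gen/frequency
     [[int-weight gen/small-integer]
      [vec-weight (gen/vector (data-gen int-weight vec-weight (dec depth)))]])))

(defn leaf-count
  "Counts the ints in a generated value, however deeply nested."
  [x]
  (count (remove vector? (tree-seq vector? seq x))))

;; Equal weights: about half of all elements recurse at every level.
(def even-gen (gen/vector (data-gen 1 1 3)))

;; 9:1 in favor of ints: recursion becomes rare.
(def biased-gen (gen/vector (data-gen 9 1 3)))

(defn average-leaves
  "Average leaf count over n values generated at the given size."
  [g size n]
  (/ (reduce + (map leaf-count (repeatedly n #(gen/generate g size))))
     (double n)))

(comment
  ;; Compare the two at the same size; the biased generator should
  ;; produce far fewer leaves on average.
  (average-leaves even-gen 10 100)
  (average-leaves biased-gen 10 100))
```

With equal weights, each extra level of nesting multiplies the expected output by roughly half the vector length. Lowering the weight of the recursive branch shrinks that multiplier, which is the behavior a frequency property on `::data` would give.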
Implementation-wise, I believe it would be fairly simple: write a small -orn-gen function in malli.generator and slightly modify -multi-gen. I am targeting :orn because we need a way to refer to children by name.
If you approve the idea and we settle on a notation, I can open a PR. I favor Variant A: although it requires a bit of extra typing, it makes it absolutely clear that this is about generation and not validation.
Having such a :gen/freq property would mean using test.check's frequency generator instead of one-of. Default frequency would be 1.
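For reference, here is how the two test.check generators differ, using plain generators in place of malli schemas. `weights->gen` is a hypothetical helper showing how a partial `:gen/freq` map could fall back to a default weight of 1; it is not an existing malli function.

```clojure
(require '[clojure.test.check.generators :as gen])

;; one-of picks each alternative with equal probability.
(def uniform-gen
  (gen/one-of [gen/small-integer gen/keyword]))

;; frequency takes [weight generator] pairs; here ints are drawn
;; about twice as often as keywords, as in Variant A.
(def weighted-gen
  (gen/frequency [[2 gen/small-integer]
                  [1 gen/keyword]]))

(defn weights->gen
  "Hypothetical helper: builds a frequency generator from a map of
  child name -> generator and a (possibly partial) map of
  child name -> weight. Children missing from `weights` default to 1."
  [child-gens weights]
  (gen/frequency
   (mapv (fn [[k g]] [(get weights k 1) g]) child-gens)))

(comment
  ;; Rough check of the distribution over 1000 samples.
  (frequencies (map keyword? (gen/sample weighted-gen 1000)))

  ;; ::keyword has no explicit weight, so it gets 1.
  (gen/sample (weights->gen {::int gen/small-integer
                             ::keyword gen/keyword}
                            {::int 2})
              10))
```

When every weight is 1, `frequency` chooses uniformly, just as `one-of` does, so schemas without the property would keep their current distribution.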
I think the separate top-level property can too easily get out of sync with the implementation. I would push the properties to the children or entries:
anonymous -> children:
[:or
 [:int {:gen/frequency 1}]
 [:keyword {:gen/frequency 2}]]
named -> entries (child props could be used too, but the entry props would win if they exist):
[:orn
 [::int {:gen/frequency 1} :int]
 [::keyword {:gen/frequency 2} :keyword]]
Could be :gen/freq too, I don't have an opinion on that.
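A minimal sketch of how the per-child variant could be read for the anonymous `:or` case, assuming the property is named `:gen/frequency` and defaults to 1. `or-frequency-gen` is a hypothetical helper written against malli's public `m/children`, `m/properties` and `mg/generator`; it is not the actual malli.generator implementation.

```clojure
(require '[malli.core :as m]
         '[malli.generator :as mg]
         '[clojure.test.check.generators :as gen])

(defn or-frequency-gen
  "Hypothetical helper, not part of malli: builds a generator for an :or
  schema, weighting each child by its :gen/frequency property.
  Children without the property get a weight of 1."
  [?schema]
  (gen/frequency
   (mapv (fn [child]
           [(:gen/frequency (m/properties child) 1)
            (mg/generator child)])
         (m/children (m/schema ?schema)))))

(def int-or-keyword
  [:or
   [:int {:gen/frequency 1}]
   [:keyword {:gen/frequency 2}]])

(comment
  (gen/sample (or-frequency-gen int-or-keyword) 10))
```

Note that this reads the properties of every child each time the generator is built, whether or not any child sets the property.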
What do you think? PR most welcome.
It is clearer, but the downside is that it implies checking all children even when :gen/freq is not used, as opposed to checking just one property on the "parent". Some of my :multi schemas have 100-150 cases these days, so that is not always a trivial cost to pay when you are not using the feature.
I don't have strong opinions on this, but another thing worth considering is the interaction with things like mu/merge: whether it is useful (or harmful) for the frequencies of a super-schema to be carried over, and what the ergonomics of overriding them would be.