
Adding frequencies to `:or`-like schemas for generation

Open helins opened this issue 4 years ago • 3 comments

The kind of PR I am suggesting would solve two issues at once. Open to debate of course, but we could have something like this:

;; Variant A
[:orn {:gen/freq {::int 2 ::keyword 1}}
 [::int :int]
 [::keyword :keyword]]
 
;; Variant B
[:orn
 [::int :int 2]
 [::keyword :keyword 1]]                 

First, the goal is to generate more realistic output by following frequencies. In this example, I assume that realistic output contains ints twice as often as keywords.

Second, it can act as a mechanism for preventing exponential explosion when recursive definitions are involved. Consider this very simple use case. The real ones I routinely encounter are much more complex, and the problem quickly gets out of hand:

;; Alias malli.generator so the call below resolves.
(require '[malli.generator :as mg])

(mg/generate [:schema
              {:registry {::data [:or ::int [:ref ::vec]]
                          ::int :int
                          ::vec [:vector ::data]}}
              ::vec]
             {:size 3})

See how small the size must be. Try with 10 and see how large the output becomes on average. It is easy to go a bit further by adding more recursive definitions, and then even a size of 1 becomes problematic. Instead, I could specify frequencies in ::data so that ::vec is chosen less often, which would be a very effective way to prevent that kind of exponential explosion.
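To illustrate the effect outside of malli, here is a minimal test.check sketch. The names `data-gen`, `node-count`, and the `vec-weight` parameter are hypothetical, and the explicit `depth` bound is an assumption added only so that this hand-written generator can be constructed eagerly; it is not part of the proposal.

```clojure
(require '[clojure.test.check.generators :as gen])

;; Hypothetical stand-in for the ::data / ::vec pair above.
;; `vec-weight` is the weight of the recursive branch; the int branch
;; has a fixed weight of 4. `depth` bounds nesting (an assumption of
;; this sketch, so that building the generator terminates).
(defn data-gen [vec-weight depth]
  (if (zero? depth)
    gen/small-integer
    (gen/frequency
     [[4 gen/small-integer]
      [vec-weight (gen/vector (data-gen vec-weight (dec depth)) 0 3)]])))

;; Counts every int and every vector in a generated value.
(defn node-count [x]
  (if (vector? x)
    (inc (reduce + 0 (map node-count x)))
    1))

;; Compare the sizes produced with a rare and a frequent recursive branch.
(map node-count (gen/sample (data-gen 1 3) 10))
(map node-count (gen/sample (data-gen 8 3) 10))
```

Lowering `vec-weight` relative to the int branch makes nested vectors rarer, so generated values tend to stay small without shrinking `:size`.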

Implementation-wise, I believe it would be fairly simple: write a small -orn-gen function in malli.generator and slightly modify -multi-gen. I am targeting :orn because we need a way to refer to children by name.

If you approve the idea and we settle on a notation, I can PR it. I favor variant A. Although it requires a bit of extra typing, it makes it absolutely clear that this is about generation and not validation.

Having such a :gen/freq property would mean using test.check's frequency generator instead of one-of. Default frequency would be 1.
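As a rough sketch of that difference, the snippet below builds both generators directly with test.check. The helper `weighted-pairs` and the `entries` vector are hypothetical names used for illustration; this is not malli's implementation, only one way the Variant A map could be turned into the pairs that `gen/frequency` expects.

```clojure
(require '[clojure.test.check.generators :as gen])

;; Hypothetical helper: pair each named child generator with its weight,
;; falling back to 1 when the :gen/freq map has no entry for that name.
(defn weighted-pairs [freq entries]
  (mapv (fn [[k g]] [(get freq k 1) g]) entries))

;; Named children, mirroring the :orn example.
(def entries [[::int gen/small-integer]
              [::keyword gen/keyword]])

;; one-of: every branch is equally likely.
(def uniform-gen (gen/one-of (mapv second entries)))

;; frequency: ints are drawn about twice as often as keywords.
(def weighted-gen
  (gen/frequency (weighted-pairs {::int 2 ::keyword 1} entries)))

(gen/sample weighted-gen 10)
```

Because missing names default to a weight of 1, an empty `:gen/freq` map gives the same distribution as `gen/one-of`.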

helins, May 14 '21 10:05

I think a separate top-level property can too easily get out of sync with the implementation. I would push the properties down to the children or entries:

anonymous -> children:

[:or
 [:int {:gen/frequency 1}]
 [:keyword {:gen/frequency 2}]]

named -> entries (child props could be used too, but the entry props would take precedence if they exist):

[:orn
 [::int {:gen/frequency 1} :int]
 [::keyword {:gen/frequency 2} :keyword]]

It could be :gen/freq too; I don't have an opinion on that.
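For the anonymous :or case, reading the weights could look roughly like the sketch below. It assumes the proposed `:gen/frequency` property (which malli does not interpret today) and a hypothetical helper `child-weights`; `m/schema`, `m/children`, and `m/properties` are existing malli.core functions.

```clojure
(require '[malli.core :as m])

;; Hypothetical helper: collect the proposed :gen/frequency property
;; from each child of an :or schema, defaulting to 1 when it is absent.
(defn child-weights [?schema]
  (mapv (fn [child] (:gen/frequency (m/properties child) 1))
        (m/children (m/schema ?schema))))

(child-weights [:or
                [:int {:gen/frequency 1}]
                [:keyword {:gen/frequency 2}]
                :string])
;; => [1 2 1]
```

Note that this visits every child, whether or not any of them carries the property.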

What do you think? PR most welcome.

ikitommi, May 22 '21 18:05

It is clearer, but the downside is that it implies checking all children even when :gen/freq is not used, as opposed to checking just one property on the "parent". Some of my :multi schemas have 100-150 cases these days, so it is not always a trivial cost to pay when you are not using the feature.

helins, May 22 '21 18:05

I don't have strong opinions on this, but another thing worth considering is the interaction with things like mu/merge: whether it is useful (or bad) for the frequencies from a super-schema to be carried over, and what the ergonomics of overriding them would be.

rschmukler, May 24 '21 14:05