malli icon indicating copy to clipboard operation
malli copied to clipboard

On fixing Malli's recursive generation

Open helins opened this issue 3 years ago • 8 comments

For validation, current support for recursive data via :ref seems to be appropriate and satisfactory. The problem is that this design is not fit for generation. Any example, even a trivial one, becomes so impractical that a size of 2 or 3 has to be used. Quite often, it is already too high (and one cannot go much lower than that). For comparison, default size is 30 and defaut maximum size during tests is 200. Yet, Malli is an excellent library and test.check does provide a pretty good solution to what I am about to describe. The hard part is to solve how they can fit together without disturbing (too much) the current API.

In my experience, solving this issue would really push Malli forwards by a lot.

This is what we can call a problem of "exponential explosion". The fact that the size of some recursive data is not controlled as it gets generated. It means that instead of generating recursive data that is somewhat linear to the given size, it is rather somewhat exponential. Let us take this example from the README:

[:schema {:registry {::cons [:maybe [:tuple pos-int? [:ref ::cons]]]}}
  ::cons]

This example can be generated easily but only because size has no influence on it, the recursive case depends on :maybe. More often, the recursive case does depend on the size. Let us add a :vector:

[:schema {:registry {::cons [:maybe [:vector [:tuple pos-int? [:ref ::cons]]]]}}
  ::cons]

Now, with a default size of 30, generated values are often absolutely massive. So you can easily imagine how this is troublesome with any real world example. This is because every time a vector is generated, by chance, any of its element may be another vector generated with the same size as its parent, and so on.

Some cases can be mitigated by #449, by biasing generation towards leaves (non-recursive items) rather than containers (recursive items, collections of leaves). It is an easy fix and it is useful for other use cases anyway. However, this is not enough. It can mitigate our problem, and sometimes recursive generation needs to be biased anyway, but it is not a general solution.

What is missing is a mechanism that generate leaves with the given size but controls and diminishes the given size for containers. Let us take that kind of definition, very common in my experience:

[:schema
 {:registry {::data    [:or
                        ::int
                        ::vector]
             ::int     :int
             ::vector  [:vector
                        [:ref ::data]]}}
 ::data]

With a size of 30, I often need roughly 10 screens to display a typical generated structure. An actual use case would often have a few leaves besides ::int and a few containers besides ::vector. As we see, there is no distinction between the leaf and the container (unless you traverse :or children and notice ::vector cycles back to it, but that is not great for performance). If we had a way for making such a distinction, then we could easily leverage clojure.test.check.generators/recursive-gen: https://github.com/clojure/test.check/blob/test.check-0.10.0/src/main/clojure/clojure/test/check/generators.cljc#L1641

Their solution is really quite good and anyway, I don't think there is any other way to really outsmart it without deeply modifying test.check. I have a bit of a hard time imagining how current :ref alone could leverage this. It looks like it cannot do much on its own, without context. For performance reasons, I guess a new kind of schema for specifying containers and leaves would be a lot easier to implement. On the other hand, the API might become confusing. Really, I have no firm opinion at the moment besides the fact that we should somehow have support for recursive-gen alongside #449.

helins avatar May 18 '21 08:05 helins

I agree that the recursive gen is bad at the moment. I have read the clojure.test.check.generators/recursive-gen and checked out the spec's source trying to understand how to best resolve this. No enlightenments yet, so help welcome.

ikitommi avatar May 22 '21 19:05 ikitommi

I asked on the spec channel just to make sure. Hoping someone will prove me wrong but I think they don't have a solution either.

"Albeit extremely simple, with a pretty standard size of 30, generating a single instance of this example results commonly in 1000-3000 leaves (nested vectors containing that much booleans):"

(s/def ::bool
  boolean?)
(s/def ::vec
  (s/coll-of ::data))
(s/def ::data
  (s/or :bool ::bool
        :vec ::vec))

Spec has *recursion-limit* (or something like that) but it does not solve this exponential explosion (as proven by the above example).

After doing some thinking on recursive-gen, it is actually way harder to blend it into a validation framework than I thought at first. That fine granularity in making a distinction between containers and leaves (+ needing to compose generators on the spot) does not translate intuitively to validation without looking odd, at least from what I have been trying to imagine. I am rather positive that no matter how you put it, you need to control size in a discriminating fashion.

helins avatar May 22 '21 19:05 helins

In the end there was no other option for me but writing my own generators. The same would have happened for Spec as some people suggested. Maybe having recursive data is that one particular area where validation and generation cannot meet easily. maybe not at all. Interesting problem but very hard.

However we don't have to forfeit and close the issue.

helins avatar May 31 '21 18:05 helins

When I wrote the generators in Minimallist, I tried to address the problem of exploding size in recursive structures. My approach was to assign a budget to the whole data to generate, then split that budget more or less randomly as the the generation goes toward the leaves, in places where the model was allowing it (variable collection size, optional map entries, etc ...)

I hope it can help to improve Malli. https://github.com/green-coder/minimallist/blob/all-work-and-no-play/test/minimallist/generator_test.cljc#L478-L511

green-coder avatar Jul 18 '21 17:07 green-coder

In the hope that it would reveal an algorithm, I manually translated some schemas into gen/recursive-gen forms that look vaguely the same as the schemas.

Here's ping-pong from the readme:

        [:schema
         {:registry {::ping [:maybe [:tuple [:= "ping"] [:ref ::pong]]]
                     ::pong [:maybe [:tuple [:= "pong"] [:ref ::ping]]]}}
         ::ping]
        (->> (gen/sample
               (gen/recursive-gen
                 (fn [ping]
                   (gen/tuple (gen/return "ping")
                              (gen/recursive-gen
                                (fn [pong]
                                  (gen/tuple (gen/return "pong")
                                             ping))
                                (gen/return nil))))
                 (gen/return nil))
               100)
             (drop 75))

A weirder definition of ping-pong that disallows top-level nil, pushing disjunctions into the body of the recursive schemas:

        [:schema
         {:registry {::ping [:tuple [:= "ping"] [:maybe [:ref ::pong]]]
                     ::pong [:tuple [:= "pong"] [:maybe [:ref ::ping]]]}}
         ::ping]
        (->> (gen/sample
               (gen/recursive-gen
                 (fn [ping]
                   (gen/tuple (gen/return "ping")
                              (gen/one-of
                                [(gen/return nil)
                                 (gen/recursive-gen
                                   (fn [pong]
                                     (gen/tuple (gen/return "pong")
                                                ping))
                                   (gen/tuple (gen/return "pong")
                                              (gen/return nil)))])))
                 (gen/tuple (gen/return "ping")
                            (gen/return nil)))
               100)
             (drop 75))

@helins' example in the OP:

        [:schema
         {:registry {::data    [:or
                                ::int
                                ::vector]
                     ::int     :int
                     ::vector  [:vector
                                [:ref ::data]]}}
         ::data]
        (->> (gen/sample
               (gen/recursive-gen
                 (fn [data]
                   (gen/vector
                     data))
                 gen/large-integer)
               100)
             (drop 75))

My most promising lead so far is to use the generator case for :ref to eventually communicate with itself to close the loop. It at least seems to pinpoint a place where schema->scalar-schema and schema->container-schema functions are needed.

(defn -ref-gen [schema options]
  (let [ref-name (second (m/form schema))]
    (or 
      ;; already seen this :ref, the generator is handy and we're done
      (get-in options [::rec-gen ref-name])
      ;; otherwise, continue to unroll but save the generator for this ref for later
      (let [dschema (m/deref-all schema)
            ;; equivalent to dschema with all recursive :ref occurrences made unreachable
            scalar-schema (schema->scalar-schema dschema) ;;TODO
            ;; ??
            container-schema (schema->container-schema dschema)] ;;TODO
        (gen/recursive-gen
          (fn [inner]
            (generator container-schema
                       (assoc-in options [::rec-gen ref-name]
                                 inner)))
          (generator scalar-schema options))))))

frenchy64 avatar Aug 08 '21 05:08 frenchy64

The solution in #507 is very promising.

  ;; test sizing
  (into (sorted-map)
        (frequencies
          (map
            (comp count flatten)
            (mg/sample
              [:schema {:registry {::cons [:maybe [:tuple pos-int? [:ref ::cons]]]}}
               ::cons]
              {:size 100000}))))
  ;;=> {0 49963, 2 27335, 3 13062, 4 6334, 5 2464, 6 686, 7 141, 8 14, 9 1}

frenchy64 avatar Aug 13 '21 02:08 frenchy64

  (dotimes [size-pow 7]
    (time
      (prn
        (str "10^" size-pow)
        (into (sorted-map)
              (frequencies
                (map
                  (comp count flatten)
                  (mg/sample
                    [:schema {:registry {::cons [:maybe [:tuple pos-int? [:ref ::cons]]]}}
                     ::cons]
                    {:size (Math/pow 10 size-pow)})))))))
  ; (out) "10^0" {3 1}
  ; (out) "Elapsed time: 2.030981 msecs"
  ; (out) "10^1" {0 6, 2 3, 3 1}
  ; (out) "Elapsed time: 1.525037 msecs"
  ; (out) "10^2" {0 58, 2 28, 3 10, 4 4}
  ; (out) "Elapsed time: 3.935027 msecs"
  ; (out) "10^3" {0 497, 2 275, 3 138, 4 70, 5 17, 6 3}
  ; (out) "Elapsed time: 28.227256 msecs"
  ; (out) "10^4" {0 4968, 2 2789, 3 1338, 4 636, 5 218, 6 40, 7 9, 8 2}
  ; (out) "Elapsed time: 288.117389 msecs"
  ; (out) "10^5" {0 50219, 2 27130, 3 13152, 4 6214, 5 2484, 6 658, 7 121, 8 19, 9 3}
  ; (out) "Elapsed time: 2909.440828 msecs"
  ; (out) "10^6" {0 500465, 2 272651, 3 129733, 4 62264, 5 25107, 6 7724, 7 1759, 8 267, 9 29, 10 1}
  ; (out) "Elapsed time: 28400.642903 msecs"

frenchy64 avatar Aug 13 '21 02:08 frenchy64

Merged develop into #507 as the core has evolved a lot during the performance chase. With that:

(mg/sample [:schema {:registry {::cons [:maybe [:vector [:tuple pos-int? [:ref ::cons]]]]}}
            ::cons])
;([]
; [[2 []]]
; nil
; [[1 nil] [2 []] [1 nil]]
; nil
; [[1 nil] [4 [[1 [[1 []] [2 nil]]] [2 nil]]]]
; nil
; nil
; [[121 nil] [2 nil] [64 nil] [3 [[3 nil]]]]
; nil)

and:

(mg/sample [:schema
            {:registry {::data [:or
                                ::int
                                ::vector]
                        ::int :int
                        ::vector [:vector
                                  [:ref ::data]]}}
            ::data])
;(-1
;  -1
;  [0]
;  [[[]] []]
;  -2
;  [-1 -1 6 -1 [-5]]
;  [[0 -1] 2 0 -1 [-3 -1] -1]
;  [[]]
;  [-1 [-9 [] []] 0 -1 [[] [] [] -7] [-3 -1]]
;  [[-1 []] [[] 0 1 [[] []] []] [] -5 [] [[106 -124] [[] 36]]])

ikitommi avatar Nov 05 '21 05:11 ikitommi

fix merged in master 🎉.

ikitommi avatar Aug 27 '22 16:08 ikitommi