Fuzzer-Aware Run Counts?

Open rtfeldman opened this issue 7 years ago • 12 comments

NOTE: This seems clearly out of scope for 4.0.0, but worth discussing for a future release.

I was lamenting to @evancz about how the (arbitrarily chosen) default run count of 100 makes test runs take a while, and wondering whether a lower default run count might get similarly useful results in practice.

@evancz had a better idea: what if run counts varied based on the fuzzers involved?

For example, if I am using a fuzz bool test, it makes sense to run that exactly twice. There's no point in running it more than that.

If I am running a fuzz2 bool bool test, it makes sense to run 4 times. If I am running a fuzz int test, it makes sense to run maybe 10-20 times (making sure to get 0, 1, -1, a small negative number, a small positive number, a large negative number, a large positive number, maybe a mix of even and odd numbers...), more for a fuzz2 int int test, and even more still for a fuzz2 float float test.
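
For illustration, here's what such a test looks like with the current fuzz API; today it runs 100 times by default even though only two distinct inputs exist:

import Expect
import Fuzz
import Test exposing (Test, fuzz)

negationTest : Test
negationTest =
    fuzz Fuzz.bool "double negation is identity" <|
        \b -> not (not b) |> Expect.equal b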

The point is that Fuzz already incorporates information about shrinking. Why not have it also incorporate information about sensible default run counts?

For example, suppose we had two flavors of fuzzer run count strategies: enumerations and ranges. Enumerations (like Fuzz.bool) say "here is the exact set of things I can be. Try to make all the permutations happen, but if that number exceeds a certain threshold, fall back on a frequency distribution which I will specify." So fuzz bool always runs exactly twice, and fuzz2 bool bool always runs exactly 4 times, but once you start combining a bunch of them using Fuzz.tuple, eventually exact enumeration would be crazy and we fall back on a frequency.

For ranges, like Fuzz.int, run counts could be bucketed. For example, "Do 1 run that is exactly 0. Do another run that is exactly 1. Do another that is exactly -1. Then do another run that is a large, negative, odd number. Another that is a large, positive, odd number." etc - we can enumerate like 10 "interesting" integer values and have it do exactly those ten (with the only variance being the exact large number, rather than having the entire frequency vary) for fuzz int, whereas again, combining several Fuzz.int instances together would fall back on a frequency distribution.
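
As a rough sketch, the two strategies might look something like this as data (hypothetical types, purely illustrative; not part of elm-test):

import Fuzz exposing (Fuzzer)

type RunStrategy a
    = Enumeration (List a) (Fuzzer a)
      -- the exact values to try once each, plus the frequency
      -- distribution to fall back on past some threshold
    | Buckets (List (Fuzzer a))
      -- one run per "interesting" bucket, e.g. for int:
      -- constant 0, constant 1, constant -1, a large negative odd, ...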

The idea would be to achieve approximately the same quality of coverage (of interesting values) as the status quo, but with far lower run counts so that the tests can complete faster. After all, most of the time we don't care about the run count, we just want it to find edge cases for us.

One question to answer: in this world, how do you specify the "crank the run count up to 10,000 to really exercise this and hunt for edge cases" equivalent? Maybe in this world, the "run for X seconds and tell me about any failures you find" model makes more sense? Or maybe you specify a total run count for all tests, and after working through all the enumerations, it round-robins them until it reaches the grand total?

Lots of possibilities to explore here. Curious to hear people's thoughts!

rtfeldman avatar Apr 13 '17 00:04 rtfeldman

This is a really neat idea, and I think my objections are (1) it would make the already-complicated fuzzer code more so, and (2) we don't know what the performance benefit would be. It might be negligible or even negative due to the overhead. How often do tests do something that would require a frequency fallback?

You'd have to maintain fuzzer state between tests. Fuzzers (when not shrinking) are a Generator a. Maybe they'd become Generator (Lazy.List a). So for int, you'd give it a seed, and it would eagerly create a few constants and then randomly, lazily, create more. You'd probably want to pass the seed and some union type (what kind of int should I generate?) to the function that will lazily create the next item. And, for bool, it ignores the seed, creates a lazy list of true and false, and that's it.
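
A minimal sketch of that shape, assuming a lazy list library exposing cons and empty (randomIntsFrom is a hypothetical helper):

import Lazy.List exposing (LazyList, cons, empty)
import Random

type alias Fuzzer a =
    Random.Generator (LazyList a)

-- bool: ignore the randomness and enumerate both values
boolValues : LazyList Bool
boolValues =
    cons True (cons False empty)

-- int: a few eager constants, then lazily generate random values
intValues : Random.Seed -> LazyList Int
intValues seed =
    cons 0 (cons 1 (cons (-1) (randomIntsFrom seed)))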

I wonder how this would affect shrinking. It would probably replace the whole "tell me which one I'm doing" thing, which was a hack to begin with. But you'd basically have to manage a lazy list of increasingly complex values, each of which held a lazy list of shrunken values, with alternatives for more or less complex values depending on what causes the test to pass or fail. Sounds like a lot to keep track of. (Although, we could avoid shrinking lists to the empty list every time, since we'd know that it had already been tried. In fact this probably ties generation and shrinking very closely together...)
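
That structure (each candidate value carrying a lazy list of simpler alternatives) is essentially a lazy rose tree; a sketch, reusing the assumed LazyList from above:

-- hypothetical: generation and shrinking fused into one structure
type RoseTree a
    = Rose a (LazyList (RoseTree a))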

I'm not sure I have the time to commit to this, but it might be worth spiking a list, bool, and int implementation in order to do a benchmark.

mgold avatar Apr 13 '17 02:04 mgold

These are all great points! I tried comparing --fuzz 20 and --fuzz 100 on a pretty basic suite and the difference was negligible.

Maybe the first step would be trying to evaluate at what point it even makes a difference. 😅

rtfeldman avatar Apr 13 '17 05:04 rtfeldman

@rtfeldman I'm generating 20-node graphs with certain properties for fuzz tests in my graph library. That search space is a lot larger than average, and the 3 or so runs it takes on average to find a bug with 99.999% accuracy might very well miss some bugs in the larger search space.

I'd like a solution like this, one that actually estimates how much time it should spend on each test. Testing a Bool -> Bool -> Bool function with 10k inputs is just wrong. If we can set the cutoff point ourselves (or disable it), that would be great too; e.g. when merging into master or testing a release candidate, it could make sense to test all possible inputs overnight. Combining that with only running a single test at a time, so we can run all tests in different processes (possibly on different machines), would make it a great feature. You could spin up 300 EC2 instances and let them run for a few minutes, or distribute the work among developer machines or Jenkins workers.

drathier avatar Apr 13 '17 10:04 drathier

I know that Erlang QuickCheck (from Quviq) has an option where you can give it a time to run instead of a number of iterations. So you could say for a given property "run this for 10s" or "run this for 1hr". This might be a good option.

zkessin avatar Apr 16 '17 05:04 zkessin

I was thinking about this myself and am quite happy to see an issue about this already existing.

I think the most value could come from eliminating duplicated tests. So I wouldn't bother doing this for integers, since the probability of the exact values repeating is pretty small, but for union types this could be pretty handy.

Example of what I mean by waste:
type RemoteError = NotFound | Interrupted

type RemoteData
    = NotStarted
    | Loading
    | Success String
    | Error RemoteError

For a type like this we definitely want to test the NotStarted, Loading, Error NotFound, and Error Interrupted branches, as each of these is in some sense exceptional, but there is no point in testing any of them more than once. Hence we would want to spend the remaining 96 of the default 100 runs checking the success state. One can achieve a similar result with Fuzz.frequency, but there is a tradeoff there between waste and the possibility of low-frequency items not being tested at all.
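
For comparison, the Fuzz.frequency version would look roughly like this (a sketch; the exact signature of frequency has varied between elm-test releases):

remoteData : Fuzzer RemoteData
remoteData =
    Fuzz.frequency
        [ ( 1, Fuzz.constant NotStarted )
        , ( 1, Fuzz.constant Loading )
        , ( 1, Fuzz.constant (Error NotFound) )
        , ( 1, Fuzz.constant (Error Interrupted) )
        , ( 96, Fuzz.map Success Fuzz.string )
        ]

With 100 runs and these weights, each weight-1 case has roughly a (99/100)^100 ≈ 37% chance of never being drawn at all, which is exactly the tradeoff above.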


I think this removes the question of ramp-up, since all infinite things would still run up to the limit and all finite things would run to the limit or less.

gampleman avatar Jun 02 '17 08:06 gampleman

@gampleman You have a type with a small, finite number of possibilities. If you want to test each value directly, as opposed to a list of them or something, I would just use a unit test:

remoteDataTest =
    describe "remote data tests" <|
        List.map
            (\remoteData ->
                test (toString remoteData) <|
                    \_ ->
                        your test here
            )
            [ NotStarted, Loading, Error NotFound, Error Interrupted ]

(I haven't run that through the type checker but that's a rough outline.)

mgold avatar Jun 03 '17 16:06 mgold

@mgold I think if you're optimising the speed of your tests by hand, then somehow you're doing it wrong.

gampleman avatar Jun 03 '17 17:06 gampleman

Gotta disagree - I think test performance optimization is very important. Slow test suites are a big productivity cost on a big team, and there will never be a silver bullet that beats hand tuning. 😉

rtfeldman avatar Jun 03 '17 20:06 rtfeldman

I don't think using a unit test for easily-enumerated, non-nested cases is a perf optimization. I think it's semantically correct. Fuzz tests should only be used when you can't possibly or reasonably generate every case. (Also, with Fuzz.frequency there is a vanishingly small chance that one of your cases won't get run in the 100 trials.)

That said, being able to run tests (or the compiler) in under, say, two seconds so it's in line with your workflow is huge. I've worked on other projects where I instinctively check Twitter every time I run the tests, and it's awful for productivity and flow.

mgold avatar Jun 09 '17 22:06 mgold

@gampleman I worked with a 6-week test suite before. The small test suite ran every weekend. Local tests took at least 20 minutes but tested nearly nothing; they were just a very early indication that something had gone terribly wrong. We had syntax errors that passed the local tests. I get the same awful feeling when elm-test takes 10 seconds to run, and that is why I'm using only. Performance is super important to keeping your flow. I'd like tests to complete while I turn my head towards the screen where the compilation/test result is displayed, i.e. within a second.

drathier avatar Jun 10 '17 09:06 drathier

I agree and sorry for overstating. Fast tests are indeed a necessity. That's the first reason why I was talking about a safe way to automatically optimize fuzz tests to begin with.

gampleman avatar Jun 10 '17 17:06 gampleman

Apology accepted. The problem with automatically making fuzz tests smart is that it's hard to do. #167 is basically rolling back a perf optimization that had mixed results at best, made the code more complicated, and possibly introduced runtime errors in corner cases.

mgold avatar Jun 15 '17 04:06 mgold