Observing Test Case Distribution

Open moodmosaic opened this issue 7 years ago • 19 comments

Taken from QuickCheck manual:

It is important to be aware of the distribution of test cases: if test data is not well distributed then conclusions drawn from the test results may be invalid.

Thus, we could consider adding label, classify, and collect to the Property module.

moodmosaic avatar Oct 14 '16 08:10 moodmosaic

I have never used this kind of thing myself, but I can see the appeal. I usually test generators by using printSample and eyeballing the result, but perhaps this is better.

jacobstanley avatar Oct 14 '16 10:10 jacobstanley

https://github.com/hedgehogqa/haskell-hedgehog/issues/127#issuecomment-341927624

moodmosaic avatar Nov 04 '17 22:11 moodmosaic

I think this would be a great addition to Hedgehog for the obvious reasons mentioned by @moodmosaic. I miss them (coming from FsCheck) and would certainly use them if they were present.

A related suggestion is to enable tests to fail if the number of tests in certain classifications is above or below a given percentage, or absolute number, or both (and perhaps even to generate more test cases until a sufficient number is met). As the feature is implemented in FsCheck, you have to manually check the output of each test to make sure the distribution is OK.

cmeeren avatar Nov 23 '17 14:11 cmeeren

I would also really love this feature 👍. I've been looking at the code (and the Haskell package https://github.com/qfpl/tasty-hedgehog-coverage), but as far as I understood, what they do is wrap the whole Prop in an outer monad that does some funky stuff that went way over my head (it's too late at night for me to figure out nesting of monads at this point). My current thinking is that it should probably be part of the journal (just like the counterexamples). The issue with that is dealing with cases where you have 0 "hits": the registry of possible cases needs to be stored somewhere. Any thoughts?

Alxandr avatar Jul 17 '18 22:07 Alxandr

I haven't looked closely at https://github.com/qfpl/tasty-hedgehog-coverag and what this means for F# Hedgehog, but I've ported some of the original paper and IIRC this kind of thing was fairly easy to write.

moodmosaic avatar Jul 18 '18 05:07 moodmosaic

This is an interesting talk that touches on several topics relevant to this issue (labeling, coverage testing, etc.).

cmeeren avatar Jan 11 '21 14:01 cmeeren

I have come to the realization that this is a fairly important feature to me. Without it, for anything but trivial generators, I often feel like I'm flying a bit blind.

After having thought a bit about it and re-watched the talk I linked to above, I think the following would be the most useful features:

  1. A way to specify the required test coverage and have the test fail if the coverage of any classification is insufficient, after running more tests until the results are statistically significant. This is explained starting here in the video.
  2. A way to see generated (and shrunk) examples of the classifications. This is explained starting here in the video.
  3. A way to view the percentages for the classifications.

Unfortunately, I don't have the capacity to delve into new code-bases at the moment. If the changes turn out to be fairly simple, I may be able to help out if given some help and pointers about where to change what. But given that the "run more tests until statistics are good enough" part of point 1 requires reading and understanding a statistics research paper, I'm not sure it is simple. Though @moodmosaic said above that "this sort of thing" (whatever it was) was fairly simple to write, so here's hoping.

In any case, I wanted to share my thoughts.

cmeeren avatar May 19 '22 13:05 cmeeren

Essentially what we want is to add coverage combinators: cover, classify, label, and collect.


Notes

In the Haskell version:

cover records the number of times a predicate is satisfied and displays the result as a percentage. If the percentage doesn’t meet your threshold then the test fails.

classify works the same as cover but is purely informational and doesn’t have a threshold below which it will fail the test.

label is like classify but doesn’t have a predicate, so it simply tracks the percentage of tests run which hit a certain line of code.

collect is like label but can use sprintf "%A" on its argument to create the label name.
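To make the relationship between the four concrete, here is a minimal sketch in plain Haskell (base only). It is not how Haskell Hedgehog or QuickCheck implement these; the Labels type and the tally, percentages, and covered helpers are made-up names for illustration, and the statistics are deliberately ignored.

```haskell
import Data.List (group, nub, sort)
import Data.Maybe (fromMaybe)

-- Hypothetical representation: the labels recorded by one test case.
type Labels = [String]

-- classify: record the label only when the predicate holds.
classify :: String -> Bool -> Labels -> Labels
classify name hit labels = if hit then name : labels else labels

-- label: like classify, but without a predicate.
label :: String -> Labels -> Labels
label name = classify name True

-- collect: like label, but the name is derived from a value.
collect :: Show a => a -> Labels -> Labels
collect = label . show

-- Percentage of test cases in which each label occurred.
percentages :: [Labels] -> [(String, Double)]
percentages runs =
  [ (name, 100 * fromIntegral (length g) / total)
  | g@(name : _) <- group (sort (concatMap nub runs)) ]
  where
    total = fromIntegral (length runs)

-- cover: classify plus a minimum percentage that must be reached.
covered :: Double -> String -> [Labels] -> Bool
covered minimumPct name runs =
  fromMaybe 0 (lookup name (percentages runs)) >= minimumPct

-- Example labelling of one generated Int (assumed for illustration).
tally :: Int -> Labels
tally n = classify "zero" (n == 0) (label "any" [])
```

Loading this in GHCi and evaluating percentages (map tally [0, 3, 0, 7]) reports "zero" for half of the cases, so covered 50 "zero" holds for those inputs while covered 75 "zero" does not.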

In the early 2016 .NET/F# version:

Lines 453 to 511 in this gist, ported from the original QuickCheck (v1) paper, show a rough/naive implementation of those.

moodmosaic avatar May 19 '22 13:05 moodmosaic

cover records the number of times a predicate is satisfied and displays the result as a percentage. If the percentage doesn’t meet your threshold then the test fails.

According to the video, it also runs more tests as needed to ensure the result is statistically significant.

cmeeren avatar May 19 '22 16:05 cmeeren

Also, not sure what you mean by "combinators". Syntax-wise, this could also take the form of custom keywords for the property CE, right?

property {
  let! myInt = Gen.int32 (Range.exponentialBounded())
  classify "zero" (myInt = 0)
}

Though if the alternative is something like

property {
  let! myInt = Gen.int32 (Range.exponentialBounded())
  do! Gen.classify "zero" (myInt = 0)
}

then I don't really care much either way.

cmeeren avatar May 20 '22 08:05 cmeeren

Agreed 👍 That's what I mean; custom CE keywords are better. Those keywords may use the underlying combinators (to be added) in the Property module.

moodmosaic avatar May 20 '22 08:05 moodmosaic

This is such a nerd snipe for me. All your links to external references and explanation of the Haskell implementation will be very helpful. Now the only question is when I will get to this.

TysonMN avatar May 20 '22 11:05 TysonMN

According to the video, it also runs more tests as needed to ensure the result is statistically significant.

@cmeeren, cover won't run more tests; that's happening because of checkCoverage. The equivalent in hedgehog is

checkCoverage :: Property -> Property
checkCoverage =
  verifiedTermination . withConfidence (10^9)

I'll provide an example of QuickCheck's checkCoverage side-by-side with Hedgehog's API and link it here, in case it helps.

moodmosaic avatar Aug 08 '22 12:08 moodmosaic

I'll provide an example of QuickCheck's checkCoverage side-by-side with Hedgehog's API and link it here, in case it helps.

Great, I would appreciate that!

cmeeren avatar Aug 10 '22 08:08 cmeeren

@cmeeren, a good example of QuickCheck's checkCoverage can be found in this StackOverflow answer by @ploeh.

I ported all properties from that answer to Hedgehog and created a custom test-runner that runs both the original QuickCheck properties and the Hedgehog ones sequentially.

https://github.com/moodmosaic/coverage-check-example

(Install GHCUp, clone the above repo, cd into the directory, and run it via cabal test --test-show-details=streaming. You might have to run chcp 65001 if the bars aren't rendering nicely.)


@TysonMN, if this is ported to F# Hedgehog, perhaps we can have the same API as in Haskell Hedgehog, but if possible we'd rather take the statistical part from QuickCheck instead, if my comment in Haskell Hedgehog isn't resolved soon.

moodmosaic avatar Aug 10 '22 15:08 moodmosaic

The equivalent in hedgehog is

Oh, I see, you were referring to Haskell Hedgehog, not F#. I thought you were referring to F# and that this has somehow been implemented under the radar.

In any case, thanks for the clarification. 🙂

cmeeren avatar Aug 11 '22 06:08 cmeeren

I'm beginning to realise that after I got clued into this capability I use it more and more. It enables me to write simpler properties. If I may, here's a simpler example than the one linked above.

Provoked by Robert C. Martin I was recently doing the Gossiping Bus Drivers kata in Haskell, and I was writing a simple property to verify the image of my System Under Test (SUT), which is this function:

drive :: (Num b, Enum b, Ord a) => [[a]] -> Maybe b

As you can see, the output is a Maybe value, in this case because there are some (actually quite a few) inputs that will never produce an answer. If, however, it produces an answer, the value should be between 0 and 480 (consult the linked kata description if you're curious as to why that is).

The simplest way I can think of to express this property is this:

testProperty "drive image" $ \ (routes :: [NonEmptyList Int]) ->
  let actual = drive $ fmap getNonEmpty routes
  in checkCoverage $
     cover 75 (isJust actual) "solution exists" $
     all (\i -> 0 <= i && i <= 480) actual

Why use cover and checkCoverage? This is because actual is a Maybe Integer, and all works on any Foldable instance, including Maybe. In other words, the assertion verifies exactly what I stated above that the property should be: If actual holds a value, it should be in that particular interval.

What's the easiest way to pass that test, if, for example, you'd employ the Devil's Advocate?

Just return Nothing.
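The reason this works is that all over a Maybe is vacuously true when there is no value to check. A minimal sketch, where inImage and devilDrive are names invented for this illustration (they are not part of the kata or of QuickCheck):

```haskell
-- The assertion from the property above: any value that is present
-- must lie in the interval [0, 480].
inImage :: Maybe Integer -> Bool
inImage = all (\i -> 0 <= i && i <= 480)

-- Hypothetical Devil's Advocate implementation: it never finds a solution.
devilDrive :: [[Int]] -> Maybe Integer
devilDrive _ = Nothing
```

In GHCi, inImage (devilDrive [[1, 2], [2, 3]]) evaluates to True even though nothing was computed, whereas inImage (Just 481) is False. The assertion alone therefore cannot tell a working drive from one that always gives up.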

So I wanted to ensure that the Devil can't do that. How do I do that?

Without cover and checkCoverage I'd typically write complex arrange code to generate only valid input. This does have a tendency to make one repeat the implementation details of the SUT, so I'm always on the lookout for better alternatives.

I find that cover and checkCoverage neatly address that concern. I just tell QuickCheck that I don't really care exactly how it does it, but that I want it to generate 'enough' Maybe cases.

Why 75%? That particular number was just a result of a bit of trial and error. I didn't really care about the particular percentage, just that it was comfortably greater than zero, so that I knew that there would be multiple test cases that cover the Maybe partition.

For this particular test, it makes QuickCheck generate 200 tests in order to pass the 75% requirement.

ploeh avatar Mar 15 '23 21:03 ploeh

To start off with a great reference, you are preferring predicative over constructive data.

I find that cover and checkCoverage neatly address that concern. I just tell QuickCheck that I don't really care exactly how it does it, but that I want it to generate 'enough' Maybe cases. [...] For this particular test, it makes QuickCheck generate 200 tests in order to pass the 75% requirement.

I find it confusing to use cover in that way. Instead, I expect cover 75 to produce enough cases that it is X% confident that your generator produces at least 75% of its test cases as you assert, for some default value of X. It could be 99.9996%, as is the practice in the "stubborn" field of physics, or it could be 95%, as is the practice in the social sciences (which have a replication crisis). I vaguely recall John Hughes saying which value QuickCheck picked in his talk at Lambda Days 19.
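One standard way to phrase "X% confident that at least 75% of cases hit" is to require that the lower end of a confidence interval on the observed proportion stays above the threshold. The sketch below uses the Wilson score interval with a fixed z-value; it only illustrates the idea and makes no claim about what QuickCheck or Hedgehog do internally. The name confidentlyCovered is made up.

```haskell
-- Lower end of the Wilson score interval for k hits out of n trials,
-- at z standard deviations (z = 1.96 is roughly 95%, two-sided).
wilsonLower :: Double -> Int -> Int -> Double
wilsonLower z k n
  | n <= 0 = 0
  | otherwise = (centre - margin) / denom
  where
    n' = fromIntegral n
    p = fromIntegral k / n'
    z2 = z * z
    denom = 1 + z2 / n'
    centre = p + z2 / (2 * n')
    margin = z * sqrt (p * (1 - p) / n' + z2 / (4 * n' * n'))

-- Hypothetical check: is the required proportion met with confidence,
-- given k hits observed in n test cases?
confidentlyCovered :: Double -> Double -> Int -> Int -> Bool
confidentlyCovered z required k n = wilsonLower z k n >= required
```

With z = 1.96 and a required proportion of 0.75, observing 8 hits in 10 cases is not enough evidence (confidentlyCovered 1.96 0.75 8 10 is False), while 1600 hits in 2000 cases is (True). This is why a confidence-based cover needs the runner to keep generating cases until the interval is narrow enough.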

I think a more direct approach to solve this problem would be to use https://github.com/hedgehogqa/fsharp-hedgehog/blob/253634bdcf0a64f0b7e61c54d6301a240d50da0f/src/Hedgehog/Gen.fs#L290-L291

F# Hedgehog continues generating test cases until the desired quantity (100 by default) have passed... https://github.com/hedgehogqa/fsharp-hedgehog/blob/253634bdcf0a64f0b7e61c54d6301a240d50da0f/src/Hedgehog/Property.fs#L193-L196

...one has failed... https://github.com/hedgehogqa/fsharp-hedgehog/blob/253634bdcf0a64f0b7e61c54d6301a240d50da0f/src/Hedgehog/Property.fs#L210-L213

...or 100 generated values have been skipped due to predicates passed to Gen.filter returning false: https://github.com/hedgehogqa/fsharp-hedgehog/blob/253634bdcf0a64f0b7e61c54d6301a240d50da0f/src/Hedgehog/Property.fs#L197-L200

So I think the simplest change is for us to add the ability to set the discard number to an optional int. Then you could set it to None and set TestLimit to 150 (which is 75% of 200).
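The three stopping rules above, together with the proposed optional discard limit, can be sketched as a small loop. This is a simplified model written in Haskell for illustration, not a port of the linked Property.fs code; Outcome, Report, and run are invented names.

```haskell
-- Result of running one generated test case.
data Outcome = Passed | Failed | Discarded
  deriving (Eq, Show)

-- Final report, carrying the number of passed tests.
data Report = OK Int | Falsified Int | GaveUp Int
  deriving (Eq, Show)

-- Hypothetical runner: stop when testLimit cases have passed, when one
-- case fails, or when the discard limit (if any) is reached.
-- A discard limit of Nothing means "never give up on discards".
run :: Int -> Maybe Int -> [Outcome] -> Report
run testLimit discardLimit = go 0 0
  where
    go :: Int -> Int -> [Outcome] -> Report
    go passed discards outcomes
      | passed >= testLimit = OK passed
      | maybe False (discards >=) discardLimit = GaveUp passed
      | otherwise = case outcomes of
          [] -> GaveUp passed -- ran out of generated cases
          Failed : _ -> Falsified passed
          Passed : rest -> go (passed + 1) discards rest
          Discarded : rest -> go passed (discards + 1) rest
```

For example, run 3 (Just 2) [Discarded, Discarded, Passed] gives up before any test passes, while run 3 Nothing (replicate 5 Discarded ++ replicate 3 Passed) keeps going and reports OK 3.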

TysonMN avatar Mar 18 '23 12:03 TysonMN

@TysonMN, thank you for challenging my assumptions. Like everyone else, I'm vulnerable to the Golden Hammer syndrome.

I've always thought of property filters as something one puts in the beginning of the test, as a sort of preamble. Now that you suggest it, it turns out that there's no reason you can't use it in the assertion step:

testProperty "drive image" $ \ (routes :: [NonEmptyList Int]) ->
  let actual = drive $ fmap getNonEmpty routes
  in isJust actual ==>
     all (\i -> 0 <= i && i <= 480) actual

This is, indeed, simpler! Cool! Thank you.

ploeh avatar Mar 18 '23 20:03 ploeh