fsharp-hedgehog
Observing Test Case Distribution
Taken from the QuickCheck manual:

> It is important to be aware of the distribution of test cases: if test data is not well distributed then conclusions drawn from the test results may be invalid.
Thus, we could consider adding `label`, `classify`, and `collect` in the `Property` module.
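For reference, this is roughly the style of observation the QuickCheck manual has in mind. A minimal sketch (the `prop_insert` name and the use of `Data.List.insert` are illustrative, not taken verbatim from the manual):

```haskell
import Data.List (insert, sort)
import Test.QuickCheck

-- Label each case with the length of its input (collect) and flag the
-- trivial empty-list cases (classify), so the reported distribution
-- shows whether the property is being exercised meaningfully.
prop_insert :: Int -> [Int] -> Property
prop_insert x xs =
  classify (null xs) "trivial" $
  collect (length xs) $
  ordered (insert x (sort xs))
  where
    ordered ys = and (zipWith (<=) ys (drop 1 ys))
```

Running `quickCheck prop_insert` then prints the percentage of trivial cases and a histogram of the collected lengths alongside the verdict, which is exactly the kind of distribution information being proposed here.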
I have never used this kind of thing myself, but I can see the appeal. I usually test generators using `printSample` and eyeballing the result; perhaps this is better, though.
https://github.com/hedgehogqa/haskell-hedgehog/issues/127#issuecomment-341927624
I think this would be a great addition to Hedgehog for the obvious reasons mentioned by @moodmosaic. I miss them (coming from FsCheck) and would certainly use them if they were present.
A related suggestion is to enable tests to fail if the number of tests in certain classifications is above or below a given percentage, or absolute number, or both (and perhaps even to generate more test cases until a sufficient number is met). As the feature is implemented in FsCheck, you have to manually check the output of each test to make sure the distribution is OK.
I would also really love this feature 👍. I've been looking at the code (and the Haskell package https://github.com/qfpl/tasty-hedgehog-coverage), but as far as I understood, what they do is wrap the whole `Prop` in an outer monad that does some funky stuff that went way over my head (it's too late at night for me to figure out nesting of monads at this point). My current thinking is that it should probably be part of the journal (just like the counterexamples); the issue with that is dealing with cases where you have 0 "hits". The registry of possible cases needs to be stored somewhere. Any thoughts?
I haven't looked closely at https://github.com/qfpl/tasty-hedgehog-coverage and what this means for F# Hedgehog, but I've ported some of the original paper, and IIRC this kind of thing was fairly easy to write.
This is an interesting talk that touches on several topics relevant to this issue (labeling, coverage testing, etc.)
I have come to the realization that this is a fairly important feature to me. Without it, for anything but trivial generators, I often feel like I'm flying a bit blind.
After having thought a bit about it and re-watched the talk I linked to above, I think the following would be the most useful features:
- A way to specify the required test coverage and have the test fail if the coverage of any classification is insufficient, after running more tests until the results are statistically significant. This is explained starting here in the video.
- A way to see generated (and shrunk) examples of the classifications. This is explained starting here in the video.
- A way to view the percentages for the classifications.
Unfortunately I don't really have the capacity to really delve into new code-bases at the moment. If the changes turn out to be fairly simple, I may be able to help out if given some help and pointers about where to change what. But given that the "run more tests until statistics are good enough" part of point 1 requires reading and understanding a statistics research paper, I'm not sure it is simple. Though @moodmosaic said above that "this sort of thing" (whatever it was) was fairly simple to write, so here's hoping.
In any case, I wanted to share my thoughts.
Essentially what we want is to add coverage combinators: `cover`, `classify`, `label`, and `collect`.
Notes

In the Haskell version:

- `cover` records the number of times a predicate is satisfied and displays the result as a percentage. If the percentage doesn't meet your threshold then the test fails.
- `classify` works the same as `cover` but is purely informational and doesn't have a threshold below which it will fail the test.
- `label` is like `classify` but doesn't have a predicate, so it simply tracks the percentage of tests run which hit a certain line of code.
- `collect` is like `label` but can use `sprintf "%A"` on its argument to create the label name.
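To make those concrete, here is a minimal usage sketch against the Haskell Hedgehog API (assuming hedgehog 1.x; the property itself is just an illustrative placeholder):

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Hedgehog
import qualified Hedgehog.Gen as Gen
import qualified Hedgehog.Range as Range

prop_distribution :: Property
prop_distribution =
  property $ do
    x <- forAll $ Gen.int (Range.linearFrom 0 (-100) 100)
    cover 30 "negative" (x < 0)   -- fails the run if < 30 % of cases are negative
    classify "zero" (x == 0)      -- informational only, never fails
    label "executed"              -- no predicate, counts every test
    collect (x `mod` 10)          -- builds the label from the value itself
    assert (x * x >= 0)
```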
In the early 2016 .NET/F# version:
Lines 453 to 511 in this gist, ported from the original QuickCheck (v1) paper, show a rough/naive implementation of those.
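The gist isn't reproduced here, but in the spirit of that paper, a rough, self-contained Haskell sketch of such a naive label-counting driver (not the gist's actual F# code) could look like this:

```haskell
import qualified Data.Map.Strict as Map

-- A test result pairs a pass/fail verdict with the labels ("stamps")
-- recorded for that particular case.
type Labelled = (Bool, [String])

-- Run a property over pre-generated inputs and report how often each
-- label occurred, QuickCheck-v1 style.
runLabelled :: [a] -> (a -> Labelled) -> IO ()
runLabelled inputs prop
  | all fst results = do
      putStrLn ("OK, passed " ++ show total ++ " tests.")
      mapM_ report (Map.toList stamps)
  | otherwise = putStrLn "Falsifiable."
  where
    results = map prop inputs
    total = length results
    stamps = Map.fromListWith (+) [ (l, 1 :: Int) | (_, ls) <- results, l <- ls ]
    report (l, n) = putStrLn (show (100 * n `div` total) ++ "% " ++ l)

-- Example: track how often the generated input happens to be zero.
main :: IO ()
main =
  runLabelled [-10 .. 10 :: Int] $ \i ->
    (i * i >= 0, [ "zero" | i == 0 ])
```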
> `cover` records the number of times a predicate is satisfied and displays the result as a percentage. If the percentage doesn't meet your threshold then the test fails.

According to the video, it also runs more tests as needed to ensure the result is statistically significant.
Also, not sure what you mean by "combinators". Syntax-wise, this could also take the form of custom keywords for the `property` CE, right?
```fsharp
property {
    let! myInt = Gen.int32 (Range.exponentialBounded())
    classify "zero" (myInt = 0)
}
```
Though if the alternative is something like
```fsharp
property {
    let! myInt = Gen.int32 (Range.exponentialBounded())
    do! Gen.classify "zero" (myInt = 0)
}
```
then I don't really care much either way.
Agreed 👍 That's what I mean; custom CE keywords are better. Those keywords may use the underlying combinators (to be added) in the Property module.
This is such a nerd snipe for me. All your links to external references and explanation of the Haskell implementation will be very helpful. Now the only question is when I will get to this.
> According to the video, it also runs more tests as needed to ensure the result is statistically significant.

@cmeeren, `cover` won't run more tests; that's happening because of `checkCoverage`. The equivalent in `hedgehog` is:
```haskell
checkCoverage :: Property -> Property
checkCoverage =
  verifiedTermination . withConfidence (10^9)
```
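For example, wrapping a property this way (a rough sketch, assuming a hedgehog version that exposes `withConfidence` and `verifiedTermination`) keeps generating tests until the stated coverage is statistically confirmed, or fails the property otherwise:

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Hedgehog
import qualified Hedgehog.Gen as Gen
import qualified Hedgehog.Range as Range

-- Keep running tests until it is statistically confident (to 1 in 10^9)
-- that at least 30 % of the generated ints really are positive, and fail
-- the property if that coverage cannot be reached.
prop_checked :: Property
prop_checked =
  verifiedTermination . withConfidence (10 ^ 9) . property $ do
    x <- forAll $ Gen.int (Range.linearFrom 0 (-100) 100)
    cover 30 "positive" (x > 0)
```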
I'll provide an example of QuickCheck's `checkCoverage` side-by-side with Hedgehog's API and link it here, in case it helps.
> I'll provide an example of QuickCheck's `checkCoverage` side-by-side with Hedgehog's API and link it here, in case it helps.

Great, I would appreciate that!
@cmeeren, a good example of QuickCheck's `checkCoverage` can be found in this StackOverflow answer by @ploeh.
I ported all properties from that answer to Hedgehog and created a custom test runner that runs both the original QuickCheck properties and the Hedgehog ones sequentially.
https://github.com/moodmosaic/coverage-check-example
(Install GHCup, clone the above repo, `cd` into the directory, and run it via `cabal test --test-show-details=streaming`. You might have to run `chcp 65001` if the bars aren't rendering nicely.)
@TysonMN, if this is ported to F# Hedgehog, perhaps we can have the same API as in Haskell Hedgehog, but if possible we'd rather take the statistical part from QuickCheck instead, at least if my comment in Haskell Hedgehog isn't resolved soon.
> The equivalent in `hedgehog` is

Oh, I see, you were referring to Haskell Hedgehog, not F#. I thought you were referring to F# and that this had somehow been implemented under the radar.

In any case, thanks for the clarification. 🙂
I'm beginning to realise that after I got clued into this capability I use it more and more. It enables me to write simpler properties. If I may, here's a simpler example than the one linked above.
Provoked by Robert C. Martin I was recently doing the Gossiping Bus Drivers kata in Haskell, and I was writing a simple property to verify the image of my System Under Test (SUT), which is this function:
```haskell
drive :: (Num b, Enum b, Ord a) => [[a]] -> Maybe b
```
As you can see, the output is a `Maybe` value, in this case because there are some (actually quite a few) inputs that will never produce an answer. If, however, it produces an answer, the value should be between 0 and 480 (consult the linked kata description if you're curious as to why that is).
The simplest way I can think of to express this property is this:
testProperty "drive image" $ \ (routes :: [NonEmptyList Int]) ->
let actual = drive $ fmap getNonEmpty routes
in checkCoverage $
cover 75 (isJust actual) "solution exists" $
all (\i -> 0 <= i && i <= 480) actual
Why use `cover` and `checkCoverage`? This is because `actual` is a `Maybe Integer`, and `all` works on any `Foldable` instance, including `Maybe`. In other words, the assertion verifies exactly what I stated above that the property should be: if `actual` holds a value, it should be in that particular interval.
What's the easiest way to pass that test if, for example, you'd employ the Devil's Advocate? Just return `Nothing`.
So I wanted to ensure that the Devil can't do that. How do I do that?

Without `cover` and `checkCoverage` I'd typically write complex arrange code to generate only valid input. This does have a tendency to make one repeat the implementation details of the SUT, so I'm always on the lookout for better alternatives.
I find that `cover` and `checkCoverage` neatly address that concern. I just tell QuickCheck that I don't really care exactly how it does it, but that I want it to generate 'enough' `Maybe` cases.

Why 75%? That particular number was just a result of a bit of trial and error. I didn't really care about the particular percentage, just that it was comfortably greater than zero, so that I knew that there would be multiple test cases that cover the `Maybe` partition.

For this particular test, it makes QuickCheck generate 200 tests in order to pass the 75% requirement.
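To see why the coverage requirement matters here, consider a rough, self-contained QuickCheck sketch (with a deliberately degenerate `driveCheat` that always returns `Nothing`, and plain `quickCheck` instead of tasty's `testProperty`):

```haskell
import Data.Maybe (isJust)
import Test.QuickCheck

-- The Devil's Advocate implementation: it never produces an answer.
driveCheat :: [[Int]] -> Maybe Int
driveCheat _ = Nothing

-- 'all' over Nothing is vacuously True, so without the coverage
-- requirement this property would always pass.
prop_driveImage :: [NonEmptyList Int] -> Property
prop_driveImage routes =
  let actual = driveCheat (fmap getNonEmpty routes)
  in checkCoverage $
     cover 75 (isJust actual) "solution exists" $
     all (\i -> 0 <= i && i <= 480) actual

main :: IO ()
main = quickCheck prop_driveImage
```

With the `cover`/`checkCoverage` pair in place, QuickCheck keeps generating tests and eventually rejects the run for insufficient coverage instead of passing it vacuously.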
To start off with a great reference: you are preferring predicative over constructive data.
> I find that `cover` and `checkCoverage` neatly address that concern. I just tell QuickCheck that I don't really care exactly how it does it, but that I want it to generate 'enough' `Maybe` cases. [...] For this particular test, it makes QuickCheck generate 200 tests in order to pass the 75% requirement.
I find it confusing to use `cover` in that way. Instead, I expect `cover 75` to produce enough cases that it is X% confident that your generator produces at least 75% of its test cases as you assert... for some default value of X. It could be 6σ (or 99.9996%), as is the practice in the "stubborn" field of physics, or it could be 2σ (or 95%), as is the practice in the social sciences (which have a replication crisis). I vaguely recall John Hughes saying which value QuickCheck picked in his Lambda Days 19 talk.
I think a more direct approach to solve this problem would be to use this:
https://github.com/hedgehogqa/fsharp-hedgehog/blob/253634bdcf0a64f0b7e61c54d6301a240d50da0f/src/Hedgehog/Gen.fs#L290-L291

F# Hedgehog continues generating test cases until the desired quantity (100 by default) have passed...
https://github.com/hedgehogqa/fsharp-hedgehog/blob/253634bdcf0a64f0b7e61c54d6301a240d50da0f/src/Hedgehog/Property.fs#L193-L196

...one has failed...
https://github.com/hedgehogqa/fsharp-hedgehog/blob/253634bdcf0a64f0b7e61c54d6301a240d50da0f/src/Hedgehog/Property.fs#L210-L213

...or 100 generated values have been skipped due to predicates passed to `Gen.filter` returning `false`:
https://github.com/hedgehogqa/fsharp-hedgehog/blob/253634bdcf0a64f0b7e61c54d6301a240d50da0f/src/Hedgehog/Property.fs#L197-L200

So I think the simplest change is for us to add the ability to set the discard number to an optional int. Then you could set it to `None` and set `TestLimit` to 150 (which is 75% of 200).
@TysonMN, thank you for challenging my assumptions. Like everyone else, I'm vulnerable to the Golden Hammer syndrome.

I've always thought of property filters as something one puts at the beginning of the test, as a sort of preamble. Now that you suggest it, it turns out that there's no reason you can't use it in the assertion step:
testProperty "drive image" $ \ (routes :: [NonEmptyList Int]) ->
let actual = drive $ fmap getNonEmpty routes
in isJust actual ==>
all (\i -> 0 <= i && i <= 480) actual
This is, indeed, simpler! Cool! Thank you.