Add the possibility to compute statistics
This adds the possibility to compute statistics (counting all runs of a property, along with the number of times it fails and the number of times it triggers an exception) for all tests. The main goal is to provide a sound argument about whether a given change has a (positive or negative) impact on the ability of a test to detect issues. It is an incomplete draft, but advanced enough that we can decide whether it should be completed.
As we have no statistics to back the assumption that this PR has no impact, it tries to be as non-intrusive as possible. Here is how it proposes to do this:
- Statistics are enabled by setting the environment variable `$QCHECK_STATISTICS_FILE` to the path of a file in which the stats will be recorded (I think that adding a command-line option would require duplicating `QCheck_base_runner.Raw.parse_cli`, which I would rather avoid).
- A couple of QCheck functions are “routed” through `Util` so that their behaviours can be adapted when computing stats (see the sketch after this list):
  - `fail_reportf` is reexported so that it returns `false` rather than raising an exception (so that failures and exceptions can be distinguished),
  - `make_neg_test` wraps `Test.make_neg` so that it uses `Test.make` when statistics are enabled (because tests will always report success, and count failures separately); `make_test` wraps `Test.make` for uniformity,
  - `run_tests_main` is duplicated in order to run each test separately when statistics are enabled, so that we can record the stats for each test separately, and to enable the long version of tests (so `$QCHECK_LONG_FACTOR` can be used to multiply the number of tests to run).

  Note that this routing could also be useful to add optional hooks before and after each test (what #365 and #367 have done by hand for weak and ephemeron tests).
- Most of the counting logic is done in `Util.repeat`: when stats are enabled, it uses a different loop that counts iterations and negative outcomes rather than stopping at the first failure; this means that even the tests not using repetitions are now wrapped in a `repeat 1` to get statistics working.
- `Util.Stats.enabled` is exposed so that tests can choose a different repetition count depending on it.
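To make the routing concrete, here is a minimal sketch of the `Util` side. Only the names `Stats.enabled`, `fail_reportf`, `make_test`, `make_neg_test` and `repeat` come from this PR; the counter layout, reporting details and loop bodies are illustrative assumptions, not the actual implementation:

```ocaml
(* Sketch only: counter layout and loop details are assumptions. *)
module Stats = struct
  (* Stats are on iff QCHECK_STATISTICS_FILE is set in the environment. *)
  let file = Sys.getenv_opt "QCHECK_STATISTICS_FILE"
  let enabled = Option.is_some file

  (* Hypothetical global counters; the real layout may differ. *)
  let runs = ref 0
  let failures = ref 0
  let exceptions = ref 0
end

(* Reexported [fail_reportf]: under stats, print the report and return
   [false] instead of raising, keeping failures distinct from exceptions. *)
let fail_reportf fmt =
  if Stats.enabled
  then Format.kasprintf (fun msg -> prerr_endline msg; false) fmt
  else QCheck.Test.fail_reportf fmt

(* Under stats, a negative test is built with [Test.make]: the property
   reports success and failures are counted separately. *)
let make_neg_test ?name ?count arb prop =
  if Stats.enabled
  then QCheck.Test.make ?name ?count arb prop
  else QCheck.Test.make_neg ?name ?count arb prop

(* [make_test] wraps [Test.make] unchanged, for uniformity. *)
let make_test ?name ?count arb prop = QCheck.Test.make ?name ?count arb prop

(* Counting loop: under stats, run all iterations instead of stopping at
   the first failure, and tally the negative outcomes. *)
let repeat n prop x =
  if not Stats.enabled
  then
    (* original behaviour: stop at the first counterexample *)
    (let rec loop i = i = 0 || (prop x && loop (i - 1)) in
     loop n)
  else begin
    for _ = 1 to n do
      incr Stats.runs;
      (match prop x with
       | true -> ()
       | false -> incr Stats.failures
       | exception _ -> incr Stats.exceptions)
    done;
    true (* report success; negative outcomes live in the counters *)
  end
```

Note how, under stats, `repeat` always reports success, which is why `make_neg_test` must fall back to `Test.make`: the failures a negative test expects are now counted rather than reported.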
A possible alternate design, which would not rely quite so much on global state in `Util` (and would thus allow using `run_tests` without splitting test lists), would be to have `Util.make_test` and `Util.make_neg_test` expect a `Stats.t -> 'a -> bool` function (which would be the result of `repeat`, with the added benefit that the change of type detects places that are not updated...) and to generate fresh stats for each test. It is not yet in this version of the PR because I'm not sure which design is the best choice.
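For comparison, a hedged sketch of that alternate design; only the `Stats.t -> 'a -> bool` type above comes from the PR, while the record fields and function bodies are assumptions:

```ocaml
(* Alternate design sketch: properties receive their per-test stats
   explicitly, so [Util] needs no global state. Field names are made up. *)
module Stats = struct
  type t = { mutable runs : int; mutable failures : int; mutable exceptions : int }
  let fresh () = { runs = 0; failures = 0; exceptions = 0 }
end

(* [repeat] now yields a [Stats.t -> 'a -> bool] property: any call site
   left unconverted is flagged by the type checker. *)
let repeat n prop : Stats.t -> 'a -> bool = fun stats x ->
  let ok = ref true in
  for _ = 1 to n do
    stats.Stats.runs <- stats.Stats.runs + 1;
    (match prop x with
     | true -> ()
     | false -> stats.Stats.failures <- stats.Stats.failures + 1; ok := false
     | exception _ -> stats.Stats.exceptions <- stats.Stats.exceptions + 1; ok := false)
  done;
  !ok

(* Each test gets fresh stats, so whole test lists can be passed to
   [run_tests] without splitting. *)
let make_test ?name ?count arb (prop : Stats.t -> 'a -> bool) =
  let stats = Stats.fresh () in
  QCheck.Test.make ?name ?count arb (prop stats)
```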
To do before merging, if this proposal is good enough:
- choose one of the two possible designs,
- split out a commit with just the rerouting through `Util` to avoid useless history noise,
- cover the tests that are not yet covered (discovered through the experiment with the change in the `repeat` and `make_test` signatures),
- document the new functions exposed by `Util` and the changes of behaviour,
- remove the commit that enables stats in GH CI runs,
- integrate weak and ephemeron tests in that standard workflow.
To do in a later PR:
- add dune targets that run the stats; I'd rather move this to a different PR because the integration I have in mind requires a lot of boilerplate dune config, so it would make sense to take this opportunity to also split the test suite into smaller parts (I'm thinking of 4 aliases: stable, experimental, complete (covering the first two) and deprecated), along with a choice of whether stats should be enabled or not. A small script would generate all that dune config.
Wow, I was just hacking on statistics myself in order to improve the `Out_channel` situation...
Here's a branch: https://github.com/ocaml-multicore/multicoretests/tree/generic-stats
I considered adding bindings to `Util` as we discussed, but instead went with deriving a generic stats-test from a `Lin` or `STM` specification. That lends itself to separating a test specification from a test-runner/stats-driver.
The down-side is having to write separate drivers.
On the other hand, it decouples the stats from negative/positive testing and any platform-disabling we are doing.
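To illustrate the driver idea (not the actual interface of the linked branch; the spec signature below is deliberately simplified and hypothetical), the same specification can feed both an ordinary QCheck test driver and a stats driver:

```ocaml
(* Hypothetical, heavily simplified spec signature; the real Lin/STM
   signatures are richer (see the linked branch for the actual design). *)
module type Simple_spec = sig
  type cmd
  val arb_cmd : cmd QCheck.arbitrary
  val prop : cmd list -> bool   (* the property a driver exercises *)
end

(* Ordinary test driver: derives a QCheck test from the spec. *)
module Test_driver (Spec : Simple_spec) = struct
  let test = QCheck.Test.make ~name:"spec test" (QCheck.list Spec.arb_cmd) Spec.prop
end

(* Stats driver: exercises the same spec but only tallies outcomes, with
   no notion of negative/positive tests or platform-specific disabling. *)
module Stats_driver (Spec : Simple_spec) = struct
  let run ~count =
    let rand = Random.State.make_self_init () in
    let gen = QCheck.gen (QCheck.list Spec.arb_cmd) in
    let failures = ref 0 and exceptions = ref 0 in
    for _ = 1 to count do
      (match Spec.prop (QCheck.Gen.generate1 ~rand gen) with
       | true -> ()
       | false -> incr failures
       | exception _ -> incr exceptions)
    done;
    Printf.printf "runs=%d failures=%d exceptions=%d\n" count !failures !exceptions
end
```

The trade-off mentioned above is visible here: `Test_driver` and `Stats_driver` are separate pieces of code, but neither needs to know about the other.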