Add the possibility to compute statistics
This adds the possibility to compute statistics (counting all runs of a property, along with the number of times it fails and the number of times it triggers an exception) for all tests. The main goal is to provide a sound argument about whether a given change has a (positive or negative) impact on the ability of a test to detect issues. It is an incomplete draft, but advanced enough that we can decide whether it should be completed.
As we have no statistics to back the assumption that this PR has no impact, it tries to be as non-intrusive as possible. Here is how it proposes to do this:
- Statistics are enabled by setting the environment variable `$QCHECK_STATISTICS_FILE` to the path of a file in which the stats will be recorded (I think that adding a command-line option would require duplicating `QCheck_base_runner.Raw.parse_cli`, which I would rather avoid).
- A couple of QCheck functions are “routed” through `Util` so that their behaviours can be adapted when computing stats (see the sketch after this list):
  - `fail_reportf` is reexported so that it returns `false` rather than raising an exception (so that failures and exceptions can be distinguished),
  - `make_neg_test` wraps `Test.make_neg` so that it uses `Test.make` when statistics are enabled (because tests will always report success, and count failures separately); `make_test` wraps `Test.make` for uniformity,
  - `run_tests_main` is duplicated in order to run each test separately when statistics are enabled, so that we can record the stats for each test separately, and to enable the long version of tests (so `$QCHECK_LONG_FACTOR` can be used to multiply the number of tests to run).

  Note that this routing could also be useful to add optional hooks before and after each test (what #365 and #367 have done by hand for weak and ephemeron tests).
- Most of the counting logic is done in `Util.repeat`: when stats are enabled, it uses a different loop that counts iterations and negative outcomes rather than stopping at the first failure; this means that even the tests not using repetitions are now wrapped in a `repeat 1` to get statistics working.
- `Util.Stats.enabled` is exposed so that tests can choose a different repetition count depending on it.
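To make the routing concrete, here is a minimal sketch of the `Util` side. Only the names `Stats.enabled`, `fail_reportf`, `make_test`, `make_neg_test` and `repeat` come from this PR; the counter layout, reporting details and loop bodies are illustrative assumptions, not the actual implementation:

```ocaml
(* Sketch only: counter layout and loop details are assumptions. *)
module Stats = struct
  (* Stats are on iff QCHECK_STATISTICS_FILE is set in the environment. *)
  let file = Sys.getenv_opt "QCHECK_STATISTICS_FILE"
  let enabled = Option.is_some file

  (* Hypothetical global counters; the real layout may differ. *)
  let runs = ref 0
  let failures = ref 0
  let exceptions = ref 0
end

(* Reexported [fail_reportf]: under stats, print the report and return
   [false] instead of raising, keeping failures distinct from exceptions. *)
let fail_reportf fmt =
  if Stats.enabled
  then Format.kasprintf (fun msg -> prerr_endline msg; false) fmt
  else QCheck.Test.fail_reportf fmt

(* Under stats, a negative test is built with [Test.make]: the property
   reports success and failures are counted separately. *)
let make_neg_test ?name ?count arb prop =
  if Stats.enabled
  then QCheck.Test.make ?name ?count arb prop
  else QCheck.Test.make_neg ?name ?count arb prop

(* [make_test] wraps [Test.make] unchanged, for uniformity. *)
let make_test ?name ?count arb prop = QCheck.Test.make ?name ?count arb prop

(* Counting loop: under stats, run all iterations instead of stopping at
   the first failure, and tally the negative outcomes. *)
let repeat n prop x =
  if not Stats.enabled
  then
    (* original behaviour: stop at the first counterexample *)
    (let rec loop i = i = 0 || (prop x && loop (i - 1)) in
     loop n)
  else begin
    for _ = 1 to n do
      incr Stats.runs;
      (match prop x with
       | true -> ()
       | false -> incr Stats.failures
       | exception _ -> incr Stats.exceptions)
    done;
    true (* report success; negative outcomes live in the counters *)
  end
```

Note how, under stats, `repeat` always reports success, which is why `make_neg_test` must fall back to `Test.make`: the failures a negative test expects are now counted rather than reported.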
A possible alternate design, which would not rely quite so much on global state in `Util` (and would thus allow using `run_tests` without splitting test lists), would be to have `Util.make_test` and `Util.make_neg_test` expect a `Stats.t -> 'a -> bool` function (which would be the result of `repeat`, with the added benefit that the change of type detects places that are not updated...) and to generate fresh stats for each test. It is not yet in this version of the PR because I'm not sure which design is the best choice.
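For comparison, a hedged sketch of that alternate design; only the `Stats.t -> 'a -> bool` type above comes from the PR, while the record fields and function bodies are assumptions:

```ocaml
(* Alternate design sketch: properties receive their per-test stats
   explicitly, so [Util] needs no global state. Field names are made up. *)
module Stats = struct
  type t = { mutable runs : int; mutable failures : int; mutable exceptions : int }
  let fresh () = { runs = 0; failures = 0; exceptions = 0 }
end

(* [repeat] now yields a [Stats.t -> 'a -> bool] property: any call site
   left unconverted is flagged by the type checker. *)
let repeat n prop : Stats.t -> 'a -> bool = fun stats x ->
  let ok = ref true in
  for _ = 1 to n do
    stats.Stats.runs <- stats.Stats.runs + 1;
    (match prop x with
     | true -> ()
     | false -> stats.Stats.failures <- stats.Stats.failures + 1; ok := false
     | exception _ -> stats.Stats.exceptions <- stats.Stats.exceptions + 1; ok := false)
  done;
  !ok

(* Each test gets fresh stats, so whole test lists can be passed to
   [run_tests] without splitting. *)
let make_test ?name ?count arb (prop : Stats.t -> 'a -> bool) =
  let stats = Stats.fresh () in
  QCheck.Test.make ?name ?count arb (prop stats)
```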
To do before merging, if this proposal is good enough:
- choose one of the two possible designs,
- split out a commit with just the rerouting through `Util` to avoid useless history noise,
- cover the tests that are not yet covered (discovered through the experiment with the change in the `repeat` and `make_test` signatures),
- document the new functions exposed by `Util` and the changes of behaviour,
- remove the commit that enables stats in GH CI runs,
- integrate weak and ephemeron tests in that standard workflow.
To do in a later PR:
- add dune targets that run the stats; I'd rather move this to a different PR because the integration I have in mind requires a lot of boilerplate dune config, so it would make sense to take this opportunity to also split the test suite into smaller parts (I'm thinking of 4 aliases: stable, experimental, complete (covering the first two) and deprecated), along with a choice of whether stats should be enabled or not. A small script would generate all that dune config.
Wow, I was just hacking on statistics myself in order to improve the `Out_channel` situation...
Here's a branch: https://github.com/ocaml-multicore/multicoretests/tree/generic-stats
I considered adding bindings to `Util` as we discussed, but instead went with deriving a generic stats-test from a `Lin` or `STM` specification. That lends itself to separating a test specification from a test-runner/stats-driver.
The down-side is having to write separate drivers.
On the other hand, it decouples the stats from negative/positive testing and any platform-disabling we are doing.
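To illustrate the driver idea (not the actual interface of the linked branch; the spec signature below is deliberately simplified and hypothetical), the same specification can feed both an ordinary QCheck test driver and a stats driver:

```ocaml
(* Hypothetical, heavily simplified spec signature; the real Lin/STM
   signatures are richer (see the linked branch for the actual design). *)
module type Simple_spec = sig
  type cmd
  val arb_cmd : cmd QCheck.arbitrary
  val prop : cmd list -> bool   (* the property a driver exercises *)
end

(* Ordinary test driver: derives a QCheck test from the spec. *)
module Test_driver (Spec : Simple_spec) = struct
  let test = QCheck.Test.make ~name:"spec test" (QCheck.list Spec.arb_cmd) Spec.prop
end

(* Stats driver: exercises the same spec but only tallies outcomes, with
   no notion of negative/positive tests or platform-specific disabling. *)
module Stats_driver (Spec : Simple_spec) = struct
  let run ~count =
    let rand = Random.State.make_self_init () in
    let gen = QCheck.gen (QCheck.list Spec.arb_cmd) in
    let failures = ref 0 and exceptions = ref 0 in
    for _ = 1 to count do
      (match Spec.prop (QCheck.Gen.generate1 ~rand gen) with
       | true -> ()
       | false -> incr failures
       | exception _ -> incr exceptions)
    done;
    Printf.printf "runs=%d failures=%d exceptions=%d\n" count !failures !exceptions
end
```

The trade-off mentioned above is visible here: `Test_driver` and `Stats_driver` are separate pieces of code, but neither needs to know about the other.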