
Benchmark test fails locally (preventing publish of forked repository)

Open LSaldyt opened this issue 6 years ago • 5 comments

When issuing lein test, the following test fails on my machine:

lein test :only anglican.algorithm-test/test-benchmarks

FAIL in (test-benchmarks) (algorithm_test.clj:284)
[:gaussian :lmh]
expected: (< error (:threshold benchmark))
  actual: (not (< 0.2106529389611469 0.1))

A few other tests fail at the same line location. Is there some kind of hard-coded machine-specific value that is causing this?

LSaldyt avatar Jun 26 '18 22:06 LSaldyt

These are statistical tests implemented incorrectly (in the sense that they fail sometimes). Disable algorithm tests for publishing until this is fixed properly.

dtolpin avatar Jul 09 '18 13:07 dtolpin

Feel free to close this, it was just FYI, not a preventative issue.

LSaldyt avatar Jul 09 '18 14:07 LSaldyt

This is what I wrote a long time ago, but regretfully nobody implemented it:

Stochastic tests are both probabilistic and time-consuming. If we want them to succeed with high probability, they will either run too long or the margins will be too loose; neither is desirable.

A solution is to make them iterative, with reasonable bounds on both accuracy and the number of iterations.

(loop [succeeded false
       i 0]
  (if (or succeeded (= i number-of-iterations))
    (is succeeded)
    (recur (run-test) (inc i))))

This way, the test terminates as soon as it succeeds, but will run for at most number-of-iterations attempts.

I do not propose to make tests looser. On the contrary, I propose to make them tighter. What we recently did was make a correct test looser because it failed sometimes. I believe that in such cases one should simply increase the number of iterations. There is a simple justification for that.

The probability of success increases exponentially with the number of iterations. That is, if your correct test fails with probability 0.1 per run, three iterations take you from a success probability of 0.9 to 0.999.
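The arithmetic is easy to check. A minimal sketch in Python (not part of the Anglican test suite; `all_runs_fail` is a name made up for illustration):

```python
# Probability that a test failing with probability p per independent run
# fails all n runs, i.e. the retry loop above still reports failure.
def all_runs_fail(p, n):
    return p ** n

print(round(all_runs_fail(0.1, 1), 6))  # 0.1   -> success probability 0.9
print(round(all_runs_fail(0.1, 3), 6))  # 0.001 -> success probability 0.999
```

With three retries, the false-negative rate of a correct-but-noisy test drops from 0.1 to 0.001.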

If, however, the test fails because of a bug, it will most probably fail with high probability. Even with a per-iteration failure probability of 0.9, after three iterations it will still fail with probability 0.9 * 0.9 * 0.9 = 0.729. So by increasing the number of iterations you bring false negatives down from 0.1 to 0.001, while false positives stay sufficiently low: when there is a bug, your test suite will still fail on roughly three out of every four invocations.

Going up to a single-iteration accuracy of 0.95, you get a false-negative rate of 0.05^3 = 0.000125 and a false-positive rate of 1 - 0.95^3 = 0.143. Pretty spectacular.
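Both rates fall out of two one-line formulas: a false negative needs the correct test to fail every retry, and a false positive needs the buggy test to pass at least one retry. A quick Python check (the function names are illustrative, not Anglican's):

```python
# False-negative rate: a correct test fails every one of n retries.
def false_negative(p_correct_fail, n):
    return p_correct_fail ** n

# False-positive rate: a buggy test, which fails with probability q per
# run, passes at least one of n retries and so escapes detection.
def false_positive(q_buggy_fail, n):
    return 1 - q_buggy_fail ** n

n = 3
print(round(false_negative(0.1, n), 6), round(false_positive(0.9, n), 6))
print(round(false_negative(0.05, n), 6), round(false_positive(0.95, n), 6))
```

The second line of output gives the 0.000125 / 0.143 trade-off for a per-iteration accuracy of 0.95.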

On the other hand, if you want to achieve this result within a single iteration using some form of Monte Carlo sampling (which has a convergence rate of 1/sqrt(N)), then to go from 0.1 to 0.001 false negatives you will need one MILLION times as many samples. Not feasible.
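The 1/sqrt(N) rate can be seen empirically: quadrupling the sample count only halves the spread of a Monte Carlo mean. A small standard-library simulation, nothing Anglican-specific:

```python
import random
import statistics

random.seed(0)

def mc_mean(n):
    # Monte Carlo estimate of the mean of Uniform(0, 1), which is 0.5.
    return sum(random.random() for _ in range(n)) / n

def spread(n, reps=2000):
    # Empirical standard deviation of the estimator over many experiments.
    return statistics.pstdev(mc_mean(n) for _ in range(reps))

s1, s4 = spread(100), spread(400)
print(s1 / s4)  # close to 2.0: quadrupling N halves the error
```

So shrinking the error by 1000x to buy the same reliability in one shot costs a factor of 1000^2 = 10^6 in samples, which is the author's point.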

That's why a single test should be set up for reasonable accuracy, say 0.95, and then repeated up to a relatively small number of iterations, dramatically increasing the rate of success without impairing the recall too much. The way we run the tests, even if a buggy test fails only occasionally, we will see the bug. And a 90% failure rate on a bug is good enough.

dtolpin avatar Jul 09 '18 14:07 dtolpin

Just the opposite, I am going to keep this open here too. I think there is an issue on this on bitbucket (where things belong, historically). Maybe I should figure out a way to accept patches to both repositories (tweak the behind-the-scenes sync script).

dtolpin avatar Jul 09 '18 14:07 dtolpin

I guess the hardest part of a two-way sync would be the resolution of merge conflicts... You might be able to avoid these by having the two repositories sync very frequently (say, every minute or so), because then you would only get merge conflicts if two conflicting commits were pushed in the same minute. Just an idea. Let me know once you get it figured out. Otherwise, I could make some PRs to the bitbucket version later this week.

Lastly, I'm sure you have your reasons for using bitbucket, but it seems that github is more standard. Cheers, - Lucas

LSaldyt avatar Jul 11 '18 01:07 LSaldyt