
Hypothesis Seed not reproducing "max allowable size" HealthCheck

Open · alanhdu opened this issue on Aug 24, 2022

I have a test that is intermittently failing with hypothesis.errors.FailedHealthCheck: Examples routinely exceeded the max allowable size. (see https://gist.github.com/alanhdu/dc2248ffc5697e91c84944ce85e63a16 as an example). Unfortunately, I have not been able to figure out how to reproduce this reliably -- when I use the provided --hypothesis-seed from the failure message to invoke my tests, all the tests pass and the health check is not triggered!
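
For reference, the health check that fires here is HealthCheck.data_too_large. A minimal sketch of suppressing it on a single test, purely as a diagnostic while hunting for the real cause (the test body below is hypothetical):

import hypothesis
from hypothesis import HealthCheck, strategies as st

# Sketch only: silence the "max allowable size" health check for one test.
# This hides the symptom rather than fixing anything, but it confirms which
# check is firing.
@hypothesis.settings(suppress_health_check=[HealthCheck.data_too_large])
@hypothesis.given(st.data())
def test_placeholder(data):
    data.draw(st.integers())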

This is on hypothesis 6.47.1 and Python 3.8 (installed via conda). Unfortunately, this is in a fairly large internal codebase that I can't share, and I haven't been able to find a minimal reproducing example, so I can't give much tangible information. Some things that might be helpful, though:

  1. Specifying --hypothesis-seed does not reproduce the test failure (either by running the single test alone or by running the entire test suite)
  2. I wrote a small bash script to execute py.test selecting the individual test 10000 times, and none of those times triggered the health check failure. I wonder whether reproducing this somehow requires running the full test suite (as opposed to just the single test).
  3. I feel like the actual test shouldn't trigger this failure -- it draws at most 6 items, all of which are fairly small
Draw 1: st.sampled_from([16, 8, 6])
Draw 2: st.sampled_from([(-1, 0, 1), (-1.0, 0.0, 1.0), (-0.5, 0.5), (0,)])
Draw 3: st.booleans()  # optional
Draw 4: st.lists(st.integers(1, 8), min_size=1, max_size=3)
Draw 5: st.sampled_from(["mean", "max"])
Draw 6: st.lists(st.integers(1, 5), min_size=2, max_size=5)  # turned into a `torch.Tensor`, FWIW
  4. We have a bit of an odd testing setup, which may or may not be related. Basically, we have something like:
import abc

import hypothesis
from hypothesis import strategies as st

class Assertions(metaclass=abc.ABCMeta):
    @abc.abstractmethod
    def construct_object(self, data: st.DataObject) -> OurInterface:
        ...

    # there are several of these tests for different parts of the `OurInterface` contract
    @hypothesis.given(st.data())
    def test_a(self, data):
        obj = self.construct_object(data)
        # rest of test

# Several test cases creating different objects
class TestCase1(Assertions):
    def construct_object(self, data):
        ...

This was a way for us to easily test multiple aspects of our interface contract given a single construct_object implementation, but I wonder whether the inheritance is somehow "entangling" some hypothesis internal state, causing this unreproducible health check failure.

Any advice on how to debug this would be much appreciated! I will (of course) continue pursuing this on our side as well.

alanhdu avatar Aug 24 '22 19:08 alanhdu

Hmm. The seed should reproduce it, even for a healthcheck that analyses multiple examples - we're careful to always use our internal PRNG with that known seed. Instead of turning up the number of examples (if the healthcheck doesn't fire in 100 it's probably not going to), I'd try just scanning through seeds upwards from zero looking for one that does reproduce.
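
For example, something along these lines (a rough sketch; it assumes pytest plus Hypothesis's pytest plugin for the --hypothesis-seed option, and test_file.py is a placeholder for your real test file):

import subprocess
import sys

# Rough sketch: scan seeds upwards from zero until one of them reproduces
# the health-check failure, then report it.
for seed in range(10_000):
    result = subprocess.run(
        [sys.executable, "-m", "pytest", "test_file.py", f"--hypothesis-seed={seed}"],
        capture_output=True,
        text=True,
    )
    if "FailedHealthCheck" in result.stdout or "FailedHealthCheck" in result.stderr:
        print(f"reproduced with --hypothesis-seed={seed}")
        break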

Your test does look like it's got firmly-bounded size, so I'd guess that you do indeed have some weird state-pollution thing going on. That said, I'd be pretty surprised if you can trigger such problems via our public API, and of course if that is the case we'll consider it a serious bug. (practically, once I'm back from no-internet-access vacation around Sep 7)

https://github.com/pytest-dev/pytest-randomly might also be helpful for understanding which tests might interact, if you wrap some tooling around to bisect.

Zac-HD avatar Aug 25 '22 06:08 Zac-HD

Instead of turning up the number of examples (if the healthcheck doesn't fire in 100 it's probably not going to), I'd try just scanning through seeds upwards from zero looking for one that does reproduce.

This sounds like a good idea. I'll dig in and try this out.

That said, I'd be pretty surprised if you can trigger such problems via our public API, and of course if that is the case we'll consider it a serious bug. (practically, once I'm back from no-internet-access vacation around Sep 7)

Good to know -- I'll continue working on how to reproduce + shrink this test failure.

(practically, once I'm back from no-internet-access vacation around Sep 7)

No worries -- I'm about to go on leave for about a month, so I'm not in a rush here. I'll let you know if/when I get more information here.

alanhdu avatar Aug 26 '22 21:08 alanhdu

Ok -- I was finally able to reproduce this and have a somewhat minimal example. If you save the file at https://github.com/alanhdu/hypothesis_3446/blob/591e2efc1e3241f72c2a01352cc3e7e7a116f7ab/test_hypothesis.py as test_hypothesis.py and then execute the following bash script:

set -ex

for i in {1..100}
do
    echo $i
    pytest test_hypothesis.py
done

I can reliably reproduce the error. (I put up a repository at https://github.com/alanhdu/hypothesis_3446 with exact reproducibility instructions using a Docker image if that helps).

I will continue to try to minimize the error, but this error seems weird in a couple of ways:

  • Like I said, the --hypothesis-seed=... flag does not seem to actually reproduce the error
  • Deleting any one of the tests causes this error to not show up, even though the actual error always shows up in Test_B. Running Test_B by itself does not seem to be enough to reproduce this error.
  • The error seems to be very reliant on the exact sequence of draws -- I've tried minimizing the set of draws down further without a huge amount of success (although maybe I just need to run more than 100 iterations?).

I will keep trying to pin down the exact problem on my side as well, but hopefully this is enough breadcrumbs to confirm there is a problem here!

alanhdu avatar Sep 29 '22 23:09 alanhdu

Hi @Zac-HD -- I was wondering whether you had time to check out https://github.com/alanhdu/hypothesis_3446? I tried taking a look at hypothesis and admittedly did not get very far (the error seems to be fairly deep in the guts of conjecture, which I'm not very familiar with).

alanhdu avatar Oct 17 '22 18:10 alanhdu

Not yet, sorry! Getting https://github.com/Zac-HD/hypofuzz off to a good start has taken most of my free time lately 😅

Zac-HD avatar Oct 24 '22 04:10 Zac-HD

No worries! That seems like a big, exciting project.

I got some more time at work to look into this and have a couple of (hopefully helpful) observations:

  1. hypothesis.seed seems to be working weirdly? If I manually set the seed within hypothesis, either in get_random_for_wrapped_test like:
--- a/hypothesis-python/src/hypothesis/core.py
+++ b/hypothesis-python/src/hypothesis/core.py
@@ -460,6 +460,8 @@ def get_random_for_wrapped_test(test, wrapped_test):
     settings = wrapped_test._hypothesis_internal_use_settings
     wrapped_test._hypothesis_internal_use_generated_seed = None
 
+    return Random(120782952025811219996508021252125052458)
     if wrapped_test._hypothesis_internal_use_seed is not None:
         return Random(wrapped_test._hypothesis_internal_use_seed)
     elif settings.derandomize:

or in the ConjectureRunner like:

--- a/hypothesis-python/src/hypothesis/internal/conjecture/engine.py
+++ b/hypothesis-python/src/hypothesis/internal/conjecture/engine.py
@@ -92,7 +92,8 @@ class ConjectureRunner:
         self.finish_shrinking_deadline = None
         self.call_count = 0
         self.valid_examples = 0
-        self.random = random or Random(getrandbits(128))
+        self.random = Random(120782952025811219996508021252125052458)
+
         self.database_key = database_key
         self.ignore_limits = ignore_limits

then I can reproduce this 100% of the time. But if I use hypothesis.seed instead:

--- a/test_hypothesis.py
+++ b/test_hypothesis.py
@@ -6,6 +6,7 @@ class BasicAssertions:
         return
 
     @hypothesis.given(data=st.data())
+    @hypothesis.seed(120782952025811219996508021252125052458)
     @hypothesis.settings(max_examples=20)
     def test_1(self, data: st.DataObject):
         self.create(data)

this does not reproduce! I've also checked that self.random.getstate() == Random(120782952025811219996508021252125052458).getstate() when I set @hypothesis.seed, so the seed is getting "plumbed through" correctly.

To me, this implies that @hypothesis.seed actually influences the order of Random calls that we make, which is probably why we couldn't reproduce this in the first place.

  2. This may be a problem with the database? If I set the seed in hypothesis/internal/conjecture/engine.py, then I can do:
$ rm -rf .hypothesis
$ py.test test_hypothesis.py  # passes!
$ py.test test_hypothesis.py  # fails!

so there's some kind of on-disk state that may be causing the actual error as well.

  3. I cannot seem to reproduce this with my minimal reproduction if I set hypothesis.settings(database=None), which is maybe another pointer that there's some bad interaction with the example database here (that setting is sketched below).
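
For reference, disabling the database on a single test looks roughly like this (a minimal sketch; the test body is hypothetical):

import hypothesis
from hypothesis import strategies as st

# Sketch: opt one test out of the example database entirely, so no on-disk
# state from previous runs can be replayed into it.
@hypothesis.settings(max_examples=20, database=None)
@hypothesis.given(data=st.data())
def test_without_database(data):
    data.draw(st.integers())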

alanhdu avatar Oct 24 '22 19:10 alanhdu

I've done a little more digging and think there are two (or three) potential interrelated problems here:

  1. When hypothesis.seed is set, then hypothesis.settings(database=None) is also implicitly set, which explains why I can't reproduce this with hypothesis.seed. The current behavior makes sense, since it removes an implicit dependency on file-system state (which can't be controlled by hypothesis), but it's still an external source of irreproducibility. I'm not sure what to do here -- is it possible to also encode the database as a base64 string? Alternatively, maybe just mentioning it in the same message that displays the hypothesis.seed would be a good enough pointer.

  2. In general, it looks like database_key is set by calling function_digest to "hash" the function. That's clever, but it also means that this situation (same function, different class) is basically a guaranteed collision. I think this can be solved by changing:

https://github.com/HypothesisWorks/hypothesis/blob/47b35ce128376d362c636d76479b68f2524c37d6/hypothesis-python/src/_hypothesis_pytestplugin.py#L266-L268

In the same way that we add the item.nodeid to the hash stream for parametrized functions, maybe we can do the same for methods? That would involve changing that if statement to if item.get_closest_marker("parametrize") is not None or item.cls is not None:. Would there be any downside to that? (A small sketch of the collision itself follows this list.)

  3. Then there's still the problem of why such a collision produces the error at all -- function_digest is explicitly documented to not guarantee uniqueness, which implies that things should still work even with this collision (although I can't say I've managed to reproduce this yet with database=None set). I'm still a little stumped on what's going on here -- the error seems to come from the "shrinking" phase of example generation.
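
To make point 2 concrete, here's a tiny sketch of the collision (function_digest lives in hypothesis.internal.reflection, an internal module, and the classes below are stand-ins):

from hypothesis.internal.reflection import function_digest

class Base:
    def test_a(self):
        ...

class Child1(Base):
    pass

class Child2(Base):
    pass

# Both subclasses inherit the *same* function object from Base, so its
# digest -- and therefore the database key derived from it -- is identical
# for both test classes.
assert function_digest(Child1.test_a) == function_digest(Child2.test_a)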

@Zac-HD -- do you have thoughts on what to do next on any of these issues? And in particular, would you prefer if I split this into multiple issues to discuss them separately?

alanhdu avatar Oct 24 '22 21:10 alanhdu

Hi @alanhdu, my analysis:

I think you're correct that the problem stems from database key collisions. In particular, the tests effectively draw different amounts of data, and when one test replays a stored example from another instance which draws less, the replay is reported as OVERRUN. There are probably other problems (the non-reproducibility), but those are not as visible.

I don't think the proposed change (adding cls to hash) will be the solution though - the class at construction time is BasicAssertions, and it's only later that the real test class is found in the self parameter.

Also, it's a special case of something that will always be problematic: dynamic behaviour outside of hypothesis' control, here driven by the test runner directly - e.g., which subclass calls test_1. Bringing it back under hypothesis' control can be done either via pytest/plugin (parametrize over subclasses) or directly (st.sampled_from over the subclasses); both are sketched below.

Example:

@pytest.mark.parametrize("cls", BasicAssertions.__subclasses__())
@hypothesis.given(data=st.data())
@hypothesis.settings(max_examples=20)
def test_1(cls: BasicAssertions, data: st.DataObject):
    cls().create(data)
    data.draw(st.lists(st.integers(1, 5), min_size=2, max_size=5))
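
And the direct alternative, drawing the subclass inside the test (again just a sketch):

@hypothesis.given(data=st.data())
@hypothesis.settings(max_examples=20)
def test_1(data: st.DataObject):
    cls = data.draw(st.sampled_from(BasicAssertions.__subclasses__()))
    cls().create(data)
    data.draw(st.lists(st.integers(1, 5), min_size=2, max_size=5))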

So yes, I think that database key collisions are problematic for db-example replays, if the colliding entries have different behaviour.

jobh avatar Jul 16 '23 08:07 jobh

To summarize, and some action points for this issue:

Database key collisions are possible, and lead to examples being replayed for the wrong test. This is usually not a big problem, just wasted work; but in this case it leads to a spurious health check failure.

  1. Unconditionally mixing fun.__qualname__ into the db-key will make this slightly less probable, with little adverse effect on key stability. It will not fix this issue, though, since the test function is defined on the parent class. (A generic sketch of the idea follows this list.)
  2. Disable this health check while replaying examples. Maybe others, too?
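
Regarding (1), the idea is only to feed __qualname__ into the digest alongside whatever is hashed already -- roughly like this generic sketch, which is not the actual hypothesis implementation:

import hashlib
import inspect

def sketch_digest(fn) -> bytes:
    h = hashlib.sha384()
    h.update(inspect.getsource(fn).encode())  # stand-in for what is hashed today
    h.update(fn.__qualname__.encode())        # the proposed extra ingredient
    return h.digest()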

Item (2) above will make the error go away for the present issue by hiding the underlying db-key collisions. That's ok since they are fairly benign. But we could add a new health check to warn in at least some of these cases.

I can try to create a PR for this, if @Zac-HD can point me to a clean way to achieve (2), disable-health-check.

...it also occurs to me, is it possible at present to set a stable test id (db-key) explicitly through settings or somesuch?

jobh avatar Aug 13 '23 08:08 jobh

@Zac-HD Turns out that fixing the overrun issue takes more thought than just disabling health checks during replay. Disabling the replay does fix it, so I think much of the analysis above is on the right track, but the actual overrun happens later. I won't be able to follow up on this for the moment.

jobh avatar Sep 03 '23 11:09 jobh

I'm pretty sure this is fixed by #3862, at last!

Zac-HD avatar Jan 30 '24 06:01 Zac-HD