[experimental] Run crosshair in CI
See https://github.com/HypothesisWorks/hypothesis/issues/3914
To reproduce this locally, you can run `make check-crosshair-{cover,nocover,niche}` for the same command as in CI, but I'd recommend `pytest --hypothesis-profile=crosshair hypothesis-python/tests/{cover,nocover,datetime} -m xf_crosshair --runxfail` to select and run only the xfailed tests.
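For context, here's a minimal sketch of roughly what such a `crosshair` settings profile can look like - the exact options registered in this PR may differ, and the `max_examples` value is just an assumption:

```python
# Hedged sketch only - the real profile in the PR may use different options.
from hypothesis import HealthCheck, settings

settings.register_profile(
    "crosshair",
    backend="crosshair",
    max_examples=20,  # assumed: keep budgets small, symbolic execution is slow
    suppress_health_check=[HealthCheck.too_slow, HealthCheck.filter_too_much],
)
# selected at run time via `pytest --hypothesis-profile=crosshair ...`
```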
Hypothesis' problems
- Vast majority of failures are `Flaky: Inconsistent results from replaying a failing test...` - mostly backend-specific failures; we've both
  - [x] improved reporting in this case to show the crosshair-specific traceback
  - [x] got most of the affected tests passing
- [x] Invalid internal boolean probability, e.g. `"hypothesis/internal/conjecture/data.py", line 2277, in draw_boolean` with `assert p > 2 ** (-64)`, fixed in 1f845e0 (#4049)
- [x] many of our test helpers involved nested use of `@given`, fixed in https://github.com/HypothesisWorks/hypothesis/commit/3315be63163218f5b4027128e80a2b856b512fcc
- symbolic outside context
  - [x] due to charmap, fixed in https://github.com/HypothesisWorks/hypothesis/commit/48e89a6a4f920be01c6163e986dd0051541a5ac4
  - [x] due to `target()`, fixed in 85712ad (#4049)
- [x] avoid uninstalling `typing_extensions` when crosshair depends on it
- [x] tests which are not really expected to pass on other backends. I'm slowly applying a backend-specific xfail decorator to them, `@xfail_on_crosshair(...)` (sketched just after this list).
- [x] tests which expect to raise a healthcheck, and fail because our crosshair profile disables healthchecks. Disable only `.too_slow` and `.filter_too_much`, and skip remaining affected tests under crosshair.
- [x] undo some over-broad skips, e.g. various xfail decorators, pytestmarks, `-k 'not decimal'` once we're closer
- [x] provide a special exception type for when running the test or realizing values would hit a `PathTimeout`; see https://github.com/pschanely/hypothesis-crosshair/issues/21 and https://github.com/HypothesisWorks/hypothesis/issues/3914#issuecomment-2277023708
- [x] and something to signal that we've exhausted Crosshair's ability to explore the test. If this is sound, we've verified the function and can stop! (and should record that in the stop_reason). If unsound, we can continue testing with Hypothesis' default backend - so it's important to distinguish. https://github.com/HypothesisWorks/hypothesis/pull/4092
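As promised above, a minimal sketch of what a backend-specific xfail decorator along these lines could look like; the real `@xfail_on_crosshair` helper in this branch may differ, and the `Why` members and the `settings().backend` check here are assumptions for illustration:

```python
# Hedged sketch, not the actual helper from this PR.
import enum

import pytest
from hypothesis import settings


class Why(enum.Enum):
    undiscovered = "crosshair does not find the failing example within the test budget"
    not_realized = "a symbolic value was returned from provider.realize()"
    recursionerror = "RecursionError inside crosshair"
    other = "not yet diagnosed"


def xfail_on_crosshair(why: Why, *, strict: bool = True):
    def decorator(test_fn):
        on_crosshair = settings().backend == "crosshair"  # assumed way to detect the profile
        marks = [
            pytest.mark.xf_crosshair,  # lets `-m xf_crosshair --runxfail` select these tests
            pytest.mark.xfail(condition=on_crosshair, reason=why.value, strict=strict),
        ]
        for mark in marks:
            test_fn = mark(test_fn)
        return test_fn

    return decorator
```

Usage is then just `@xfail_on_crosshair(Why.undiscovered)` on the affected test, as mentioned further down.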
Probably Crosshair's problems
- [x] Repeated-registration error, see https://github.com/pschanely/hypothesis-crosshair/issues/17
- [x] `RecursionError`, see https://github.com/pschanely/CrossHair/issues/294
- [x] `unsupported operand type(s) for -: 'float' and 'SymbolicFloat'` in `test_float_clamper`
- [x] `TypeError: descriptor 'keys' for 'dict' objects doesn't apply to a 'ShellMutableMap' object` (or `'values'` or `'items'`). Fixed in https://github.com/pschanely/CrossHair/pull/269
- [x] `TypeError: _int() got an unexpected keyword argument 'base'`
- [x] Buffer not realized for hash function, fixed in https://github.com/pschanely/CrossHair/issues/272
- [x] Internal error for case-insensitive regex https://github.com/pschanely/CrossHair/issues/274
- [x] `typing.get_type_hints()` raises `ValueError`, see https://github.com/pschanely/CrossHair/issues/275
- [x] json round-trip error below
- [x] `TypeError` in bytes regex, see https://github.com/pschanely/CrossHair/issues/276
- [x] Invalid args to `provider.draw_boolean()` inside `FeatureStrategy`, see https://github.com/pschanely/hypothesis-crosshair/issues/18
- [x] Support `dict(name=value)`, see https://github.com/pschanely/CrossHair/issues/279
- [x] Error in `PurePath` constructor, see https://github.com/pschanely/CrossHair/issues/280
- [x] `zlib.compress()` not symbolic, see https://github.com/pschanely/CrossHair/issues/286
- [x] `int.from_bytes(map(...), ...)`, see https://github.com/pschanely/CrossHair/issues/291
- [x] base64 support, see https://github.com/pschanely/CrossHair/issues/293
- [ ] `TypeError: conversion from SymbolicInt to Decimal is not supported`; see also snan below
- [x] `TypeVar` problem, see https://github.com/pschanely/CrossHair/issues/292
- [ ] Crash on way-too-large integer, see https://github.com/pschanely/CrossHair/issues/285
- [x] `RecursionError` inside Lark, see https://github.com/pschanely/CrossHair/issues/297
- [ ] https://github.com/pschanely/CrossHair/issues/307
Error in `operator.eq(Decimal('sNaN'), an_int)`:

```
____ test_rewriting_does_not_compare_decimal_snan ____
  File "hypothesis/strategies/_internal/strategies.py", line 1017, in do_filtered_draw
    if self.condition(value):
TypeError: argument must be an integer
while generating 's' from integers(min_value=1, max_value=5).filter(functools.partial(eq, Decimal('sNaN')))
```
Cases where crosshair doesn't find a failing example but Hypothesis does
Seems fine, there are plenty of cases in the other direction. Tracked with `@xfail_on_crosshair(Why.undiscovered)` in case we want to dig in later.
Nested use of the Hypothesis engine (e.g. given-inside-given)
This is just explicitly unsupported for now. Hypothesis should probably offer some way for backends to declare that they don't support this, and then raise a helpful error message if you try anyway.
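One possible shape for that, sketched from the outside (the attribute name and the check location are assumptions, not Hypothesis's actual backend API):

```python
# Hypothetical sketch of a capability flag a backend could declare.
from hypothesis.errors import InvalidArgument


class SomeBackendProvider:
    supports_nested_given = False  # hypothetical capability flag
    # ... rest of the provider implementation ...


def check_nested_given_supported(provider):
    # Called (hypothetically) when an inner @given starts while another is already running.
    if getattr(provider, "supports_nested_given", True):
        return
    raise InvalidArgument(
        f"backend {type(provider).__name__!r} does not support nested use of @given; "
        "restructure the test to draw everything in a single @given, or use the default backend"
    )
```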
Here's a diff for a few of the niche failures:
```diff
diff --git a/hypothesis-python/tests/cover/test_testdecorators.py b/hypothesis-python/tests/cover/test_testdecorators.py
index 0cb9cd3c2..30af2e2e8 100644
--- a/hypothesis-python/tests/cover/test_testdecorators.py
+++ b/hypothesis-python/tests/cover/test_testdecorators.py
@@ -13,6 +13,7 @@ import threading
from collections import namedtuple
from hypothesis import HealthCheck, Verbosity, assume, given, note, reporting, settings
+from hypothesis.internal.conjecture.data import realize
from hypothesis.strategies import (
binary,
booleans,
@@ -311,7 +312,7 @@ def test_can_derandomize():
@given(integers())
@settings(derandomize=True, database=None)
def test_blah(x):
- values.append(x)
+ values.append(realize(x))
assert x > 0
test_blah()
@@ -479,7 +480,6 @@ def test_empty_lists(xs):
def test_given_usable_inline_on_lambdas():
- xs = []
- given(booleans())(lambda x: xs.append(x))()
- assert len(xs) == 2
- assert set(xs) == {False, True}
+ xs = set()
+ given(booleans())(lambda x: xs.add(realize(x)))()
+ assert xs == {False, True}
```
- [ ] `tests/cover/test_testdecorators.py::test_can_find_large_sum_frozenset` looks like potentially a crosshair weakness, but I don't know if the ir allows it to reason effectively at the set level. (we can just `skipif(CROSSHAIR)` if not).
- [ ] `tests/cover/test_testdecorators.py::TestCases::test_float_addition_cancels`: unsure. doesn't reproduce locally.
- [ ] `tests/cover/test_testdecorators.py::test_when_set_to_no_simplifies_runs_failing_example_twice` is probably hypothesis converting backend ir to a buffer.
Not sure about the many flaky recursion errors on cover and nocover. I saw this behavior locally too, where tests are fine when run in isolation, and the first n tests are also green, but at some point a switch flips and many tests start failing with this error. I wonder if crosshair has some persistent state incrementing across test runs?
test_given_usable_inline_on_lambdas is basically a failure of deduplication, where Hypothesis will run only two inputs through a test that accepts a single boolean argument. Unclear whether this matters for Crosshair; noticing that you've exhausted some state-space is a cute trick for easy problems.
I sketched out a nice system for xfailing tests under crosshair, where you can also -m xf_crosshair --runxfail to see all the failures live... and then looked at the far more numerous cover and nocover failures. Well, pushing it anyway...
@Zac-HD your triage above is SO great. I am investigating.
Knocked out a few of these in 0.0.60. I think that means current status on my end is:
- [ ] `TypeError: conversion from SymbolicInt to Decimal is not supported`
- [X] `unsupported operand type(s) for -: 'float' and 'SymbolicFloat'` in `test_float_clamper`
- [X] `TypeError: descriptor 'keys' for 'dict' objects doesn't apply to a 'ShellMutableMap' object` (or `'values'` or `'items'`).
- [X] `TypeError: _int() got an unexpected keyword argument 'base'`
- [ ] Symbolic not realized (in e.g. `test_suppressing_filtering_health_check`)
- [ ] Error in `operator.eq(Decimal('sNaN'), an_int)`
- [ ] Zac's cursed example below!
More soon.
Ah - the Flaky failures are of course because we had some failure under the Crosshair backend, which did not reproduce under the Hypothesis backend. This is presumably going to point to a range of integration bugs, but is also something that we'll want to clearly explain to users because integration bugs are definitely going to happen in future and users will need to respond (by e.g. using a different backend, ignoring the problem, whatever).
- [x] improve the reporting around `Flaky` failures where the differing or missing errors are related to a change of backend while shrinking. See also https://github.com/HypothesisWorks/hypothesis/issues/4040.
- [x] triage all the current failures so we can fix them
OK, here's a cursed one:
```python
import sys

from hypothesis import given, settings, strategies as st
from hypothesis.internal import charmap as cm


@settings(backend="crosshair")
@given(
    st.sets(st.sampled_from(cm.categories())) | st.none(),
    st.integers(0, sys.maxunicode),
    st.integers(0, sys.maxunicode),
)
def test_a(cats, m1, m2):
    m1, m2 = sorted((m1, m2))
    cm.query(categories=cats, min_codepoint=m1, max_codepoint=m2)


# test_a()


@settings(backend="crosshair")
@given(
    st.integers(0, sys.maxunicode),
    st.integers(0, sys.maxunicode),
)
def test_b(m1, m2):
    m1, m2 = sorted((m1, m2))
    cm.query(min_codepoint=m1, max_codepoint=m2)


test_b()
```
running `python demo.py` raises `HypothesisException: expected <class 'int'> from CrossHairPrimitiveProvider.realize, got <class 'crosshair.libimpl.builtinslib.SymbolicInt'>`, so it seems that the `realize()` method isn't working.
But! If I comment out `test_a` - which doesn't run - then I instead get `CrosshairInternal: Numeric operation on symbolic while not tracing`.
Most/all of the "expected x, got symbolic" errors are symptoms of an underlying error in my experience (often operation on symbolic while not tracing). In this case running with `export HYPOTHESIS_NO_TRACEBACK_TRIM=1` reveals `limited_category_index_cache` in `cm.query` is at fault.
ah-ha, seems like we might want some https://github.com/HypothesisWorks/hypothesis/pull/4029/ - style 'don't cache on backends with avoid_realize=True' logic.
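Something like the following shape, perhaps - a hedged sketch where `current_backend_avoids_realization()` is a stand-in for however Hypothesis actually exposes that flag, not a real API:

```python
import functools


def current_backend_avoids_realization():
    # Stand-in: in Hypothesis this would consult the active provider's preference
    # (the avoid_realize-style flag mentioned above).
    return False


def cached_unless_symbolic(func):
    """Cache results, except on backends whose values must not be captured."""
    cached = functools.lru_cache(maxsize=None)(func)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        if current_backend_avoids_realization():
            # Don't hash or store symbolic arguments; call through directly so the
            # cache never captures (or later replays) backend-specific symbolic values.
            return func(*args, **kwargs)
        return cached(*args, **kwargs)

    return wrapper
```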
Still here and excited about this! I am on a detour of doing a real symbolic implementation of the decimal module - should get that out this weekend.
Triaging a pile of the Flaky errors, most were due to getting a `RecursionError` under crosshair and then passing under Hypothesis - and it looks like most of those were in turn because of all our nested-`@given()` test helpers.
So I've tried de-nesting those, which seems to work nicely and even makes things a bit faster by default; and when CI finishes we'll see how much it helps on crosshair 🤞
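For anyone following along, the general shape of that de-nesting (illustrative only, not the actual commit):

```python
# Hedged sketch: helper and test names here are made up for illustration.
from hypothesis import given, strategies as st


# Before: the helper runs a second Hypothesis engine per outer example, which backends
# like crosshair can't trace through (and which is slower even on the default backend).
def assert_commutes_nested(outer_value):
    @given(st.integers())
    def inner(m):
        assert outer_value + m == m + outer_value

    inner()


@given(st.integers())
def test_addition_commutes_nested(n):
    assert_commutes_nested(n)


# After: draw everything in the single outer @given, so only one engine is involved.
@given(st.integers(), st.integers())
def test_addition_commutes(n, m):
    assert n + m == m + n
```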
Looks like string-encoding wants to receive exactly a str, meaning it crashes under crosshair:
```python
from encodings.aliases import aliases

from hypothesis import Verbosity, given, settings, strategies as st


def _enc(cdc):
    try:
        "".encode(cdc)
        return True
    except Exception:
        return False


lots_of_encodings = sorted(x for x in set(aliases).union(aliases.values()) if _enc(x))
assert len(lots_of_encodings) > 100  # sanity-check


@settings(backend="crosshair", verbosity=Verbosity.verbose)
@given(st.text(), st.sampled_from(lots_of_encodings))
def test_b(string, codec_name):
    string.encode(codec_name)
```
representative traceback:
```
Trying example: test_b(
    string='',
)
Traceback (most recent call last):
  File ".../demo.py", line 14, in test_b
    string.encode("037")
  File ".venv/lib/python3.10/site-packages/crosshair/libimpl/builtinslib.py", line 2691, in encode
    return codecs.encode(self, encoding, errors)
  File ".venv/lib/python3.10/site-packages/crosshair/libimpl/codecslib.py", line 19, in _encode
    (out, _len_consumed) = _getencoder(encoding)(obj, errors)
  File ".../3.10.11/lib/python3.10/encodings/cp037.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_table)
TypeError: charmap_encode() argument 1 must be str, not LazyIntSymbolicStr
...
hypothesis.errors.Flaky: Inconsistent results from replaying a failing test case!
  last: INTERESTING from TypeError at .../3.10.11/lib/python3.10/encodings/cp037.py:12
  this: VALID
```
It's hard to believe that it's only been a week since I opened this test pr - it's already led to multiple releases of both Crosshair and Hypothesis, as well as a pile of other cleanups to our tests!
Having gone on such a spree of test fixes, random triage, and trying to isolate nice reproducers, I think I'm going to put this down for a while and focus on other things, but @pschanely feel free to ping me or just open your own copy whenever it'd be helpful to see a fresh run with the latest Crosshair updates 🙂
(ok, got a bit nerdsniped...) Digging into this CI run, as the latest and closest-to-passing run we've seen:
- `check-crosshair-custom-cover/test_[a-d]*` - times out at 93% (six hours); 9 / 744 tests fail
  - actions: try verbose mode or bisection to identify which tests hang?
- `check-crosshair-custom-cover/test_[e-i]*` - 10m7s, 61 / 606 tests fail
  - actions: waiting for fixes above, then rerun and triage
- `check-crosshair-custom-cover/test_[j-r]*` - times out at 94% (six hours); dozens / 1040 tests fail
  - actions: verbose or bisection as above
- `check-crosshair-custom-cover/test_[s-z]*` - pytest internal error within ~20s, ?? / 689 tests fail
  - actions: see traceback below and work out where to insert the `.realize(some_report)` call
- `check-crosshair-nocover` - 2h 48m 32s (slow!), 99 / 521 tests fail
  - actions:
    - consider splitting this too, for speed. but it was ~30 minutes before, why??
    - reported https://github.com/pschanely/hypothesis-crosshair/issues/18
    - waiting for fixes above, then rerun and triage
- `check-crosshair-niche` - 12m19s, 2 / ??? tests fail
  - failing tests are xpass on Lark; there's still the array-api and numpy tests to go
  - actions: set `strict=False` for those tests, and run array-api and numpy together for efficiency
# pytest internal error on `check-crosshair-custom-cover/test_[s-z]*`
File "_hypothesis_pytestplugin.py", line 329, in pytest_runtest_makereport
("Hypothesis", "\n".join(item.hypothesis_report_information))
TypeError: sequence item 1: expected str instance, LazyIntSymbolicStr found
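The shape of the fix is presumably to realize each report entry before joining; a hedged illustration only (where exactly this belongs in `_hypothesis_pytestplugin.py`, and whether plain `str()` is the right coercion versus the provider's `realize()`, are the open questions):

```python
def join_report(report_items):
    # Coerce any symbolic entries (e.g. LazyIntSymbolicStr) to concrete str before joining.
    return "\n".join(s if isinstance(s, str) else str(s) for s in report_items)
```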
Got a good traceback for the RecursionError under crosshair 0.0.61:
```
Traceback (most recent call last):
  File ".../crosshair/libimpl/builtinslib.py", line 4271, in _dict
    if isinstance(arg, dict):
  File ".../crosshair/libimpl/builtinslib.py", line 4461, in _isinstance
    return _issubclass(type(obj), types)
  File ".../crosshair/libimpl/builtinslib.py", line 4457, in _issubclass
    return issubclass(subclass, superclass)
  File ".../crosshair/libimpl/builtinslib.py", line 4457, in _issubclass
    return issubclass(subclass, superclass)
  File ".../crosshair/libimpl/builtinslib.py", line 4457, in _issubclass
    return issubclass(subclass, superclass)
  [Previous line repeated 1980 more times]
  File ".../crosshair/libimpl/builtinslib.py", line 4432, in _issubclass
    with NoTracing():
  File ".../crosshair/tracers.py", line 431, in NoTracing
    return TraceSwap(COMPOSITE_TRACER.ctracer, True)
  File ".../crosshair/tracers.py", line 160, in __call__
    return self.trace_op(frame, codeobj, opcodenum)
  File ".../crosshair/tracers.py", line 163, in trace_op
    if is_tracing():
  File ".../crosshair/tracers.py", line 427, in is_tracing
    return COMPOSITE_TRACER.ctracer.enabled()
RecursionError: maximum recursion depth exceeded while calling a Python object
```
I don't know what the respective subclass, superclass is but that sure seems like it could do with some loop-detection :-)
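For illustration, a generic form of that kind of re-entrancy guard (hypothetical - CrossHair's interception machinery is more involved than this, and the symbolic handling is elided):

```python
_real_issubclass = issubclass
_in_progress = set()  # (id(subclass), id(superclass)) pairs currently being handled


def patched_issubclass(subclass, superclass):
    key = (id(subclass), id(superclass))
    if key in _in_progress:
        # The patch has re-entered itself on the same pair: defer to the real
        # builtin instead of recursing until the stack blows up.
        return _real_issubclass(subclass, superclass)
    _in_progress.add(key)
    try:
        # ... symbolic-aware handling would go here; plain builtin as a stand-in ...
        return _real_issubclass(subclass, superclass)
    finally:
        _in_progress.discard(key)
```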
> Got a good traceback for the `RecursionError` under crosshair 0.0.61: I don't know what the respective `subclass, superclass` is but that sure seems like it could do with some loop-detection :-)
Interesting! Can you point me at the hypothesis test that does this? Annnnd, yeah, so the interception framework is supposed to not interfere when a patch calls the function it's patching, but that's obviously not working in this case!
this CI job has 77/92 failures as recursion errors; test_interval_intersection is the one at the end of the logs
👋 just going to dump some other thoughts here quickly, please excuse brevity - more to come on the weekend
- from this CI job it looks like we might have some hangs or very slow progress in `tests/cover/test_deferred_strategies.py`; will investigate
  - just slow: `9316.72s in test_deferred_strategies.py::test_mutual_recursion`... but that's very slow
- we seem to have a lot of regex-related failures (I'm so sorry)
- this job has `test_can_generate_from_all_registered_types` failing for `UnicodeEncodeError`, `UnicodeTranslateError`, `Fraction`, `IPv4Interface`, `IPv4Network`, `IPv6Interface`, `IPv6Network`, `Rational`, `PathLike`, `Match`, and `slice`.
- lots of failing tests for `numpy` integration, see logs here
I think the regex-related stuff and LazyIntSymbolicStr are probably the next-highest impact things to fix.
> I think the regex-related stuff and `LazyIntSymbolicStr` are probably the next-highest impact things to fix.
OK! I'm hoping plugin version 0.0.9 will fix most of the LazyIntSymbolicStr errors. Can look into regexes during the week!
@pschanely I'm overall seeing fewer failing tests (🎉🎉🎉), but also this run just segfaulted maybe in _crosshair_tracers.
Also it might be nice to keep a changelog for hypothesis-crosshair 🙂
> @pschanely I'm overall seeing fewer failing tests (🎉🎉🎉), but also this run just segfaulted maybe in `_crosshair_tracers`.
Heh. I'm reasonably confident that CrossHair is not thread safe. This also doesn't repro for me immediately, but I'll play around with it.
> Also it might be nice to keep a changelog for `hypothesis-crosshair` 🙂
Haha, yes, it's time.
from test_assume_has_status_reason:
```
Traceback (most recent call last):
  File ".../crosshair/libimpl/builtinslib.py", line 900, in __abs__
    return self._unary_op(lambda v: z3.If(v < 0, -v, v))
  File ".../crosshair/libimpl/builtinslib.py", line 319, in _unary_op
    return self.__class__(op(self.var), self.python_type)
  File ".../crosshair/libimpl/builtinslib.py", line 900, in <lambda>
    return self._unary_op(lambda v: z3.If(v < 0, -v, v))
TypeError: '<' not supported between instances of 'BoolRef' and 'int'
```
whereas in Python `issubclass(bool, int)` is true, so you can indeed compare them.
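Tiny demonstration of that point:

```python
# bool is a subclass of int in Python, so booleans participate in integer comparisons,
# whereas z3's BoolRef has no such relationship with int.
assert issubclass(bool, int)
assert (True < 2) and not (True < 0)
```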
Alright - with the fantastic new stuff in crosshair==0.0.64, there are few enough failing tests that I've marked practically all of them as expected failures!
- You can see why they fail with `pytest --hypothesis-profile=crosshair hypothesis-python/tests/cover/ -m xf_crosshair --runxfail` (set profile, collect tests, select those with the xf_crosshair marker, ignore xfail marker)
- The most common reasons seem to be: (1) returning a symbolic type from `provider.realize(obj)` (which is an error, see latest CI), and (2) a `RecursionError` inside Crosshair - there's a representative traceback above. I've noted those with `Why.not_realized` and `Why.recursionerror` respectively, at least for most cases, to make finding them a bit easier; however some are probably missing because there's a fair bit of nondeterminism involved.
- I've opened a couple more issues where there was something obvious to report. I expect that re-triaging the marked tests will continue to yield new issues for quite a while though - and there are also some pretty widely-scoped skips that I put in place for e.g. decimals and numpy-related issues.
This is feeling like incredible progress overall though; imo we've gone from "neat prototype" to "usable alpha/early-beta" over the last few weeks 🤩
most of the currently failing tests look like they might be crosshair issues, cc @pschanely:
- `tests/cover/test_filter_rewriting.py::test_regex_filter_rewriting[binary(min_size=5, max_size=10)-b'ab+c'-match]` - `crosshair.statespace.NotDeterministic`
- `tests/nocover/test_explore_arbitrary_languages.py::test_explore_an_arbitrary_language` - `crosshair.statespace.NotDeterministic`
- `tests/cover/test_lookup.py::test_resolves_builtin_types[BaseException]` - `crosshair.util.CrosshairInternal: Possibly transient value found in memo` (and `[object]`)

and I'll skip the database test - crosshair just finishes exploring sooner than that test expects 😁
@pschanely huge progress from recent updates! The BackendCannotProceed mechanism entirely fixed several classes of issues, the floats changes have been great (signed zero ftw!), from_type() generates instances more often, I'm no longer skipping categories of stuff, and overall we've dropped from about +350 to +250 lines of code in this PR 🎊
At this point my only real reason to avoid merging is that crosshair updates often cause a fair bit of churn, causing some tests to start failing and some to start xpassing - it's net-good, but would be toil in our CI. I feel like we've crossed from an alpha-version which is a neat proof of concept, to a beta-version which is still early but already both useful and clearly on a path to stability and wider adoption. Incredibly excited about this ✨
If you want to pull out Crosshair issues,
- this PR is probably useful as a pre-release test, to check whether there are any regressions you didn't expect
- there's a commit marking some things that look like Crosshair bugs to me, and many more where Crosshair just doesn't find a failure that Hypothesis does (within the test budget, and which might or might not be a problem)
- there's a commit full of tests skipped because they were very slow, if you want to look at performance issues. I haven't audited it lately but would guess at least a third are still slow + also Crosshair's problem.
- the last big commit is pretty messy, probably best to ignore that for now
> @pschanely huge progress from recent updates! The `BackendCannotProceed` mechanism entirely fixed several classes of issues, the floats changes have been great (signed zero ftw!), `from_type()` generates instances more often, I'm no longer skipping categories of stuff, and overall we've dropped from about +350 to +250 lines of code in this PR 🎊
So great.
> At this point my only real reason to avoid merging is that crosshair updates often cause a fair bit of churn, causing some tests to start failing and some to start xpassing - it's net-good, but would be toil in our CI.
Frankly, I'm not sure it makes sense to block hypothesis on a crosshair-related failure, even in a very distant, stable future. Would love your ideas making the integration more "eventually" correct. Maybe a dedicated testing repo that pulls the hypothesis source and has these pytest markers externally applied? (or submodules? but those scare me)
> If you want to pull out Crosshair issues,
Always. Thanks for the commit breakdown. More updates soon!
> Frankly, I'm not sure it makes sense to block hypothesis on a crosshair-related failure, even in a very distant, stable future. Would love your ideas making the integration more "eventually" correct. Maybe a dedicated testing repo that pulls the hypothesis source and has these pytest markers externally applied? (or submodules? but those scare me)
For clarity, "blocking" would mean 'when we update our pinned dependencies, if Crosshair has changed we'll update the xfail markers accordingly and report any issues upstream, or maybe add a != requirement for that version'. Similarly, if a Hypothesis PR doesn't work with Crosshair I'd prefer to learn that at the time so I can decide to either xfail the tests, or do some extra work to support it - and my guess is that the converse would be useful for you too.
In practice I expect I'll just keep updating this PR for now, and you can grab a local copy of the branch if you want to run the tests before a Crosshair release 😁 (and note the test-selection tips at the top of the pr!)
For clarity, "blocking" would mean 'when we update our pinned dependencies, if Crosshair has changed we'll update the xfail markers accordingly and report any issues upstream, or maybe add a
!=requirement for that version'. Similarly, if a Hypothesis PR doesn't work with Crosshair I'd prefer to learn that at the time so I can decide to either xfail the tests, or do some extra work to support it - and my guess is that the converse would be useful for you too.
Fair enough! I was concerned about how much churn in CrossHair pass/fails you'll see for unrelated hypothesis changes, but it's also true that I want to know about what you see. Current plan SGTM.
> In practice I expect I'll just keep updating this PR for now, and you can grab a local copy of the branch if you want to run the tests before a Crosshair release 😁 (and note the test-selection tips at the top of the pr!)
Yup! I've been doing this a little already; works for me.
@Zac-HD I've been looking into getting this rebased against master, and I think there are at least some mainline changes that are affecting the tests. I am able to do some early triage, but hoping that you or @tybug can assist with the resolution. Would that be ok? And, do we want to work through things here? Alternatively, I guess I could be opening actual hypothesis issues saying "hey, I think this test X should work under the crosshair profile and here's why..."
Confirmed that we regressed crosshair at some point:
```python
from hypothesis import given, settings, strategies as st


@given(st.floats(min_value=0))
@settings(backend="crosshair")
def f(xs):
    pass


f()
```
```
...
  File "/Users/tybug/Desktop/Liam/coding/hypothesis/hypothesis-python/src/hypothesis/internal/conjecture/engine.py", line 1540, in cached_test_function
    result = check_result(data.as_result())
                          ^^^^^^^^^^^^^^^^
  File "/Users/tybug/Desktop/Liam/coding/hypothesis/hypothesis-python/src/hypothesis/internal/conjecture/data.py", line 2370, in as_result
    assert self.frozen
           ^^^^^^^^^^^
AssertionError
```
Will investigate (but not sure I'll have time today specifically). I think this is almost certainly our fault, not crosshair.
IMO initial triage in here is best, with the intent to only open an issue if we expect a fix to take longer than ~days.
https://github.com/HypothesisWorks/hypothesis/pull/4230 will fix the above issue!
I'm now leaning towards merging this onto master - we've got it almost-entirely-working, and have (I think correctly) used the word "regression" to describe changes which made it work less well. So having CI to let us know about those as they happen seems pretty valuable to me!
(though who knows when I'll have a day free to get this back up to date again 😅)