Diagnose slow hypothesis tests
> We're actually running low on important semantic issues, so I'd probably turn next to performance issues - there are some very slow tests that I've just skipped because they take several minutes, or hours, or don't finish at all within the 6h Actions time limit. Current commit with those is https://github.com/HypothesisWorks/hypothesis/pull/4034/commits/bb432101541c618762b1024b10e277658105baf8

*Originally posted by @Zac-HD in https://github.com/pschanely/CrossHair/issues/292#issuecomment-2289529225*
@Zac-HD Done with one round of investigations and fixes! (hypothesis-crosshair 0.0.13 and crosshair-tool 0.0.68) Breaking up the tests into categories:
`minimize()` tests - should we never be running these with CrossHair? I see we try to avoid running it here, although some of the tests that are timing out appear to have `_current_build_context.value == None` under the crosshair testing profile. Affected tests:

- `test_can_generate_both_zeros_when_in_interval`
- `test_mutual_recursion`
- `test_hashable_type_unhashable_value`
- many tests in `tests/nocover/test_simple_numbers.py`
Nested tests (used to time out, but now CrossHair crashes fast):
- `test_nesting_with_control_passes_health_check`
Just plain better, due to recent updates:
- `test_drawing_many_near_boundary` (though still slowish)
- `test_lookup_knows_about_all_core_strategies`
- `test_fullmatch_generates_example`

This gets a solver unsat here; ideally this will raise `hypothesis.errors.BackendCannotProceed` once that's available ... and would keep trying:

- `test_hard_to_find_single_example`
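As a rough sketch of the shape that could take on the backend side (the solver interface below is hypothetical; `BackendCannotProceed` and its `scope` argument did later land in `hypothesis.errors`, per the action items further down):

```python
from hypothesis.errors import BackendCannotProceed

def draw_from_solver(solver, constraints):
    # Hypothetical backend helper: `solver` and its methods are illustrative
    # stand-ins, not a real CrossHair or Hypothesis API.
    if solver.check(constraints) == "unsat":
        # Tell Hypothesis this backend cannot produce (more) examples, so it
        # can fall back to its default generator and keep trying.
        raise BackendCannotProceed(scope="exhausted")
    return solver.model_value()
```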
For me to continue to investigate:
- tests in `tests/nocover/test_health_checks.py` - not sure I'll be able to get `too_large_strategy` to work, but will poke at it a little. The other tests complete after ~minutes.
- tests in `tests/nocover/test_type_lookup_forward_ref.py` - no longer time out, but now yield a strange `SimpleDict object is not callable` error.
Sweet! I'll update shortly and see how it goes 😁
> `minimize()` tests - should we never be running these with CrossHair? I see we try to avoid running it here, although some of the tests that are timing out appear to have `_current_build_context.value == None` under the crosshair testing profile.
That skip is specific to running `minimal(...)` *inside* `@given()`, i.e. instead of crashing with the nested-tests error we just skip that test. On investigation there's a very obvious guess as to why it's super slow, too: `minimal()` has to find a counterexample before it can shrink that, and if Crosshair happens to not find one we can spend an awfully long time going through the `max_examples=50_000` (!!!) iterations.
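For context, a toy sketch of the find-then-shrink shape described above (this is not Hypothesis's actual test helper, just an illustration built on the public `hypothesis.find` API):

```python
from hypothesis import find, settings, strategies as st

def minimal(strategy, condition=lambda x: True):
    # Toy stand-in for the test-suite helper: find() must first discover
    # *some* example satisfying `condition` - the phase a backend can stall
    # in - before it can shrink it to a minimal one.
    return find(
        strategy,
        condition,
        settings=settings(max_examples=50_000, database=None),
    )

# Should shrink to exactly 10, but only after the search phase succeeds.
assert minimal(st.integers(), lambda x: x >= 10) == 10
```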
My action items:
- [x] turn that down globally, maybe even lower for crosshair, mark affected tests with `Why.undiscovered` to investigate later
- [x] unskip tests we expect are fixed now, update deps, and trigger CI
- [x] maybe hack together a spike for `hypothesis.errors.BackendCannotProceed`?
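A heavily simplified sketch of the engine-side half of such a spike (all names here are hypothetical stand-ins; the real integration lives inside Hypothesis's conjecture engine):

```python
from hypothesis.errors import BackendCannotProceed

def run_one_test_case(backend, fallback_backend, execute):
    # `backend`, `fallback_backend`, and `execute` are illustrative, not
    # real Hypothesis internals.
    try:
        execute(backend)
    except BackendCannotProceed:
        # The alternative backend gave up (e.g. solver unsat): rather than
        # erroring out or spinning, fall back to random generation.
        execute(fallback_backend)
```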
> That skip is specific to running `minimal(...)` *inside* `@given()`, i.e. instead of crashing with the nested-tests error we just skip that test. On investigation there's a very obvious guess as to why it's super slow, too: `minimal()` has to find a counterexample before it can shrink that, and if Crosshair happens to not find one we can spend an awfully long time going through the `max_examples=50_000` (!!!) iterations.
Ah! I get it now. So I'll continue to look at those too.
Very hacky, but... done. Well, maybe, I'll see what CI thinks after I sleep 😄
OK, there are quite a few failures but I did mess around with the internals a bit and overall it's looking like an improvement. Quick triage:
- [x] (Zac) pull out and merge the cleanups to `minimal()` etc: https://github.com/HypothesisWorks/hypothesis/pull/4090
- [x] (Zac) extract and test the `BackendCannotProceed` changes in a separate PR
- [x] (Phillip) fix `TypeError: unhashable type: 'tzutc'` (`'Rule'`, etc) regressions, failing on the Crosshair backend but not Hypothesis
- [ ] (Phillip) `Unsatisfiable: Could not find any examples from floats() that satisfied lambda x: x + 1 == x and not math.isinf(x)`
- [x] (???) strategy reprs fail to omit default values under crosshair, e.g. here; unclear where that should be fixed. Might be me in `LazyStrategy.__repr__`, due to `if k not in sig.parameters or v is not sig.parameters[k].default`?
I got sidetracked on a few subtle-but-important issues this week, but this is still a big priority! I'll be looking at these in the next few days.
0.0.70 fixes that ugly regression with unhashables!
The following two issues are going to take me a while, but, honestly, might legitimately be the next things to look at.
- [ ] (Phillip) `Unsatisfiable: Could not find any examples from floats() that satisfied lambda x: x + 1 == x and not math.isinf(x)`

  I'll need to tackle the issue of real float semantics for this one. High value, ~high effort. (A worked example follows this list.)

- [ ] (???) strategy reprs fail to omit default values under crosshair, e.g. here; unclear where that should be fixed. Might be me in `LazyStrategy.__repr__`, due to `if k not in sig.parameters or v is not sig.parameters[k].default`?
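The property in that first item is genuinely satisfiable under IEEE-754 double semantics; it only looks unsatisfiable to a solver that models floats as mathematical reals. A quick self-contained check:

```python
import math

# Above 2**53, the gap between adjacent doubles exceeds 1, so adding 1 is
# absorbed by rounding: x + 1 == x for a finite, non-inf x.
x = 1e17
assert x + 1 == x and not math.isinf(x)
```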
Your guess is right - the failing int bound is a symbolic. I have a branch on the backburner for more accurate identity semantics. I hadn't thought about faithful integer intern'ing as part of that (the biggest wins relate to avoiding path forks on enums and optional values), but I think I could make it happen.
As for whether the identity check on the hypothesis side is objectively correct, I'm unsure. Integer intern'ing isn't guaranteed across implementations. But in order for your test to fail, I think integer intern'ing would have to switch up mid-execution, which seems pretty unlikely.
For the identity check, I'm not (deliberately) checking interned integers, but rather whether the value is the default argument - so that we show even the default value if it was explicitly passed. Integers usually defeat that, but after a quick investigation I've classified this as "not expected to pass under crosshair", so we're done.
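To make that concrete, here's a toy version of the check (simplified from the `LazyStrategy.__repr__` condition quoted above; the `bounded` constructor is purely illustrative):

```python
import inspect

def bounded(min_value=0, max_value=100):
    """Illustrative stand-in for a strategy constructor."""

SIG = inspect.signature(bounded)

def kwargs_to_show(kwargs):
    # Omit a kwarg only if it *is* (identity, not equality) the declared
    # default object, so that an explicitly passed default is still shown.
    return {
        k: v
        for k, v in kwargs.items()
        if k not in SIG.parameters or v is not SIG.parameters[k].default
    }

# On CPython, small ints are interned, so an explicitly passed 0 *is* the
# default object and gets omitted anyway ("integers usually defeat that").
print(kwargs_to_show({"min_value": 0}))  # -> {}
# A symbolic crosshair int is a distinct object, so it fails the identity
# check and the repr stops omitting defaults - hence the backend difference.
```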
Hypothesis now tests against Crosshair in our own CI runs, which will catch breaking changes on the Hypothesis side before we merge the PR, and on the Crosshair side from weekly dependency updates 🎉
We do still have ~20 tests which are skipped for performance reasons though; you can find them by searching for `settings._current_profile == "crosshair"`. The relevant commit is https://github.com/HypothesisWorks/hypothesis/commit/86e94ae0aead812d91c01628fe04a5965cf6f66b, if that helps.
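Those skips presumably look something like the following (marker placement and reason strings vary by test):

```python
import pytest
from hypothesis import settings

@pytest.mark.skipif(
    settings._current_profile == "crosshair",
    reason="takes minutes-to-hours under the crosshair backend",
)
def test_some_expensive_property():
    ...
```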
> Hypothesis now tests against Crosshair in our own CI runs, which will catch breaking changes on the Hypothesis side before we merge the PR, and on the Crosshair side from weekly dependency updates 🎉
SO happy about this!
> We do still have ~20 tests which are skipped for performance reasons though; you can find them by searching for `settings._current_profile == "crosshair"`. The relevant commit is HypothesisWorks/hypothesis@86e94ae, if that helps.
Yup, perfect; I will find some time to review, but maybe not until next week. At the very least, it would be nice to be able to point to some hypothesis or crosshair ticket, or an explanation of why it should be slow on CrossHair.