Further improve support for symbolic execution
Following https://github.com/HypothesisWorks/hypothesis/issues/3086 and https://github.com/HypothesisWorks/hypothesis/pull/3806, you can `pip install hypothesis[crosshair]`, load a settings profile with `backend="crosshair"`, and CrossHair will use the full power of the Z3 SMT solver to find bugs! ✨
...but seeing as this is a wildly ambitious research project, there are probably a lot of problems in our implementation, not just the code we're testing. That's why we marked the feature experimental; and this issue is to track such things so that we can eventually fix them.
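Roughly, the setup looks like this (a minimal sketch; the profile name and the example test are purely illustrative):

```python
# Sketch: opt in to the experimental CrossHair backend via a settings profile.
# Requires `pip install hypothesis[crosshair]`; the profile name is arbitrary.
from hypothesis import given, settings, strategies as st

settings.register_profile("crosshair", backend="crosshair")
settings.load_profile("crosshair")


@given(st.integers())
def test_never_equals_123456789(x):
    assert x != 123456789  # an SMT solver can find this needle directly
```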
- [x] upgrade our observability reports to note which backend was in use for each test case (`how_generated`), if we're configured to use a non-default backend. See also #3845.
- [x] auto-update the version bounds for the crosshair extra, like we do for `tzdata`
- [x] propose integrating `hypothesis.target()` so we can experiment with Z3 for optimization (see the sketch after this list) - https://github.com/pschanely/hypothesis-crosshair/issues/3
- [x] run lots of property-based tests using the `crosshair` backend, to check for internal errors (it's OK if it's unhelpful)
  - test suites to try
    - [x] all of our own tests; some failures are from testing implementation details, but they're right there (`./build.sh check-crosshair-cover -- -Wignore -x` to run until first failure, and there are plenty of them)
      - [x] see many failures in https://github.com/HypothesisWorks/hypothesis/pull/4034
      - [x] run them in CI, once they're working
    - [x] hypothesis-jsonschema - see https://github.com/pschanely/hypothesis-crosshair/issues/2 for updates
  - known issues
    - [x] `crosshair-tool` pins to an old version of `z3-solver`, which raises `DeprecationWarning: pkg_resources is deprecated as an API`. Fixed in https://github.com/Z3Prover/z3/pull/6949 but upgrading is presumably annoying - see https://github.com/pschanely/CrossHair/issues/248.
    - [x] internal assertion failure in Hypothesis: `ConjectureData` object not frozen before calling `.as_result()` (seems like this might have been due to a `set` error now fixed in crosshair)
    - [x] https://github.com/pschanely/hypothesis-crosshair/issues/1 - `ContractRegistrationError` from monkeypatching the `time` module (see comment)
    - [x] json serialization error with `HYPOTHESIS_EXPERIMENTAL_OBSERVABILITY=1`; might require changes to the materialization hook? See https://github.com/HypothesisWorks/hypothesis/issues/3914#issuecomment-2045735039.
- [ ] some final integration cleanups
  - [x] diagnose very slow tests https://github.com/pschanely/CrossHair/issues/301 (some may already work?) [done in #4334]
  - [ ] https://github.com/pschanely/CrossHair/issues/332, then remove skips
  - [ ] https://github.com/pschanely/hypothesis-crosshair/issues/27, then remove skips
  - [ ] investigate xfailed tests
    - [ ] `Why.symbolic_outside_context` is expected, but we might be able to adjust some tests to avoid it
    - [ ] `Why.nested_given` - we can raise a more helpful error on entering the inner `@given` if `backend="crosshair"`, regardless of the suppress_health_check setting
    - [ ] `Why.other` is varied; investigate and maybe recategorize. Add issue links as relevant.
    - [ ] `Why.undiscovered` is mostly for Crosshair to investigate: https://github.com/HypothesisWorks/hypothesis/issues/3914#issuecomment-2746799544
- [ ] now that we've got the basics down, consider ambitious new features
  - [ ] solving 'hard branches' for fuzzing, by using crosshair to solve for a path reaching some specific additional constraint.
    - [ ] easy mode: just create a high-coverage seed corpus; we can already mostly do this with Hypothesis' default blackbox backend, and then warm-start the solver: https://github.com/pschanely/hypothesis-crosshair/issues/26
  - [ ] "assimilating" examples: go from an end-user-reported failing input, back to the IR using Crosshair, and then we can shrink and explain it with all Hypothesis' normal tools.
    - [ ] Crosshair supporting `target()` might give us basically all the tools we need for this in cases with a clear distance metric or metrics; otherwise I guess we might want to drive `event()` to take on each distinct possible value, and could hit both values of `gen == reported`. https://github.com/pschanely/hypothesis-crosshair/issues/3
  - [ ] Prove that the test always passes for certain subsets of possible inputs. Could be a nice efficiency win; needs deeper `Provider` integration to set/get the `.is_exhausted` attribute from non-Hypothesis backends.
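For context on the `hypothesis.target()` item above, here's a minimal example of the API we'd be wiring up; the test body is purely illustrative, and how (or whether) the solver drives it is exactly what the linked issue is about:

```python
# Minimal example of hypothesis.target(); under backend="crosshair" the hope
# is that the solver can maximize this score directly rather than by random
# search.  The property itself is made up for illustration.
from hypothesis import given, target, strategies as st


@given(st.floats(0, 100), st.floats(0, 100))
def test_sum_is_bounded(x, y):
    target(x + y, label="sum of inputs")  # ask Hypothesis to maximize x + y
    assert x + y <= 200
```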
Run with `pytest -Wignore` to avoid the z3 deprecation warning:

```python
from hypothesis import given, settings, strategies as st


@settings(backend="crosshair")
@given(st.text("abcdefg"))
def test_hits_internal_assert(x):
    assert set(x).issubset(set("abcdefg"))
```
I'll start trying out hypothesis-jsonschema next!
@pschanely I'm getting an exception when trying to serialize arguments for observability mode:
```python
# Execute with `HYPOTHESIS_EXPERIMENTAL_OBSERVABILITY=1 pytest -Wignore t.py`
from hypothesis import given, settings, strategies as st


@settings(backend="crosshair", database=None, derandomize=True)
@given(st.integers())
def test_hits_internal_assert(x):
    assert x % 64 != 4
```
Initially I thought that this was because we're only calling the post-test-case hook when the test function raised an exception, but patching that (https://github.com/HypothesisWorks/hypothesis/compare/master...Zac-HD:hypothesis:post-tc-observability-hook) still results in the same error:
File "python3.10/json/encoder.py", line 179, in default
raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type LazyIntSymbolicStr is not JSON serializable
Is there some reasonable way to support this? It seems possible in principle to serialize it, once the object (or the root nodes of the tree it's derived from) has been materialized, but you'd have a better sense of that than I would. Worst case, we can extend our internal `to_jsonable()` function to handle it somehow.
Ah, so I think this testcase dictionary can contain values derived from symbolics: at least the "representation" and "arguments" keys?
In general it's not OK to touch data derived from symbolics outside the per-run context manager; I'm not sure which solutions are feasible here. One weird thing to note about my plugin's `post_test_case_hook` - after the context manager exits, it can only produce realized values for the drawn primitives. But if you call it inside the per-test-case context manager, it can (deeply) realize any value. So if we could somehow call that on the JSON blob that we want to write out to the observability file, I think we'd be good. (I have no clue how hard that would be)
Aside: it looks to me like the "representation" string is computed before running the code under test, so it's good that it comes out symbolic. But even the generation of the string will force us into a variety of early decisions - crosshair with observability will behave very differently than without it. I assume not a big deal at this stage.
> Ah, so I think this testcase dictionary can contain values derived from symbolics: at least the "representation" and "arguments" keys?
Yep, that's right - the representation and arguments values are very directly derived from the arguments, and the features and metadata values may derive at least in part from the arguments too, via e.g. args to hypothesis.target(). The coverage (and timing) values only depend on which branches were executed, not the values, so I think they're concrete like everything else.
> In general it's not OK to touch data derived from symbolics outside the per-run context manager; I'm not sure which solutions are feasible here. One weird thing to note about my plugin's `post_test_case_hook` - after the context manager exits, it can only produce realized values for the drawn primitives. But if you call it inside the per-test-case context manager, it can (deeply) realize any value. So if we could somehow call that on the JSON blob that we want to write out to the observability file, I think we'd be good. (I have no clue how hard that would be)
Rather than the json blob itself, I think we'll aim to deep-realize the various dictionaries that become values in the json blob - that's mostly just a matter of adding and moving some hook calls. I think your current hook implementation can do this, but if we use it this way we'll need to update expectations for other libraries implementing the hook (which is fine tbc).
It might also be useful to mark the end of the "region of interest" while the context is still open; I don't want to spend time exploring values which differ only in how we do post-test serialization!
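To make "deep-realize" concrete, I'm imagining something roughly like this (a sketch only; `realize` stands in for whatever per-value hook the backend exposes):

```python
# Sketch of "deep-realizing" observability data before serialization, assuming
# a per-value `realize` callable provided by the backend (name is illustrative).
def deep_realize(value, realize):
    if isinstance(value, dict):
        return {deep_realize(k, realize): deep_realize(v, realize)
                for k, v in value.items()}
    if isinstance(value, (list, tuple, set, frozenset)):
        return type(value)(deep_realize(v, realize) for v in value)
    return realize(value)  # e.g. turn a LazyIntSymbolicStr into a plain str
```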
> Aside: it looks to me like the "representation" string is computed before running the code under test, so it's good that it comes out symbolic. But even the generation of the string will force us into a variety of early decisions - crosshair with observability will behave very differently than without it. I assume not a big deal at this stage.
We pre-compute this so that we can still show an accurate representation even if the test function mutates its arguments or crashes in some especially annoying way, but it seems reasonable to defer this on backends which define the post_test_case hook. On the other hand it seems like this doesn't intrinsically need to force any early decisions, if we don't touch the string until later?
> Rather than the json blob itself, I think we'll aim to deep-realize the various dictionaries that become values in the json blob - that's mostly just a matter of adding and moving some hook calls. I think your current hook implementation can do this, but if we use it this way we'll need to update expectations for other libraries implementing the hook (which is fine tbc).

> It might also be useful to mark the end of the "region of interest" while the context is still open; I don't want to spend time exploring values which differ only in how we do post-test serialization!
This is a good point. Thinking about it again, I think the ideal interface would be something where I could give ancillary context managers for wherever you manipulate symbolics - and I'd make it so that this manager would not play a part in the path search tree if the main function has completed. We'd use this to guard both the construction of the testcase data and the JSON file write. FWIW, the construction of the representation string happens to work right now, but seemingly trivial changes on either side could cause that to break if I don't have the right interpreter hooks in place. I could look into this over the weekend if we wanted.
> We pre-compute this so that we can still show an accurate representation even if the test function mutates its arguments or crashes in some especially annoying way, but it seems reasonable to defer this on backends which define the post_test_case hook. On the other hand it seems like this doesn't intrinsically need to force any early decisions, if we don't touch the string until later?
First step is probably just to get things not crashing. Longer term, I think observability is honestly pretty interesting from a crosshair perspective, but probably only if it doesn't change the path tree. That won't happen under the current setup; although we have a symbolic string, it will likely have a realized length, so we're exploring paths with early decisions like "integers in the 100-199 range."
I'm likely to dig into this (or support someone else to) at the PyCon sprints in May; so long as we're ready by then there's no rush for this weekend.
And yeah, the length thing makes sense as an implementation limitation. I guess doing a union of lengths just has too much overhead?
> And yeah, the length thing makes sense as an implementation limitation. I guess doing a union of lengths just has too much overhead?
Most of the meaningful decisions in CrossHair are about the tradeoff between making the solver work harder vs running more iterations. Sequence solving tanks the solver pretty fast, and it's largely avoided in the current implementation. In its ultimate form, CrossHair would have several strategies and be able to employ them adaptively (a single length, a bounded union of specific lengths, the full sequence solver, ...); but I'm not there yet.
Regardless, it would be foolish of me to try and guarantee that certain kinds of symbolic operations will never introduce a fork in the decision tree. 😄
> This is a good point. Thinking about it again, I think the ideal interface would be something where I could give ancillary context managers for wherever you manipulate symbolics - and I'd make it so that this manager would not play a part in the path search tree if the main function has completed. We'd use this to guard both the construction of the testcase data and the JSON file write.

> I'm likely to dig into this (or support someone else to) at the PyCon sprints in May; so long as we're ready by then there's no rush for this weekend.
Ok, I'm not positive about whether this is really the right strategy, but in v0.0.4 I've added a `post_test_case_context_manager()` in addition to the existing `per_test_case_context_manager()`; my hope is that we can use this whenever we want to perform operations involving potential symbolics. (I am unclear on how feasible it would be to employ on the hypothesis side, though.)
You cannot enter the `post_test_case_context_manager` inside the `per_test_case_context_manager` currently. I could permit this if we want, but I'd call out that decisions made in those cases will still grow the search tree.
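For concreteness, the intended call pattern on the Hypothesis side might look roughly like this (the two context-manager names come from hypothesis-crosshair v0.0.4; everything else is a stand-in for Hypothesis internals):

```python
# Rough sketch of the intended call pattern; FakeProvider stands in for a real
# backend provider, and the test/observability steps are placeholders.
from contextlib import nullcontext


class FakeProvider:
    per_test_case_context_manager = staticmethod(nullcontext)
    post_test_case_context_manager = staticmethod(nullcontext)


provider = FakeProvider()

with provider.per_test_case_context_manager():
    pass  # run the user's test function; symbolic execution happens here

with provider.post_test_case_context_manager():
    testcase = {"representation": "...", "arguments": {}}  # may touch symbolics
    # ...then serialize `testcase` to the observability file
```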
If I can make a small UX suggestion, it would be great if the choice of backend could be made by config variable (with a pytest flag as a really nice-to-have) instead of having to set it with `@settings`, e.g.:
```shell
$ pytest ... --hypothesis-backend crosshair ...
# OR
$ HYPOTHESIS_BACKEND=crosshair pytest ...
```
Personally, I have done a lot of work with both fuzzing and formal verification. What works very well for me is to use fuzzing while prototyping property tests (or otherwise making code changes), since it's fast, has a defined timeline, gives quick feedback, and is a great way to find easy bugs quickly; then pivot to a formal verification backend, without changing the test cases, to validate the properties in a much stronger way (which usually takes much longer to execute and might be done in CI).
It still seems like early days for this feature, but as someone who will likely expose it to developers (who may not know how to use formal proving effectively themselves), I am very excited about what it can be used for.
You can already do this for yourself using our settings profiles functionality and a small pytest plugin in conftest.py! We should consider upstream support of some kind once it's stable too 🙂
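For example, something like this in conftest.py should work today (a sketch; the environment variable name is your choice - only the settings-profile machinery and the existing `--hypothesis-profile` pytest flag are real Hypothesis features):

```python
# conftest.py - sketch: register a "crosshair" profile, then select it with
# `pytest --hypothesis-profile=crosshair`, or via an environment variable of
# your own choosing (HYPOTHESIS_BACKEND here is made up, not a real setting).
import os

from hypothesis import settings

settings.register_profile("crosshair", backend="crosshair")

if os.environ.get("HYPOTHESIS_BACKEND") == "crosshair":
    settings.load_profile("crosshair")
```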
@Zac-HD crosshair is finally on the latest z3! I think we may also need some sort of issue for figuring out what to do with settings; I think crosshair too easily runs afoul of `HealthCheck.too_slow` and deadline limits.
@fubuloubu I've added an explicit recipe in the hypothesis-crosshair readme for running all tests under crosshair. BTW, if you've tried it, I'd also love to hear from you directly about what is (and isn't) working for you!
Plausibly we could disable all timing-related health checks when running with other backends, or provide an API for backends to (1) opt out of health checks, or (2) modify them by changing their limits.
Instead of more interaction between settings, can we just add those to the example configuration?
> Instead of more interaction between settings, can we just add those to the example configuration?
Yup, I think that's fine, especially as we're getting started.
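So the example configuration might grow to something like this (a sketch; the specific choices are illustrative rather than a recommendation):

```python
# Sketch of a more forgiving crosshair profile, since solver-backed runs are
# slow enough to trip the too_slow health check and the per-example deadline.
from hypothesis import HealthCheck, settings

settings.register_profile(
    "crosshair",
    backend="crosshair",
    deadline=None,
    suppress_health_check=[HealthCheck.too_slow],
)
```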
Per https://github.com/pschanely/hypothesis-crosshair/issues/21, let's add a special exception - e.g. `hypothesis.errors.BackendCannotProceed` - which backends can raise to signal that they can't finish the current example. This should be treated like `UnsatisfiedAssumption`, but importantly it can be raised at any time we interact with the backend `PrimitiveProvider` - including e.g. when trying to realize a failing example.
WIP over in https://github.com/HypothesisWorks/hypothesis/pull/4092
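To illustrate the intended contract (a sketch only - the provider method names here are illustrative, and the import assumes the exception from the WIP PR lands as `hypothesis.errors.BackendCannotProceed`):

```python
# Hypothetical backend sketch: signalling that the solver cannot finish this
# test case.  Only the BackendCannotProceed / UnsatisfiedAssumption semantics
# come from the proposal above; the toy provider below is made up.
from hypothesis.errors import BackendCannotProceed


class ToySolverProvider:  # a real backend would subclass PrimitiveProvider
    def __init__(self, budget_ms=100):
        self.budget_ms = budget_ms

    def draw_integer(self, min_value=None, max_value=None):
        if self.budget_ms <= 0:  # e.g. solver timeout, or path proven infeasible
            # Treated like UnsatisfiedAssumption: Hypothesis abandons this
            # example instead of reporting an internal error.
            raise BackendCannotProceed("solver ran out of budget")
        self.budget_ms -= 10
        return min_value if min_value is not None else 0
```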
@pschanely you may also want to investigate the `Why.undiscovered` tests, marked mostly in https://github.com/HypothesisWorks/hypothesis/commit/755486ab9ea6a5c56962f20474b92201714a489d - it's a collection of cases where Hypothesis' heuristics find a failing example, but Crosshair doesn't - or at least doesn't reliably.
> @pschanely you may also want to investigate the `Why.undiscovered` tests, marked mostly in 755486a - it's a collection of cases where Hypothesis' heuristics find a failing example, but Crosshair doesn't - or at least doesn't reliably.
I made a pass through the undiscovered ones in #4336!
Of possible interest on the hypothesis side - sometimes, a symbolic value finds its way into a `SampledFromStrategy`. Then, we try to calculate a "label" (used for spans and maybe elsewhere), which attempts to hash the symbolic, realizing it. I'm not sure whether there's an easy way to avoid this.
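One hypothetical way this can happen is a drawn (and therefore symbolic) value flowing into `sampled_from` via `flatmap`; this is a sketch of the shape of the problem, not a confirmed minimal reproducer:

```python
# Hypothetical sketch: under the crosshair backend the integer drawn first is
# symbolic, and sampled_from's label calculation then hashes the list elements,
# forcing realization.
from hypothesis import given, settings, strategies as st


@settings(backend="crosshair")
@given(st.integers().flatmap(lambda n: st.sampled_from([n, n + 1])))
def test_symbolic_reaches_sampled_from(value):
    assert isinstance(value, int)
```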