Migrate our core representation to an IR layer
This epic-style issue tracks our work on refactoring Hypothesis to use an IR layer in our engine.
Motivation
So far, most things in Hypothesis have been built to work at the level of a bytestream.
- Strategies draw bits from this bytestream to make choices or construct values while producing a return value (and in doing so "interpret the bytestream as a source of randomness", as the quote goes).
- Inputs to a test function are represented internally as the bytestream that, when supplied to the test function, would generate that input.
- Correspondingly, the database stores inputs as their bytestream representation.
- `DataTree`, which tracks what inputs we have previously tried in order to avoid redundancy, works at the level of blocks: logically related contiguous segments of bytes, e.g. perhaps from the same strategy.
- The shrinker tries to find the lexicographically smallest (`"" < "0" < "1" < "00" < "01" < "11"`) bytestream which is still a counterexample.
However, in many cases, a bytestream is too low-level of a representation to make intelligent decisions.
- For many strategies, the mapping of bytestream ↦ input is not injective, so the same input may have multiple bytestream representations. `DataTree` sees these as distinct inputs and can't deduplicate them. Ever wondered why we try `0` so many times for `@given(st.integers())`? It's not because we want to!
  - This is the case for anything that requires rejection sampling; in particular, for drawing a biased (p ≠ 0.5) boolean, which is something we do extensively internally. See the sketch after this list.
  - See #1574 for a manifestation.
- The shrinker has limited knowledge of the context of the bytestream it is shrinking. We do our best to give hints, for example by denoting subsets of the bytestream (called examples) as coming from a particular strategy, but it is still easy for the shrinker to try invalid inputs, and hard for it to make context-sensitive shrinks.
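To make the rejection-sampling point concrete, here is a deliberately simplified sketch (not Hypothesis' actual drawing code; the helper name is made up for illustration) of pulling a bounded integer out of a bytestream, showing two distinct bytestreams that decode to the same value:

```python
# Simplified illustration, *not* Hypothesis' actual algorithm: draw an integer
# in range(10) from a bytestream by rejecting any byte >= 250 (the largest
# multiple of 10 that fits in one byte), so the result is uniform.
def draw_int_below_10(stream: bytes) -> tuple[int, bytes]:
    for i, byte in enumerate(stream):
        if byte < 250:  # accepted
            return byte % 10, stream[: i + 1]  # (value, bytes consumed)
        # otherwise: rejected, read another byte
    raise ValueError("ran out of bytes")

# Two different bytestreams decode to the same value...
assert draw_int_below_10(b"\x03") == (3, b"\x03")
assert draw_int_below_10(b"\xff\x03") == (3, b"\xff\x03")
# ...but a tree keyed on raw bytes treats them as distinct inputs.
```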
In a completely unrelated train of thought, we would like Hypothesis to support backends: the ability to specify a custom distribution over strategies, overriding Hypothesis' pseudo-randomness. The original motivation here was supporting CrossHair (#3086), a concolic execution tool — but many other such backends are possible. (I personally have some ideas).
Happily, we can address both of these concerns with the same refactoring. That refactoring is migrating much of Hypothesis, which currently operates on bytestreams, to instead operate on an IR layer.
IR
The IR will consist of five types: `boolean`, `integer`, `float`, `string`, and `bytes`. The full interface for the IR is as follows:
```python
import abc
import math
from typing import Sequence

from hypothesis.internal.intervalsets import IntervalSet


class PrimitiveProvider(abc.ABC):
    @abc.abstractmethod
    def draw_boolean(
        self,
        p: float = 0.5,
    ) -> bool:
        ...

    @abc.abstractmethod
    def draw_integer(
        self,
        min_value: int | None = None,
        max_value: int | None = None,
        *,
        # weights are for choosing an element index from a bounded range
        weights: Sequence[float] | None = None,
        shrink_towards: int = 0,
    ) -> int:
        ...

    @abc.abstractmethod
    def draw_float(
        self,
        *,
        min_value: float = -math.inf,
        max_value: float = math.inf,
        allow_nan: bool = True,
        smallest_nonzero_magnitude: float,
    ) -> float:
        ...

    @abc.abstractmethod
    def draw_string(
        self,
        intervals: IntervalSet,
        *,
        min_size: int = 0,
        max_size: int | None = None,
    ) -> str:
        ...

    @abc.abstractmethod
    def draw_bytes(
        self,
        min_size: int = 0,
        max_size: int | None = None,
    ) -> bytes:
        ...
```
All strategies will draw from these five functions at the lowest level, rather than from a bytestream. From this, we get better `DataTree` deduplication (the mapping for arbitrary strategies is still not guaranteed to be injective, but it's much closer!), more intelligent shrinking, and backend support.
One needs only to implement these five methods to receive native Hypothesis support for shrinking, the database, targeted PBT, and anything else.
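As an illustration, here is a minimal sketch of such a backend, assuming only the `PrimitiveProvider` interface (and imports) above. The class name and fallback ranges are made up, it draws from Python's `random` module, and for brevity it ignores some constraints (`weights`, `allow_nan`, `smallest_nonzero_magnitude`) and edge cases; how a backend gets registered with Hypothesis is out of scope here.

```python
import math
import random


class RandomBackend(PrimitiveProvider):
    """Illustrative backend: every draw is delegated to the stdlib PRNG."""

    def draw_boolean(self, p: float = 0.5) -> bool:
        return random.random() < p

    def draw_integer(
        self,
        min_value: int | None = None,
        max_value: int | None = None,
        *,
        weights: Sequence[float] | None = None,
        shrink_towards: int = 0,
    ) -> int:
        if min_value is not None and max_value is not None:
            return random.randint(min_value, max_value)
        # unbounded or half-bounded: stay near shrink_towards, then clamp
        value = shrink_towards + random.randint(-100, 100)
        if min_value is not None:
            value = max(value, min_value)
        if max_value is not None:
            value = min(value, max_value)
        return value

    def draw_float(
        self,
        *,
        min_value: float = -math.inf,
        max_value: float = math.inf,
        allow_nan: bool = True,
        smallest_nonzero_magnitude: float,
    ) -> float:
        # clamp infinities to an arbitrary finite window for uniform sampling
        return random.uniform(max(min_value, -1e6), min(max_value, 1e6))

    def draw_string(
        self,
        intervals: IntervalSet,
        *,
        min_size: int = 0,
        max_size: int | None = None,
    ) -> str:
        size = random.randint(min_size, max_size if max_size is not None else min_size + 10)
        # assumes IntervalSet supports len() and indexing to codepoints
        return "".join(chr(intervals[random.randrange(len(intervals))]) for _ in range(size))

    def draw_bytes(self, min_size: int = 0, max_size: int | None = None) -> bytes:
        size = random.randint(min_size, max_size if max_size is not None else min_size + 10)
        return bytes(random.randrange(256) for _ in range(size))
```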
The original IR design is described at https://github.com/HypothesisWorks/hypothesis/issues/3086#issuecomment-1774233444, though some small interface details have since changed.
Implementation
Completed:
- initial refactorings
- #3788
- #3801
- #3818
- #3806
- #3899
- #3962
- #4007 (+ migrate `generate_mutations_from`)
- #4097
- https://github.com/HypothesisWorks/hypothesis/pull/4138
Ongoing work, roughly in order of expected completion:
- [ ] finish migrating the shrinker
- [ ] migrate `Optimiser` (used by `target()`)
- [ ] migrate the inquisitor (`explain` phase), see https://github.com/HypothesisWorks/hypothesis/issues/3864
- [ ] migrate `ParetoOptimiser`
- [ ] migrate the database to serialized IR instead of buffers (https://github.com/HypothesisWorks/hypothesis/compare/master...Zac-HD:hypothesis:ir-serializer)