
Explaining failing examples - by showing which arguments (don't) matter


Hypothesis has many features designed to help users find bugs - but helping users understand bugs is equally important! Our headline feature for that is shrinking, but I think we should treat minimal failing examples as a baseline[^1]. That's why I implemented basic fault-localization in explain mode, and want to take that further by generalizing failing examples.

One key insight here is that the feature should be UX-first, defined by the question "what output would help users understand why their test failed"[^2]. The approach I've chosen amounts to:

  1. Shrink to a minimal failing example,
  2. Determine which arguments can be freely varied without changing the failure, and
  3. Print a comment like `# or any other generated value` next to each such argument.

Of these, the difficult part is modifying the conjecture internals for (2):

  • Identify the span corresponding to each argument to @given
  • Replay up to the start of that span, use fresh random bits within it, and replay the suffix after the span (using some new ConjectureData internals)
  • Track which arguments ever failed to reproduce the failure. Optimization: check which previously-executed examples met the criteria and count them towards the analysis.
  • We'll have a distinct comment for the case where varying all of these arguments at once still reproduces the failure, and otherwise just report which arguments reproduce it when varied one at a time. Trying to report subsets is confusing, expensive to compute, and not that useful. (The core resampling check is sketched below.)
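
To make step (2) concrete, here is a minimal sketch of the resampling check, assuming a simplified model in which each test case is drawn from a flat byte buffer. The names `spans` (mapping each argument to its (start, end) slice of the buffer) and `reproduces_failure` (which runs the test on a candidate buffer) are illustrative, not the real ConjectureData internals:

import random

def freely_variable_arguments(buffer, spans, reproduces_failure, trials=50):
    # For each argument, splice fresh random bytes into its span while
    # replaying the prefix and suffix unchanged.  If every trial still
    # fails, that argument's value doesn't matter to this failure.
    can_vary = set()
    for name, (start, end) in spans.items():
        if all(
            reproduces_failure(
                buffer[:start]  # replay the prefix verbatim
                + bytes(random.randrange(256) for _ in range(end - start))
                + buffer[end:]  # replay the suffix verbatim
            )
            for _ in range(trials)
        ):
            can_vary.add(name)
    return can_vary

As noted in the third bullet, previously-executed examples that already meet this criterion can be counted towards the analysis for free, without any extra test executions.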

This approach is coarser-grained than the prior art (see https://github.com/HypothesisWorks/hypothesis/issues/2192), but conversely it can be used with data that does not match a context-free grammar. On the whole, I like it much more 🙂

[^1]: not least because the threshold problem can make failures look less important, e.g. https://github.com/HypothesisWorks/hypothesis/issues/2180
[^2]: rather than e.g. "what cool algorithmic or instrumentation trick could I pull?", as is considerably more common.

Zac-HD avatar Jul 17 '22 07:07 Zac-HD

A simple but concrete example to illustrate:

from hypothesis import given, strategies as st

@given(st.integers(), st.integers())
def test_division(x, y):
    x / y  # raises ZeroDivisionError whenever y == 0, regardless of x

Currently reports:

Falsifying example: test_division(
    y=0, x=0,
)

Desired report:

Falsifying example: test_division(
    x=0,  # or any other generated value
    y=0,
)

Zac-HD avatar Jul 17 '22 07:07 Zac-HD

I have a working prototype! It only shows comments if the end of the buffer can vary, but I've plumbed everything through, and handling arbitrary segments shouldn't be much harder - I just need to work out when to start replaying the saved suffix. Still very exciting to see 🎉
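
For intuition about why the end-of-buffer case is the easy one: when an argument's span runs to the end of the buffer there is no suffix to replay, so truncating the buffer and letting the engine draw fresh bytes from there on is enough. Continuing the hypothetical model from the sketch above:

def tail_can_vary(buffer, start, reproduces_failure, trials=50):
    # No suffix to replay: truncate at `start` and assume the engine
    # draws fresh random bytes for everything past that point.
    return all(reproduces_failure(buffer[:start]) for _ in range(trials))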

update: complete working implementation at https://github.com/Zac-HD/hypothesis/compare/creation-reprs...which-parts-matter

Zac-HD avatar Jan 09 '23 12:01 Zac-HD