Explaining failing examples - by showing which arguments (don't) matter
Hypothesis has many features designed to help users find bugs - but helping users understand bugs is equally important! Our headline feature for that is shrinking, but I think we should treat minimal failing examples as a baseline[^1]. That's why I implemented basic fault-localization in explain mode, and want to take that further by generalizing failing examples.
One key insight here is that the feature should be UX-first, defined by the question "what output would help users understand why their test failed"[^2]. The approach I've chosen amounts to:
1. Shrink to a minimal failing example,
2. Determine which arguments can be freely varied without changing the failure, and
3. Print a comment like `# or any other generated value` next to each such argument.
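To make the shape of this concrete, here's a minimal sketch of the whole loop. Everything here is illustrative: `vary_argument` and `still_fails` are hypothetical stand-ins for the engine machinery, passed in as callables rather than real Hypothesis APIs.

```python
def explain(minimal, vary_argument, still_fails, n_attempts=50):
    """Report which arguments of a minimal failing example can vary freely.

    `minimal` maps argument names to their shrunken values; `vary_argument`
    and `still_fails` are caller-supplied stand-ins for the engine machinery.
    """
    can_vary = {
        name
        for name in minimal  # step 2: probe each argument independently
        if all(
            still_fails(vary_argument(minimal, name)) for _ in range(n_attempts)
        )
    }
    for name, value in minimal.items():  # step 3: print the annotated report
        comment = "  # or any other generated value" if name in can_vary else ""
        print(f"    {name}={value!r},{comment}")
    return can_vary
```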
Of these, the difficult part is modifying the conjecture internals for (2):
- Identify the span corresponding to each argument to `@given`
- Replay up to the start of that span, use new random bits within it, and then replay the suffix after the span (using some new `ConjectureData` internals; sketched after this list)
- Track which arguments ever failed to reproduce the failure. Optimization: check which previously-executed examples met the criteria and count them towards the analysis.
- We'll have one distinct comment for "varying all of these together still reproduces the failure", and otherwise just report which arguments reproduce the failure when varied one at a time. Trying to report subsets is confusing, expensive to compute, and not that useful.
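At the byte-buffer level, checking one argument amounts to splicing fresh random bytes into its span and replaying the result, and the report then needs only the two comment cases above. A rough sketch, assuming for simplicity that a redraw consumes exactly as many bytes as the original span (in practice it may not - that's the replay-the-suffix problem), with `reproduces_failure` as a caller-supplied stand-in rather than the real `ConjectureData` API, and with comment wording that's illustrative rather than final:

```python
import secrets

def vary_span(buffer: bytes, start: int, end: int) -> bytes:
    # Keep the prefix and suffix of the minimal failing buffer intact,
    # and redraw only the bytes inside one argument's span.
    return buffer[:start] + secrets.token_bytes(end - start) + buffer[end:]

def span_can_vary(buffer, span, reproduces_failure, n_attempts=50):
    # reproduces_failure is a stand-in: replay a candidate buffer through
    # the test and report whether the same failure occurs.
    start, end = span
    return all(
        reproduces_failure(vary_span(buffer, start, end))
        for _ in range(n_attempts)
    )

def comment_for(name, varies_alone, all_vary_together):
    # One distinct comment when redrawing *every* argument at once still
    # fails; otherwise annotate only arguments that reproduce one-at-a-time.
    if all_vary_together:
        return "  # or any other generated value"
    if name in varies_alone:
        return "  # or any other generated value (one-at-a-time)"
    return ""
```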
This approach is coarser-grained than the prior art (see https://github.com/HypothesisWorks/hypothesis/issues/2192), but conversely can be used with data that does not match a context-free grammar. On the whole, I like it much more 🙂
[^1]: not least because the threshold problem can make failures look less important, e.g. https://github.com/HypothesisWorks/hypothesis/issues/2180

[^2]: rather than e.g. "what cool algorithmic or instrumentation trick could I pull?", as is considerably more common.
A simple but concrete example to illustrate:
```python
from hypothesis import given, strategies as st

@given(st.integers(), st.integers())
def test_division(x, y):
    x / y
```
Currently reports:
```
Falsifying example: test_division(
    y=0,
    x=0,
)
```
Desired report:
```
Falsifying example: test_division(
    x=0,  # or any other generated value
    y=0,
)
```
I have a working prototype! It only shows comments if the end of the buffer can vary, but I've plumbed everything through, and handling arbitrary segments shouldn't be much harder - I just need to work out when to start replaying the saved suffix. Still very exciting to see 🎉
Update: complete working implementation at https://github.com/Zac-HD/hypothesis/compare/creation-reprs...which-parts-matter