
Explaining failing examples - by showing which arguments (don't) matter


Hypothesis has many features designed to help users find bugs - but helping users understand bugs is equally important! Our headline feature for that is shrinking, but I think we should treat minimal failing examples as a baseline[^1]. That's why I implemented basic fault-localization in explain mode, and want to take that further by generalizing failing examples.

One key insight here is that the feature should be UX-first, defined by the question "what output would help users understand why their test failed"[^2]. The approach I've chosen amounts to:

  1. Shrink to a minimal failing example,
  2. Determine which arguments can be freely varied without changing the failure, and
  3. Print a comment like `# or any other generated value` next to each such argument.

Of these, the difficult part is modifying the conjecture internals for (2):

  • Identify the span corresponding to each argument to @given
  • Replay up to the start of that span, use fresh random bits within it, and replay the suffix after the span (using some new ConjectureData internals)
  • Track which arguments ever failed to reproduce the failure. Optimization: check which previously-executed examples met the criteria and count them towards the analysis.
  • We'll have a distinct comment for the case where varying all of these arguments at once still reproduces the failure, and otherwise just report which arguments reproduce it when varied one at a time. Trying to report subsets is confusing, expensive to compute, and not that useful. (The core resampling check is sketched below.)
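
To make step (2) concrete, here is a minimal sketch of the resampling check, assuming a simplified model in which each test case is drawn from a flat byte buffer. The names `spans` (mapping each argument to its (start, end) slice of the buffer) and `reproduces_failure` (which runs the test on a candidate buffer) are illustrative, not the real ConjectureData internals:

import random

def freely_variable_arguments(buffer, spans, reproduces_failure, trials=50):
    # For each argument, splice fresh random bytes into its span while
    # replaying the prefix and suffix unchanged.  If every trial still
    # fails, that argument's value doesn't matter to this failure.
    can_vary = set()
    for name, (start, end) in spans.items():
        if all(
            reproduces_failure(
                buffer[:start]  # replay the prefix verbatim
                + bytes(random.randrange(256) for _ in range(end - start))
                + buffer[end:]  # replay the suffix verbatim
            )
            for _ in range(trials)
        ):
            can_vary.add(name)
    return can_vary

As noted in the third bullet, previously-executed examples that already meet this criterion can be counted towards the analysis for free, without any extra test executions.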

This approach is coarser-grained than the prior art (see https://github.com/HypothesisWorks/hypothesis/issues/2192), but conversely it can be used with data that does not match a context-free grammar. On the whole, I like it much more 🙂

[^1]: not least because the threshold problem can make failures look less important, e.g. https://github.com/HypothesisWorks/hypothesis/issues/2180
[^2]: rather than e.g. "what cool algorithmic or instrumentation trick could I pull?", as is considerably more common.

Zac-HD avatar Jul 17 '22 07:07 Zac-HD

A simple but concrete example to illustrate:

from hypothesis import given, strategies as st

@given(st.integers(), st.integers())
def test_division(x, y):
    x / y  # raises ZeroDivisionError whenever y == 0, regardless of x

Currently reports:

Falsifying example: test_division(
    y=0, x=0,
)

Desired report:

Falsifying example: test_division(
    x=0,  # or any other generated value
    y=0,
)

Zac-HD avatar Jul 17 '22 07:07 Zac-HD

I have a working prototype! It only shows comments if the end of the buffer can vary, but I've plumbed everything through, and handling arbitrary segments shouldn't be much harder - I just need to work out when to start replaying the saved suffix. Still very exciting to see 🎉
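
For intuition about why the end-of-buffer case is the easy one: when an argument's span runs to the end of the buffer there is no suffix to replay, so truncating the buffer and letting the engine draw fresh bytes from there on is enough. Continuing the hypothetical model from the sketch above:

def tail_can_vary(buffer, start, reproduces_failure, trials=50):
    # No suffix to replay: truncate at `start` and assume the engine
    # draws fresh random bytes for everything past that point.
    return all(reproduces_failure(buffer[:start]) for _ in range(trials))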

update: complete working implementation at https://github.com/Zac-HD/hypothesis/compare/creation-reprs...which-parts-matter

Zac-HD avatar Jan 09 '23 12:01 Zac-HD