Action items: adopting property-based testing in CPython's CI

Zac-HD opened this issue 3 years ago · 19 comments

At the 2021 Language Summit - as in 2020 - I proposed that CPython should use Hypothesis in CI and increase its use of fuzzing (via OSS-Fuzz). Core devs were generally keen, and so @brettcannon asked me to open an issue here to track the action items. As listed in my slides, those are:

  1. Merge https://github.com/python/cpython/pull/22863, by @pganssle, allowing addition of further property-based tests.
  2. Run property-based tests in CI - blocking merge on failure, since they'd be mostly ignored otherwise.

The main question is how exactly to get Hypothesis installed into the CI environment. @zooba was concerned about depending on being able to pip install things before we even run core tests, and suggested vendoring Hypothesis. I'm not implacably opposed to vendoring, but hopefully Paul's shim approach plus pinning the version might already address this concern?

Tests for the parser or tokenize module would also pull in dependencies on hypothesmith, plus transitive dependencies on Lark and LibCST. This would be much more elegant to handle with shims than by vendoring, and particularly valuable given the history of Hypothesis-discovered parser bugs (e.g. BPO-40661, BPO-38953, and BPO-42218).

I also proposed cherry-picking from https://github.com/Zac-HD/stdlib-property-tests to test the parser and expanding CPython's use of OSS-Fuzz. However, I think that once the Steering Council has unblocked the first property-based tests for CI, further work can proceed on the usual issue-by-issue or PR-by-PR basis. I will of course be happy to help with this in whatever way I can.

Zac-HD avatar May 13 '21 09:05 Zac-HD

Two other alternatives to 2:

  1. Only run the hypothesis tests on a buildbot, but run the stubs on CI (as we're doing in python/cpython#22863)
  2. Add the hypothesis CI run as an optional (non-blocking) test, with the stubbed-out tests still in the "blocking" path.

The way I see it, if we write the property-based tests with lots of @example decorators and continue to add @examples each time one of the full hypothesis runs fails, the result of the stub-only hypothesis runs should be increasingly likely to match the hypothesis buildbot runs. When building out new hypothesis tests, it should be a relatively simple matter for a user to run the tests locally with hypothesis and iteratively add failing examples.
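
For illustration - the property and explicit examples here are made up, not taken from the actual PR - such a test might look like:

```python
from hypothesis import example, given, strategies as st

@given(st.integers())
@example(0)                # explicit cases, added whenever a full
@example(-(2 ** 31))       # hypothesis run finds a new failure
def test_int_str_roundtrip(x):
    assert int(str(x)) == x
```

With the stubs, only the @example cases would run (as subtests); with hypothesis installed, the randomized search runs on top of them.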

pganssle avatar May 13 '21 14:05 pganssle

I have some concerns about Hypothesis. I worry that there is a rather hard "sell" going on here. Can we perhaps have an open discussion on the merits of Hypothesis itself on python-dev before committing to this? It seems to promote a coding style that's more functional than is good for Python, and its inner workings seem too magical to me. Also the decorator-based DSL looks pretty inscrutable to me.

gvanrossum avatar May 13 '21 17:05 gvanrossum

Can we perhaps have an open discussion on the merits of Hypothesis itself on python-dev before committing to this?

You mean hypothesis specifically or property-based testing in general? I think that if we add property-based testing, hypothesis is probably the only natural choice. Of course, I can make this case on the thread if that was your intention.

It seems to promote a coding style that's more functional than is good for Python, and its inner workings seem too magical to me. Also the decorator-based DSL looks pretty inscrutable to me.

Do you mean it's promoting a coding style within the tests, or in the public API? When designing zoneinfo, I didn't think at all about how easy it would be to test with Hypothesis as part of the API design, and it was basically frictionless.

I think there are situations where it makes sense to write functional-style strategies, because that can give hypothesis more information about how the strategies are transformed (and thus let it optimize how the data is drawn, rather than drawing from the full set and discarding everything that doesn't match, or running extra logic on each draw), but it's by no means required. Most of my "complex strategies" are decorated functions rather than chains of maps and filters.
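
For illustration (both strategies below are invented, not from any real test suite), the two styles look roughly like:

```python
from hypothesis import strategies as st

# Functional style: transform a base strategy with map/filter chains.
even_then_odd = (
    st.integers()
    .filter(lambda n: n % 2 == 0)   # may discard many draws
    .map(lambda n: (n, n + 1))
)

# Decorated-function style: often easier to read for complex data.
@st.composite
def even_then_odd_composite(draw):
    n = draw(st.integers().map(lambda x: 2 * x))  # draw an even integer directly
    return (n, n + 1)
```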

From the stdlib-property-tests repo, this seems like the most complicated strategy employed, and I find it fairly simple to understand; I'm not sure any code that generates trees of arbitrary depth containing objects with specific properties would be much easier to read.
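
As a rough sketch of that kind of strategy (not the actual one from the repo), generating arbitrarily deep trees is on the order of:

```python
from hypothesis import strategies as st

# Leaves are simple scalar values; internal nodes are non-empty lists
# or dicts of subtrees.
json_like_trees = st.recursive(
    st.none() | st.booleans() | st.integers() | st.text(),
    lambda children: st.lists(children, min_size=1)
    | st.dictionaries(st.text(), children),
    max_leaves=25,
)
```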

pganssle avatar May 13 '21 17:05 pganssle

Let's have this discussion elsewhere, not in the Steering Council's tracker.

gvanrossum avatar May 13 '21 18:05 gvanrossum

Discussion has continued on BPO-42109.

Zac-HD avatar May 20 '21 06:05 Zac-HD

FWIW, I have a better understanding now of what Hypothesis does and how. I withdraw my opposition to Hypothesis insofar as it concerns how it works. I leave it up to the SC to consider the other issues that have been brought up, largely around having a 3rd party dependency in the core tests. (Zac, thanks for your patience with me and for your help in understanding Hypothesis' approach.)

gvanrossum avatar May 24 '21 15:05 gvanrossum

@brettcannon - this seems to be some combination of stuck, forgotten, or low-priority. Is there anything I can do to help out, instead of coming back for another presentation in 2022?

Zac-HD avatar Aug 11 '21 04:08 Zac-HD

@Zac-HD get people to stop submitting PEPs or asking other stuff of us? 😉

We haven't forgotten about it, but we were already working from a backlog when this was submitted (e.g. #63 and that whole discussion predate this request, and we still haven't resolved them).

brettcannon avatar Aug 11 '21 20:08 brettcannon

Sorry about the delayed response on this!

What we are looking for is a PR with a framework that stubs out hypothesis in the stdlib (@pganssle was working on this at some point, but we don't think he is currently). An email to python-dev to gather help probably wouldn't hurt. Otherwise, once a PR showing how it will work is ready, a separate email to python-dev to allay concerns, get feedback, etc. would be good.

brettcannon avatar Nov 22 '21 20:11 brettcannon

We talked about this again and we agreed we need a working, practical solution for the core devs to evaluate.

brettcannon avatar Feb 28 '22 19:02 brettcannon

With PEP-678 done (for now), I'll aim to have a working up-to-date PR for evaluation by or at PyCon 👍

Zac-HD avatar Mar 01 '22 00:03 Zac-HD

any update on this?

gpshead avatar Oct 10 '22 18:10 gpshead

Since I (as a hypothesis team member) am very interested in bringing this to CPython, I would like to propose a solution.

TLDR (I've made a PR with the setup example): https://github.com/sobolevn/cpython/pull/4

largely around having a 3rd party dependency in the core tests

I think this is the main problem right now. As far as I know, CPython's test framework does not have any 3rd party dependencies (except for optional coverage, which is not directly used).

But some other tasks do:

  • sphinx / sphinx-lint / python-docs-theme for docs
  • blurb / cherry-pick for maintenance
  • Maybe something else, which I am not familiar with

I think we should treat hypothesis tests more like a Tools/ thing than a Lib/test thing. So, we can store hypothesis tests separately in Tools/hypothesis/tests/ with their own docs and workflows.

I also don't think that running hypothesis on every build is worth it. Instead, we can:

  • Run hypothesis on cron, say once a week, just to be sure. We can use the same workflow we use in typeshed, where any cron failures are reported as new issues (example, code). These issues can be assigned directly to me to triage.
  • Trigger the hypothesis tests with a test-with-hypothesis label, like we do with test-with-buildbots (btw, we can do it natively with GitHub Actions)
  • Manual trigger

Basically, the action file would look like so: https://github.com/sobolevn/cpython/blob/8e20a3185939d01723ba928c35393ced0bf55053/.github/workflows/hypothesis.yml

This looks quite simple to me. And, most importantly, we can decrease the complexity for regular contributors, who won't have to deal with fairly complex and unfamiliar property-based tests.

Versioning

hypothesis has a lot of releases, which is why I use >=6,<7 to pin only the major version. Bumping the major version can be done once every ~two years.

Next steps

  • [ ] Moving https://github.com/Zac-HD/stdlib-property-tests to the setup I proposed, so we can start testing things
  • [ ] Adding new tests for things that need it the most
  • [ ] Adding docs about how to work with this
  • [ ] Adding policies: when to run this

Feedback is welcome!

CC @Zac-HD

sobolevn avatar Oct 11 '22 11:10 sobolevn

This sounds like a very reasonable plan to me. I don't want the core test suite to depend on packages from outside our repo, but a scheduled buildbot totally can.

An additional policy to determine is "what do we do when tests fail?" The downside of a non-blocking test suite is that someone has to be watching and reacting to it. We can't just assume that'll happen.

zooba avatar Oct 11 '22 16:10 zooba

An additional policy to determine is "what do we do when tests fail?" The downside of a non-blocking test suite is that someone has to be watching and reacting to it. We can't just assume that'll happen.

We could have a bot automatically create an issue when the workflow fails. That would bring greater visibility to the failure. We have a daily workflow at typeshed which does this (though it has a few warts, e.g. you have to click through to the workflow in order to see exactly which tests failed where).

Typeshed workflow: https://github.com/python/typeshed/blob/708996dff09581e52ae6c91711f82d293d3ec542/.github/workflows/daily.yml#L85-L103

Example issue: https://github.com/python/typeshed/issues/8818

AlexWaygood avatar Oct 11 '22 16:10 AlexWaygood

Apologies for letting this slip for so long.

I'd pretty strongly prefer @pganssle's approach of providing a Hypothesis shim, so that you can run without Hypothesis installed - in which case the @example(...) cases are treated as subtests - and including this in the usual tests.
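
Roughly - and this is only my sketch of the idea, not the actual code from the PR - such a shim could look like:

```python
# Sketch only: fall back to stub decorators when Hypothesis isn't installed,
# so the usual test suite still runs the explicit @example(...) cases.
try:
    from hypothesis import example, given
except ImportError:
    def example(*args, **kwargs):
        # Record an explicit case on the test function for the stub to replay.
        def accept(test):
            test._examples = getattr(test, "_examples", []) + [(args, kwargs)]
            return test
        return accept

    def given(*_strategies, **_kw_strategies):
        # Without Hypothesis, replay only the recorded @example cases as
        # subtests. Assumes the decorated function is a unittest.TestCase method.
        def accept(test):
            def wrapper(self):
                for ex_args, ex_kwargs in getattr(test, "_examples", []):
                    with self.subTest(args=ex_args, kwargs=ex_kwargs):
                        test(self, *ex_args, **ex_kwargs)
            return wrapper
        return accept
```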

I'm also pretty nervous that running as a cron job will make the experience much more frustrating, as problems are reported after they're merged rather than before. What's the actual downside of running a Hypothesis job in CI? The compute cost is trivial, and failures have to be addressed at some point regardless.

(My main project for this month is open-sourcing the fuzzer I've been building for a while; once I get the config and workflow nailed down that would be a great thing to drop in a cron job. Happy to be pulled in on CPython issues or PRs anytime though, they'll take even higher priority.)

Zac-HD avatar Oct 12 '22 01:10 Zac-HD

@Zac-HD, yes, I agree that running hypothesis tests as a required part of CI is much better. But this might not happen due to the complexity it adds to the whole process.

For example, let's say some new contributor adds something or fixes a bug, and the hypothesis tests fail right in CI. What do they do? I sometimes scratch my head over such failures for hours.

But in the scenario I proposed as a compromise, we can still benefit from having hypothesis in CPython.

With this approach we can:

  • Assign me to newly reported issues to triage / fix them
  • Hope that someday it will become mandatory, once we have actual production experience and more team expertise

So, please consider my proposal as plan "B", so we can actually get this merged in the near future. But if the SC is fine with @pganssle's solution, I would be even happier 😊

sobolevn avatar Oct 12 '22 11:10 sobolevn

Our required CI tests are already too unstable. If we add anything to them, I'd want it to be entirely static, and ideally transparent, which basically means a list of all the test cases.

Provided they're checked in and it's obvious which test is failing, I'm not concerned about how they're generated, but they can't depend on installing anything from PyPI. And as sobolevn says, they should never fail on unrelated PRs. I'm not convinced (a) that hypothesis can guarantee that (and I say that as a user who deals with new random failures due to hypothesis in my other projects), and (b) that, if it does, it still provides any value over a static list of examples.

zooba avatar Oct 12 '22 16:10 zooba

Our required CI tests are already too unstable. If we add anything to them, I'd want it to be entirely static, and ideally transparent, which basically means a list of all the test cases.

I think this makes Paul's stubs approach a good one: it's a dependency-free list of test cases written out in the source code, but with the advantage that you can use Hypothesis to find additional test cases outside of your required CI tests, e.g. in a cronjob. I'd personally still run a non-required CI job with Hypothesis installed too, but I'm not going to argue further.

My colleague Nelson wrote about the workflow here. I've also been playing with the idea of using the fuzzer to find a minimal set of covering examples which can be automatically added to the source code; if CPython would use that I'll bump it to the top of my todo list.

Zac-HD avatar Oct 12 '22 17:10 Zac-HD

🎊 https://github.com/python/cpython/pull/22863 has been merged, and property-based tests are now run in CPython's CI 🎊

There are plenty of possible enhancements (more tests, fuzzing, etc.), but none that require a decision by the Steering Council. With thanks for your patience, I think it's time to close this issue at last!

Zac-HD avatar May 22 '23 15:05 Zac-HD