Venturecxx
The examples are tested unsatisfactorily
There are integration tests, but they amount to checking that the examples do not crash on startup: each example is run under `timeout`, and the test checks that it indeed times out (rather than dying first).
Subproblems:
- We don't want the integration test suite to block on a 20-minute inference-quality run of, e.g., Crosscat just to test one example; the presumed solution is to give the examples `--smoketest` flags or something similar (a rough sketch appears at the end of this comment)
- Some examples try to make plots, which may get splattered into the current directory or, worse yet, onto the screen.
Part of #48 .
Edit: See updated summary below (dated Nov 10, 2015).
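For concreteness, a minimal sketch of what the `--smoketest` idea could look like in one of the Python examples. The flag name, parameter names, and parameter values here are all made up; `run_inference` is a stand-in for whatever the example actually does:

```python
# Hypothetical sketch of a --smoketest flag that shrinks an example's
# parameters so the integration suite can run it quickly.
import argparse

def run_inference(num_transitions, num_samples):
    # Stand-in for the example's real inference loop.
    for _ in range(num_samples):
        for _ in range(num_transitions):
            pass

def main():
    parser = argparse.ArgumentParser(description="example inference run")
    parser.add_argument("--smoketest", action="store_true",
                        help="use tiny parameters; only check that the example runs")
    args = parser.parse_args()
    # Full-size defaults; a Crosscat-style run might take ~20 minutes.
    num_transitions = 5 if args.smoketest else 10000
    num_samples = 2 if args.smoketest else 100
    run_inference(num_transitions, num_samples)

if __name__ == "__main__":
    main()
```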
There are also still a few IPython notebooks knocking around in the examples/ directory. We should either deploy a means to mechanically test them, or finally get rid of them.
https://github.com/bollwyvl/nosebook might be useful?
Where do these integration tests live? It might be worth having a Jenkins instance that is expected to take an hour or more per run but runs less frequently, perhaps daily. Long-running jobs certainly seem like an actual use case, so making sure that we aren't crashing on larger problems, e.g. due to a new memory leak, seems worthwhile.
The screen problem seems common and worth fixing. Given that we're likely to use the plot-to-screen functionality frequently and notice if it breaks, I'd emphasize mechanically testing the plot-to-file variant instead. Writing checks that, e.g., between 1% and 50% of such a file's pixels are dark seems like a decent way to verify that the plots are doing something reasonable, though of course those bounds should be set empirically.
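A sketch of that check, assuming PNG output and PIL/numpy; the grayscale cutoff and the 1%-50% bounds are placeholders to be calibrated against known-good plots:

```python
# Sketch of the dark-pixel sanity check for a plot-to-file test.
# The cutoff (128) and the 1%-50% bounds are assumptions that would
# need to be set empirically per plot.
import numpy as np
from PIL import Image

def dark_pixel_fraction(path, cutoff=128):
    gray = np.asarray(Image.open(path).convert("L"))  # 8-bit grayscale
    return float((gray < cutoff).mean())

def check_plot(path, lo=0.01, hi=0.50):
    frac = dark_pixel_fraction(path)
    assert lo < frac < hi, "%s is %.1f%% dark; expected %.0f%%-%.0f%%" % (
        path, 100 * frac, 100 * lo, 100 * hi)
```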
The notebooks question feels like a separate bug. Perhaps part of #50 instead. I'll add luac's suggestion there.
The extant tests are in `test/integration/test_examples.py`.
In principle every code artifact under the examples/ directory should be either tested or deleted.
It's been a while since I looked through them to ascertain what they were examples of and whether they were still useful as examples.
I agree that in principle a slow integration test that runs them with large parameters is a useful Jenkins job, but I would be uncomfortable if that were the only thing that exercised them, so I think fast smoke tests are still valuable.
After discussion today, we want to convert all the IPython notebooks and all the other sets of examples to the erb/markdown/venture format used for the tutorial, and support basic assertions about these example sequences.
We didn't discuss how to make smoke testing work, but I could imagine, e.g., mechanically transforming all numbers >= 3 down to 3 (or similar simple transformations), running through the entire suite, and ensuring that the type of result at each point is what we expect: a number, a plot, nothing, etc.
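For instance, the "clamp every large literal" transformation might look roughly like this crude regex sketch; a real version would need more care about what counts as a tunable parameter:

```python
# Crude sketch of the smoke-test transformation: clamp every integer
# literal down to 3 before running an example.  The lookarounds skip
# digits that are part of identifiers (x10) or floats (3.14).
import re

def shrink_numbers(source, cap=3):
    def clamp(match):
        return str(min(int(match.group(0)), cap))
    return re.sub(r"(?<![\w.])\d+(?![\w.])", clamp, source)

print(shrink_numbers("infer(mh(default, one, 10000))"))
# -> infer(mh(default, one, 3))
```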
For the longer-running tests, @riastradh-probcomp's suggestion of keeping golden files at some level of comparison still bugs me, but I'm not sure of the value of just running them either, especially when any kind of interactivity is expected (e.g. closing a plot before going on). My inclination would be to stub out plot generation for this longer-running version, assert that the full suite takes about as long to run as it historically has, and assert that the plots have roughly the same color composition as they historically have. The problem of blessing a particular instance of the history still comes up, as does the question of what "roughly the same" means.
So now this devolves into a few tasks:
- deciding what's worth converting
- doing the conversion
- ensuring that the generation/testing doesn't break on it. I will put off starting this until Issue #54 is better resolved, because the mechanism for that is likely to help here. My priority would be to do the simplified smoke tests first, and only later to think through how to do the longer integration runs.
I interpret the above conversation as implying that this issue was blocked on #54. That being closed, labeling unblocked.
To clarify the current goal here:
- We want to crash-test every example in `examples/`, except those to be pruned by #144
- [The IPython notebooks in `examples/notebooks` are optional with respect to Release 0.4.3]
- The examples should be abstracted to
  - be importable (if written in Python)
  - permit programmatic control of time/accuracy tradeoffs (# runs, # transitions, # observations, etc.)
  - permit programmatic control of whether to plot on screen, and if not, what directory to plot to
  - ideally retain a main program that runs with reasonable-size parameters and displays plots interactively (if there are few enough of them)
- Tests should be added that import each example and run it to completion with small numbers, plotting to a temporary directory (see the sketch after this list)
  - Could also follow Grem's suggestion and check that the resulting images have between 1% and 50% dark pixels
- Tests in `test/integration/test_examples.py` that rely on external processes and calling `timeout` should be flushed.
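Assuming that abstraction, the replacement in-process test could be as simple as the following; the module name and argument names here are hypothetical stand-ins for whatever interface the examples end up exposing:

```python
# Sketch of an in-process smoke test, assuming each abstracted Python
# example exposes a main() with size and plotting parameters (the
# exact module and argument names here are hypothetical).
import shutil
import tempfile

def test_crp_2d_demo_smoke():
    import crp_2d_demo  # importable per the abstraction requirement
    plot_dir = tempfile.mkdtemp()
    try:
        crp_2d_demo.main(num_transitions=3, num_samples=2,
                         plot_to_screen=False, plot_dir=plot_dir)
    finally:
        shutil.rmtree(plot_dir)
```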
Let's call this one the stretch goal for a good release 0.5. Labeling blocked at least on #51, and probably even on everything else in the release 0.5 milestone.
The punch list:
- [x] crosscat.vnt
- [x] crp_2d_demo.py
- [x] gaussian_funnel.py
- [x] gaussian_geweke.py
- [x] hmc-demo.py
- [ ] hmm.vnt
- [x] lda.vnt
- [x] profile_tricky_coin.py
- [x] trickiness-ideal.vnts
- [x] trickiness-concrete.vnts
- [x] trickiness-concrete-2.vnts
- [x] brownian/{film,plot}.vnt
- [x] examples/plotting/*.vnt
- [x] examples/ppaml-talk/pipits.vnt
- [x] examples/venstan/*.vnts
Is `nesterov` broken in the Brownian motion example?
Will not test `examples/brownian`.
Will not test `examples/venstan`. That integration is marginal enough, and is itself tested separately, that it's not worth the bother of keeping the example pristine.
As the one who added pipits last summer, I don't think it's worth maintaining; it doesn't really exemplify anything in particular.