Venturecxx
The examples are tested unsatisfactorily
There are integration tests, but they amount to checking that the examples do not crash on startup: each example is run under `timeout`, and the test checks that it indeed times out (rather than dying first).
Subproblems:
- We don't want the integration test suite to block on a 20-minute inference-quality run of, e.g., Crosscat just to test one example; the presumed solution is to give the examples `--smoketest` flags or something similar (a rough sketch appears at the end of this comment)
- Some examples try to make plots, which may get splattered into the current directory or, worse yet, onto the screen.
Part of #48 .
Edit: See updated summary below (dated Nov 10, 2015).
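For concreteness, a minimal sketch of what the `--smoketest` idea could look like in one of the Python examples. The flag name, parameter names, and parameter values here are all made up; `run_inference` is a stand-in for whatever the example actually does:

```python
# Hypothetical sketch of a --smoketest flag that shrinks an example's
# parameters so the integration suite can run it quickly.
import argparse

def run_inference(num_transitions, num_samples):
    # Stand-in for the example's real inference loop.
    for _ in range(num_samples):
        for _ in range(num_transitions):
            pass

def main():
    parser = argparse.ArgumentParser(description="example inference run")
    parser.add_argument("--smoketest", action="store_true",
                        help="use tiny parameters; only check that the example runs")
    args = parser.parse_args()
    # Full-size defaults; a Crosscat-style run might take ~20 minutes.
    num_transitions = 5 if args.smoketest else 10000
    num_samples = 2 if args.smoketest else 100
    run_inference(num_transitions, num_samples)

if __name__ == "__main__":
    main()
```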
There are also still a few IPython notebooks knocking around in the examples/ directory. We should either deploy a means to mechanically test them, or finally get rid of them.
https://github.com/bollwyvl/nosebook might be useful?
Where do these integration tests live? It might be worth having a Jenkins instance that is expected to take an hour or more per run but runs less frequently, perhaps daily. Long-running jobs certainly seem like an actual use case, so making sure that we aren't crashing on larger problems, e.g. due to a new memory leak, seems worthwhile.
The screen problem seems common and worth fixing. Given that we're likely to use the plot-to-screen functionality frequently and notice if it breaks, I'd emphasize mechanically testing the plot-to-file variant instead. Writing checks that, e.g., between 1% and 50% of such a file's pixels are dark seems like a decent way to verify that the plots are doing something reasonable, though of course those bounds should be set empirically.
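A sketch of that check, assuming PNG output and PIL/numpy; the grayscale cutoff and the 1%-50% bounds are placeholders to be calibrated against known-good plots:

```python
# Sketch of the dark-pixel sanity check for a plot-to-file test.
# The cutoff (128) and the 1%-50% bounds are assumptions that would
# need to be set empirically per plot.
import numpy as np
from PIL import Image

def dark_pixel_fraction(path, cutoff=128):
    gray = np.asarray(Image.open(path).convert("L"))  # 8-bit grayscale
    return float((gray < cutoff).mean())

def check_plot(path, lo=0.01, hi=0.50):
    frac = dark_pixel_fraction(path)
    assert lo < frac < hi, "%s is %.1f%% dark; expected %.0f%%-%.0f%%" % (
        path, 100 * frac, 100 * lo, 100 * hi)
```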
The notebooks question feels like a separate bug. Perhaps part of #50 instead. I'll add luac's suggestion there.
The extant tests are in `test/integration/test_examples.py`.
In principle every code artifact under the examples/ directory should be either tested or deleted.
It's been a while since I looked through them to ascertain what they were examples of and whether they were still useful as examples.
I agree that in principle a slow integration test that runs them with large parameters is a useful Jenkins job, but I would be uncomfortable if that were the only thing that exercised them, so I think fast smoke tests are still valuable.
After discussion today, we want to convert all the IPython notebooks and all the other sets of examples to the erb/markdown/venture format used for the tutorial, and support basic assertions about these example sequences.
We didn't discuss how to make smoke testing work, but I could imagine, e.g., mechanically transforming all numbers >= 3 down to 3 (or similar simple transformations), running through the entire suite, and ensuring that the type of result at each point is what we expect: a number, a plot, nothing, etc.
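For instance, the "clamp every large literal" transformation might look roughly like this crude regex sketch; a real version would need more care about what counts as a tunable parameter:

```python
# Crude sketch of the smoke-test transformation: clamp every integer
# literal down to 3 before running an example.  The lookarounds skip
# digits that are part of identifiers (x10) or floats (3.14).
import re

def shrink_numbers(source, cap=3):
    def clamp(match):
        return str(min(int(match.group(0)), cap))
    return re.sub(r"(?<![\w.])\d+(?![\w.])", clamp, source)

print(shrink_numbers("infer(mh(default, one, 10000))"))
# -> infer(mh(default, one, 3))
```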
For the longer-running tests, @riastradh-probcomp's suggestion of keeping golden files at some level of comparison still bugs me, but I'm not sure of the value of just running them either, especially when any kind of interactivity is expected (e.g. closing a plot before going on). My inclination would be to stub out plot generation for this longer-running version, assert that the full suite takes about as long to run as it historically has, and assert that the plots have roughly the same color composition as they historically have. The problem of blessing a particular instance of the history still comes up, as does the question of what "roughly the same" means.
So now this devolves into a few tasks:
- deciding what's worth converting
- doing the conversion
- ensuring that the generation/testing doesn't break on it. I will put off starting this until Issue #54 is better resolved, because the mechanism for that is likely to help here. My priority would be to do the simplified smoke tests first, and only later to think through how to do the longer integration runs.
I interpret the above conversation as implying that this issue was blocked on #54. That being closed, labeling unblocked.
To clarify the current goal here:
- We want to crash-test every example in `examples/`, except those to be pruned by #144
- [The IPython notebooks in `examples/notebooks` are optional with respect to Release 0.4.3]
- The examples should be abstracted to
  - be importable (if written in Python)
  - permit programmatic control of time/accuracy tradeoffs (# runs, # transitions, # observations, etc.)
  - permit programmatic control of whether to plot on screen, and if not, what directory to plot to
  - ideally retain a main program that runs with reasonable-size parameters and displays plots interactively (if there are few enough of them)
- Tests should be added that import each example and run it to completion with small numbers, plotting to a temporary directory (see the sketch after this list)
  - Could also follow Grem's suggestion and check that the resulting images have between 1% and 50% dark pixels
- Tests in `test/integration/test_examples.py` that rely on external processes and calling `timeout` should be flushed.
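Assuming that abstraction, the replacement in-process test could be as simple as the following; the module name and argument names here are hypothetical stand-ins for whatever interface the examples end up exposing:

```python
# Sketch of an in-process smoke test, assuming each abstracted Python
# example exposes a main() with size and plotting parameters (the
# exact module and argument names here are hypothetical).
import shutil
import tempfile

def test_crp_2d_demo_smoke():
    import crp_2d_demo  # importable per the abstraction requirement
    plot_dir = tempfile.mkdtemp()
    try:
        crp_2d_demo.main(num_transitions=3, num_samples=2,
                         plot_to_screen=False, plot_dir=plot_dir)
    finally:
        shutil.rmtree(plot_dir)
```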
Let's call this one the stretch goal for a good release 0.5. Labeling blocked at least on #51, and probably even on everything else in the release 0.5 milestone.
The punch list:
- [x] crosscat.vnt
- [x] crp_2d_demo.py
- [x] gaussian_funnel.py
- [x] gaussian_geweke.py
- [x] hmc-demo.py
- [ ] hmm.vnt
- [x] lda.vnt
- [x] profile_tricky_coin.py
- [x] trickiness-ideal.vnts
- [x] trickiness-concrete.vnts
- [x] trickiness-concrete-2.vnts
- [x] brownian/{film,plot}.vnt
- [x] examples/plotting/*.vnt
- [x] examples/ppaml-talk/pipits.vnt
- [x] examples/venstan/*.vnts
Is `nesterov` broken in the Brownian motion example?
Will not test `examples/brownian`.
Will not test `examples/venstan`. That integration is marginal enough, and is itself tested separately, that it's not worth the bother of keeping the example pristine.
As the one who added pipits last summer, I don't think it's worth maintaining; it doesn't really exemplify anything in particular.