
Benchmark Regression Test Suite

Open · AdamGleave opened this issue 3 years ago · 0 comments

Problem

imitation's testing is currently limited to static analysis (type checking, linting, etc.) and unit tests. There are no automated, end-to-end tests of algorithm training performance. This is problematic because small implementation details in reward learning and imitation learning can have a big impact on performance.

Solution

However, we already have tuned hyperparameters and some initial results in https://github.com/HumanCompatibleAI/imitation/tree/master/benchmarking thanks to @taufeeque9 (a PR with the code used for hyperparameter tuning should be forthcoming soon as well). This suggests we could simply have a test suite that trains all the algorithms using these existing configs and records their performance. If performance drops by more than some threshold, a warning or error could be issued.
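A minimal sketch of what that check could look like, assuming a hypothetical `train_with_config` callable that runs one benchmarking config end-to-end and returns the mean evaluation return, plus a hypothetical JSON file of reference returns; none of these names are existing imitation APIs, and the relative threshold is illustrative.

```python
import json
import pathlib
from typing import Callable, List

# Fraction of the reference return we are willing to lose before flagging a
# regression; the right value would need tuning per algorithm/environment.
RELATIVE_THRESHOLD = 0.9


def check_regressions(
    config_dir: pathlib.Path,
    reference_path: pathlib.Path,
    train_with_config: Callable[[pathlib.Path], float],
) -> List[str]:
    """Train each benchmarking config and compare its mean return to a stored reference.

    `train_with_config` is a placeholder for running the corresponding imitation
    training script end-to-end and returning the mean evaluation return.
    """
    reference = json.loads(reference_path.read_text())
    failures = []
    for config in sorted(config_dir.glob("*.json")):
        mean_return = train_with_config(config)
        expected = reference[config.stem]
        if mean_return < RELATIVE_THRESHOLD * expected:
            failures.append(
                f"{config.stem}: got {mean_return:.1f}, "
                f"expected at least {RELATIVE_THRESHOLD * expected:.1f}"
            )
    return failures
```

The returned list of failure messages could then be turned into either warnings or a hard CI failure, depending on how strict we want to be.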

Training end-to-end is far too slow to run on every commit, but it is something we could afford to do before each (significant) release, or before merging PRs that we're worried might cause regressions.

There are existing tools designed to track metrics over time, such as airspeed velocity (asv). Integrating with one of these might make sense, but I don't yet know how well their features line up with what we need.
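For reference, asv records a per-commit value from benchmark methods prefixed with `track_`, so a suite along these lines might work; the `run_bc_benchmark` helper and the environment name below are placeholders, not real imitation functions.

```python
"""Sketch of an asv tracking benchmark (would live under a benchmarks/ directory)."""


def run_bc_benchmark(env_name: str) -> float:
    """Placeholder: train BC with the tuned config for `env_name` and return
    the mean evaluation return. Not an existing imitation API."""
    raise NotImplementedError("wire this up to imitation's training scripts")


class TrackImitationReturns:
    # End-to-end training is slow, so give asv a generous per-benchmark timeout.
    timeout = 3600
    # For track_* benchmarks, asv uses `unit` to label the plotted value.
    unit = "mean episode return"

    def track_bc_seals_half_cheetah(self):
        # asv stores the returned number for each commit it benchmarks,
        # which gives us a performance-over-time plot for free.
        return run_bc_benchmark("seals_half_cheetah")
```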

Possible alternative solutions

We could also just add a handful of end-to-end training tests to pytest, marked as "expensive", skipped by default, and runnable on demand, each asserting that reward is above some threshold. This might give us a non-trivial fraction of the benefit for much less work.
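A sketch of how that could look with a custom pytest marker and command-line flag; the `train_and_evaluate` helper and the threshold are illustrative placeholders rather than real imitation code.

```python
# conftest.py: register an "expensive" marker and skip those tests unless
# --run-expensive is passed on the command line.
import pytest


def pytest_addoption(parser):
    parser.addoption(
        "--run-expensive",
        action="store_true",
        default=False,
        help="run slow end-to-end training tests",
    )


def pytest_configure(config):
    config.addinivalue_line("markers", "expensive: slow end-to-end training test")


def pytest_collection_modifyitems(config, items):
    if config.getoption("--run-expensive"):
        return
    skip = pytest.mark.skip(reason="needs --run-expensive to run")
    for item in items:
        if "expensive" in item.keywords:
            item.add_marker(skip)


# test_benchmark_regression.py: one test per algorithm/config pair.
def train_and_evaluate(algo: str, env_name: str) -> float:
    """Placeholder: train `algo` end-to-end with its tuned config on `env_name`
    and return the mean evaluation return. Not an existing imitation API."""
    raise NotImplementedError


@pytest.mark.expensive
def test_bc_reward_regression():
    mean_return = train_and_evaluate("bc", "seals_half_cheetah")
    assert mean_return > 1000.0  # illustrative threshold, not a measured value
```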

AdamGleave · Dec 28 '22 02:12