Introduce compiler timing script & CI job
We have an existing benchcomp script to measure Kani's performance in CI, but it is prone to noise and only focuses on the end-to-end performance of compilation and verification together, leaving a need for more granular measurements.
This script focuses solely on changes in the runtime of the Kani compiler and runs with warm ups, repeats and outlier detection (based on the rust compiler's method in rustc-perf) in an attempt to limit noise. The new compile-timer-short CI job uses this script on a subset of the perf tests (currently excluding s2n-quic, kani-lib/arbitrary & misc/display-trait) to produce a benchcomp-like table comparing the compiler performance per-crate before and after a given commit.
This also modifies our auto-labeller to ensure end-to-end benchmarks (like benchcomp) and the new compiler-specific ones are only run when the parts of Kani that they profile have changed.
I manually tested the CI job on my personal fork (see this old successful run or this up-to-date intentionally failed run from a test PR that slows down the compiler).
Resolves #2442
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 and MIT licenses.
I also ran this workflow on an old, slightly performance-regressing PR (#3374), and it successfully detects the small regression with this output.
I haven't reviewed the code yet, but a thought--I think we'll get more stable results out of this if we run on more LoC--the perf tests are a good starting point (and a way to get good feedback), but given that we're measuring compilation times I'd imagine the results would be more stable if we had more to compile.
I'm not opposed to having a short job for quick feedback, though. Are you planning to add a longer one later?
(I realize that I told you to start on the perf tests, so it's not a criticism 😁 ) I think we should also test on some bigger codebases too.
@carolynzech I've added an additional compiler-timing-long job that runs on all the s2n-quic harnesses and takes about ~90s for each full compilation run (compared to the -short job which only takes ~3s). This should reduce noise even more with the tradeoff of taking about 50 minutes to run in CI rather than 15.
Here's the link to the newest CI run on my branch that has updated significance thresholds and small formatting changes. Should be finally ready for review now!
Here is a run of the workflow on a PR that we expect to cause a small-to-medium regression. I tried to create a change that isn't overly contrived. The results gives me confidence that the workflow is at a good spot with sensitivity--it catches the regression comfortably, without failing on every harness (which might suggest it's overly sensitive).