
Refactor benchmark jobs in ci.yml

ScottTodd opened this issue on Feb 16, 2024 · 6 comments

Following this discussion on Discord.

These jobs in the ci.yml workflow are increasing infrastructure complexity and are a bottleneck on total presubmit time:

  • build_benchmark_tools
  • build_e2e_test_artifacts
  • test_benchmark_suites
  • compilation_benchmarks (opt-in)
  • execution_benchmarks (opt-in)
  • process_benchmark_results (opt-in)

When no benchmarks are enabled, the build_e2e_test_artifacts -> test_benchmark_suites jobs still run. The intent was to verify that benchmark suite models compile successfully and execute correctly even when they are not benchmarked. In reality, the test_benchmark_suites jobs run only one test, iree/tools/test/iree_run_module_correctness_test, while the build_e2e_test_artifacts job spends 9 minutes building multiple platform variants of the same program. The coverage provided is redundant, and the overall setup adds several minutes of excess overhead (not even counting time spent waiting for expensive self-hosted runners).

I propose that we:

  • remove the test_benchmark_suites matrix entirely
  • move build_benchmark_tools and build_e2e_test_artifacts to opt-in, gated on benchmarks being requested (using a preset? a sketch follows this list)
  • (as needed) add correctness integration test cases for benchmark programs to experimental/regression_suite rather than trying to bolt something onto the existing benchmarking infrastructure like those "run module tests"
  • (as cleanup) eject the benchmark jobs from ci.yml and place them in benchmark.yml
    • they can still use presubmit artifacts via GCS storage or GitHub Actions Artifacts. Ideally they could also run on release artifacts
    • they can continue to use configure_ci.py and PR labels as they currently do
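
To make the opt-in gating concrete, here is a minimal sketch in the style of configure_ci.py. The `benchmarks` label name and the helper functions are hypothetical, not the repository's actual identifiers; only the job names come from this issue.

```python
# Hypothetical sketch of label-based opt-in gating, loosely modeled on how
# configure_ci.py maps PR labels to the set of jobs to run. The label name
# below is illustrative, not the repository's actual identifier.

BENCHMARK_JOBS = frozenset({
    "build_benchmark_tools",
    "build_e2e_test_artifacts",
    "test_benchmark_suites",
    "compilation_benchmarks",
    "execution_benchmarks",
    "process_benchmark_results",
})


def benchmarks_requested(pr_labels: set[str], is_postsubmit: bool) -> bool:
    """Benchmarks always run postsubmit, but only on labeled presubmit runs."""
    return is_postsubmit or "benchmarks" in pr_labels


def select_jobs(all_jobs: set[str], pr_labels: set[str],
                is_postsubmit: bool) -> set[str]:
    """Drops every benchmark job unless benchmarks were explicitly requested."""
    if benchmarks_requested(pr_labels, is_postsubmit):
        return all_jobs
    return all_jobs - BENCHMARK_JOBS
```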

cc @pzread

ScottTodd · Feb 16, 2024

FWIW, I looked at the logs for a sample run of build_e2e_test_artifacts to see if there were some easy optimizations to make, and it doesn't look like there are:

  • 4m spent compiling various programs (no huge outlier program like other jobs have had before)
  • 1m40s spent uploading outputs
  • 1m fetching Python deps from pip
  • 40s downloading a Dockerfile
  • 10s configuring CMake
  • < 5s downloading sources and importing them

just... lots of overhead (about 3.5 of the 9 minutes) and then a decent number of programs to compile

ScottTodd · Feb 16, 2024

SG. I'll say that from time to time we catch problems with build_e2e_test_artifacts, but we can switch, observe failures in postsubmit, and then pick the important models into regression_suite. And thanks for looking into these.

pzread · Feb 16, 2024

For the test_benchmark_suites job, @hcindyl might know more about it.

pzread · Feb 16, 2024

I'm seeing if I can pull some statistics on how long each job spends queued right now. (If that ends up taking longer than I'd like, I may just proceed without data, but I'd like to know...) For queueing, we could also add more runners, or switch those jobs to GitHub-hosted runners instead of our own.
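
In case it's useful, per-job queue time can be approximated from the Actions REST API: each job in the "List jobs for a workflow run" response carries created_at and started_at timestamps, and their difference is roughly the time spent queued. A rough sketch, where the run id is a placeholder:

```python
# Rough sketch: approximate per-job queue time for one workflow run using the
# GitHub Actions REST API ("List jobs for a workflow run"). Queue time is
# taken as started_at - created_at for each job.
from datetime import datetime

import requests

REPO = "openxla/iree"
RUN_ID = 1234567890  # placeholder: a real workflow run id


def parse(ts: str) -> datetime:
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")


resp = requests.get(
    f"https://api.github.com/repos/{REPO}/actions/runs/{RUN_ID}/jobs",
    headers={"Accept": "application/vnd.github+json"},
    params={"per_page": 100},
)
resp.raise_for_status()

for job in resp.json()["jobs"]:
    # Jobs that are still queued (or were cancelled early) have no started_at.
    if job.get("started_at") and job.get("created_at"):
        queued = parse(job["started_at"]) - parse(job["created_at"])
        print(f"{job['name']}: queued for {queued}")
```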

I'd rather just eliminate the overhead and complexity instead of throwing more/different resources at it, though. Queueing time and job orchestration are only part of the cost.

ScottTodd · Feb 16, 2024

See https://github.com/openxla/iree/pull/16438#issuecomment-1949078215 for the use of test_benchmark_suites. It is in the CI to check for LLVM codegen regressions in a few realistic models. I have lost track of when they were turned off in https://github.com/openxla/iree/tree/main/tests/e2e, but historically the test gave us early warning of RISCV codegen breakage before it trickled down to projects that use IREE, e.g., https://opensecura.googlesource.com/sw/vec_iree and https://opensecura.googlesource.com/ml-models-public

hcindyl · Feb 16, 2024

Slight adjustments to the plans from the initial issue description:

  • build_benchmark_tools, build_e2e_test_artifacts, and test_benchmark_suites are now opt-in and will only run when benchmarks are requested (benchmarks are always requested on postsubmit runs)
  • test_benchmark_suites has some useful coverage again, but those tests should be moved to a regular (non-benchmark) presubmit CI job - likely under experimental/regression_suite (a sketch follows this list), though cross_compile_and_test in ci.yml could also be used in the short term
  • the benchmark jobs should still be moved out of ci.yml
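
For reference, a rough sketch of what a regression_suite correctness case could look like as a plain pytest test. The model file, entry function, input, and expected output are placeholders; only the general shape of the iree-compile / iree-run-module invocations reflects the real tools, and exact flag spellings may have shifted.

```python
# Hypothetical sketch of a correctness case for experimental/regression_suite,
# written as a plain pytest test. All model-specific values are placeholders.
import subprocess


def test_benchmark_model_correctness(tmp_path):
    vmfb = tmp_path / "model.vmfb"
    # Compile the imported model for the CPU backend.
    subprocess.run(
        [
            "iree-compile",
            "model.mlir",  # placeholder: an imported benchmark model
            "--iree-hal-target-backends=llvm-cpu",
            "-o", str(vmfb),
        ],
        check=True,
    )
    # iree-run-module exits nonzero when outputs do not match the expectation,
    # so correctness is checked simply by requiring a zero exit code.
    subprocess.run(
        [
            "iree-run-module",
            f"--module={vmfb}",
            "--device=local-task",
            "--function=main",                    # placeholder entry point
            "--input=1x4xf32=1 2 3 4",            # placeholder input
            "--expected_output=1x4xf32=2 4 6 8",  # placeholder expectation
        ],
        check=True,
    )
```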

ScottTodd · Feb 19, 2024