
Convert more CI jobs to be package-based

ScottTodd opened this issue 1 year ago • 29 comments

### Tasks
- [x] Refactor CMake project so all "integration tests", even those using `lit`, can be run using packages instead of source builds (`-DIREE_BUILD_COMPILER`)
- [x] Create `install-dir` in `build_all` step
- [x] Switch CI jobs from `build-dir` to `install-dir`
- [x] Move `test_tf_integrations` job to pkgci
- [ ] Build iree-dist in pkgci.yml at the head of the pipeline
- [ ] Add "integration tests" workflow that uses packages from a previous job or a github release
- [ ] Move `test_gpu` job to pkgci
- [ ] Give recently changed workflows new names that better match their scope
- [ ] Optimize CI runtime build times by refactoring the HAL CTS
- [ ] Use GitHub-hosted runners in more jobs if possible
- [ ] Migrate / clean-up Docker images ('frontends' images can likely go away)
- [ ] Create a CMake subproject for integration tests to better separate them from the core project

### Background information

Test workflows (for integration and e2e tests as well as benchmarks) should consume packages, not source build directories. Build workflows could run the unit tests.

For example, the test_all workflow currently downloads a nearly 10GB build archive here: https://github.com/openxla/iree/blob/79b6129e2333ae26e7e13b68c27566102dcece6e/.github/workflows/ci.yml#L290-L294

The PkgCI workflows show how this could be set up.

This may involve restructuring some parts of the project (like tests/e2e/) to be primarily based on packages and not the full CMake project.

In ci.yml, these are the jobs that currently depend on the build archive produced by build_all:

  • [x] test_all
  • [x] test_gpu
  • [x] test_a100
  • [x] test_tf_integrations
  • [x] test_tf_integrations_gpu
  • [x] build_benchmark_tools
  • [x] build_e2e_test_artifacts
  • [x] cross_compile_and_test
  • [x] build_and_test_android
  • [x] test_benchmark_suites

Related discussions:

ScottTodd avatar Jan 24 '24 20:01 ScottTodd

We could separate "unit tests" from "integration/e2e tests" in the CMake project. Unit tests should be able to run right after the build step, while integration tests should use a release/dist package for compiler tools and a [cross-compiled] runtime build for test binaries.

I'm considering a nested CMake project for integration tests, replacing the iree-test-deps utility target, but that might not be needed.


Take these test_gpu logs as an example.

That is running this command:

        run: |
          ./build_tools/github_actions/docker_run.sh \
              --env IREE_NVIDIA_SM80_TESTS_DISABLE \
              --env IREE_MULTI_DEVICE_TESTS_DISABLE \
              --env IREE_CTEST_LABEL_REGEX \
              --env IREE_VULKAN_DISABLE=0 \
              --env IREE_VULKAN_F16_DISABLE=0 \
              --env IREE_CUDA_DISABLE=0 \
              --env IREE_NVIDIA_GPU_TESTS_DISABLE=0 \
              --env CTEST_PARALLEL_LEVEL=2 \
              --env NVIDIA_DRIVER_CAPABILITIES=all \
              --gpus all \
              gcr.io/iree-oss/nvidia@sha256:892fefbdf90c93b407303adadfa87f22c0f1e84b7e819e69643c78fc5927c2ba \
              bash -euo pipefail -c \
                "./build_tools/scripts/check_cuda.sh
                ./build_tools/scripts/check_vulkan.sh
                ./build_tools/cmake/ctest_all.sh ${BUILD_DIR}"

With all of those filters set, these are the only test source directories included:

iree/hal/drivers/cuda2/cts                          =  83.08 sec*proc (24 tests)
iree/hal/drivers/vulkan                             =   0.33 sec*proc (1 test)
iree/hal/drivers/vulkan/cts                         =  44.97 sec*proc (12 tests)
iree/modules/check/test                             =   5.95 sec*proc (2 tests)
iree/samples/custom_dispatch/vulkan/shaders         =   2.19 sec*proc (2 tests)
iree/samples/simple_embedding                       =   0.65 sec*proc (1 test)
iree/tests/e2e/linalg                               =   8.50 sec*proc (7 tests)
iree/tests/e2e/linalg_ext_ops                       =  32.42 sec*proc (11 tests)
iree/tests/e2e/matmul                               =   8.48 sec*proc (5 tests)
iree/tests/e2e/regression                           =  47.80 sec*proc (41 tests)
iree/tests/e2e/stablehlo_models/mnist_train_test    =  29.41 sec*proc (2 tests)
iree/tests/e2e/stablehlo_ops                        = 410.55 sec*proc (302 tests)
iree/tests/e2e/tensor_ops                           =   9.71 sec*proc (6 tests)
iree/tests/e2e/tosa_ops                             = 142.57 sec*proc (120 tests)
iree/tests/e2e/vulkan_specific                      =   5.20 sec*proc (5 tests)
iree/tests/transform_dialect/cuda                   =   0.20 sec*proc (1 test)
  • iree/hal/drivers (except cts) is pure runtime code
  • iree/hal/drivers/*/cts can use compiler tools
  • iree/tests/ mostly consists of "check" tests. A few tests use lit (with iree-compile, iree-opt, iree-run-mlir, FileCheck, iree-run-module, etc.)
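For context, a "check" test is essentially "compile a module at build time, then run it on the target device at test time"; a minimal sketch (the tool flags and file names here are assumptions, not copied from the actual build rules):

```bash
# Compile a test program ahead of time (this is what the check-test CMake
# rules and the iree-test-deps target do when generating *_module.vmfb files).
iree-compile --iree-hal-target-backends=vulkan-spirv \
    some_op_test.mlir -o check_vulkan-spirv_vulkan_some_op_test.vmfb

# At test time, run the compiled module's checks on the actual device.
iree-check-module --device=vulkan \
    --module=check_vulkan-spirv_vulkan_some_op_test.vmfb
```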

ScottTodd avatar Jan 25 '24 00:01 ScottTodd

+1 - I've been meaning to do something like that.

The line between unit and integration tests can sometimes be blurred a bit, but a unit test can never require special hardware. That belongs in something that can be independently executed with tools provided out of band.

stellaraccident avatar Jan 25 '24 00:01 stellaraccident

I'm deciding which of these job sequences to aim for:

  • build_dist_package --> compile_test_deps --> test_gpu
  • build_dist_package --> test_gpu

Portable targets like Android require running the compiler on a host machine, but other jobs like test_gpu that run on Linux/Windows can run the compiler if they want. The current tests/e2e/ folder after building iree-test-deps is ~34MB with ~1500 files and large CPU runners take around 30 seconds to generate all of those .vmfb files. All of those stats will likely grow over time. We don't really want to be spending CPU time on GPU machines, but keeping that flexibility for e2e Python tests could be useful.
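If we ever do split that work, the CPU-side half could be as small as something like this (purely hypothetical; the archive name and paths are made up for illustration):

```bash
# On a CPU runner: generate all e2e test modules (~30s today), then pack just
# the generated artifacts for the GPU runner.
cmake --build ../iree-build/ --target iree-test-deps
tar -czf e2e-test-artifacts.tar.gz -C ../iree-build/ tests/e2e

# A GPU runner with its own runtime-only build tree would then extract this
# archive into that tree before running ctest, instead of compiling locally.
```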

ScottTodd avatar Jan 25 '24 19:01 ScottTodd

Got some good data from my test PR https://github.com/openxla/iree/pull/16216 (on the first try too, woohoo!)


Here's a sample run using just the "install" dir from a prior job: https://github.com/openxla/iree/actions/runs/7659217597/job/20874202528?pr=16216.

| Stage | Time taken | Notes |
| --- | --- | --- |
| Checkout | 1m16s | Could use the smaller 'runtime' checkout? |
| Download install/ dir | 5s | 2.97GB file, could be trimmed - see notes below |
| Extract install/ dir | 30s | |
| Build runtime | 1m | |
| Build test deps | 2m04s | Currently generating all test .vmfb files, even those for CPU |
| Test all | 6m20s | |
| TOTAL | 11m47s | |

The install dir has two copies of the 3.6GB libIREECompiler.so that don't appear to be symlinks:

  • install/lib/libIREECompiler.so
  • install/python_packages/iree_compiler/iree/compiler/_mlir_libs/libIREECompiler.so

If this was using an "iree-dist" package instead of the full "install/" directory, that would just be the bin/ and lib/ folders, instead of also including python_packages/.
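(A quick sanity check for that, assuming the install layout above:)

```bash
# ls -l prints "->" for symlinks; two plain ~3.6GB entries means two real copies.
ls -lh install/lib/libIREECompiler.so \
       install/python_packages/iree_compiler/iree/compiler/_mlir_libs/libIREECompiler.so

# Hash both to confirm they are byte-identical duplicates, not different builds.
sha256sum install/lib/libIREECompiler.so \
          install/python_packages/iree_compiler/iree/compiler/_mlir_libs/libIREECompiler.so
```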


Compared to a baseline run using the full "build" dir from a prior job: https://github.com/openxla/iree/actions/runs/7647953292/job/20840561210

| Stage | Time taken | Notes |
| --- | --- | --- |
| Checkout | 46s | |
| Download build/ dir | 9s | 5.94 GB (was 8.5GB) |
| Extract build/ dir | 1m01s | |
| Test all | 6m37s | |
| TOTAL | 8m46s | |

It doesn't seem too unreasonable to keep the test artifact generation on the same job (on a GPU machine), at least with the current number of test cases. It would be nice to share workflow configurations between "desktop Linux GPU" and "Android GPU" though.

ScottTodd avatar Jan 25 '24 20:01 ScottTodd

oof at double libIREECompiler.so

may need some more samples - 46s -> 1m16s for the same checkout makes me wonder if the timescales match

benvanik avatar Jan 25 '24 20:01 benvanik

> oof at double libIREECompiler.so

Stella has been suggesting using a package distribution like iree-dist (without that problem), but I was just starting with the install/ directory from a regular source build.

> may need some more samples - 46s -> 1m16s for the same checkout makes me wonder if the timescales match

That variance looks about right for Linux (just clicking through asan, tsan, build_all, test_all, etc. jobs on a run like https://github.com/openxla/iree/actions/runs/7647953292 and looking at the checkout step in each).

The smaller "runtime only" checkout is more like 5s (no submodules) + 6s (runtime submodules): https://github.com/openxla/iree/actions/runs/7647953292/job/20839868561.

Having these test jobs use the main project makes drawing solid lines around components like "compiler" and "runtime" hard though. I don't really want to accidentally exclude certain tests by forcing "integration test" jobs to use the big -DIREE_BUILD_COMPILER=OFF hammer. If tests were in a separate CMake project (nested in a subdir of the repo) then it would be easier to say to test authors/maintainers (oh... that's me) "these have to use installed artifacts, work with that".

ScottTodd avatar Jan 25 '24 20:01 ScottTodd

I think I'll try converting tests/ (and later possibly samples/) into a standalone CMake project, possibly with the ability to still include it from the root project for developer source builds.

The test_gpu job would still build the runtime (for CUDA, Vulkan, etc. unit tests that run on a GPU), but it would also build the "tests" subproject using iree-dist (or install/). That may pull in FileCheck, llvm-lit.py, and other tools from LLVM that are needed to run the tests, but it would not build the compiler binaries from source.

ScottTodd avatar Jan 26 '24 00:01 ScottTodd

Found a few things to fix first / as part of this.

Most of the tests in tests/e2e/stablehlo_models/ have been skipped, I think following https://github.com/openxla/iree/pull/15837 with this change:

ctest_all.sh
+if (( IREE_METAL_DISABLE == 1 )); then
+  label_exclude_args+=("^driver=metal$")
+fi

The combination of these different filtering mechanisms is excluding tests since no CI configuration has both Vulkan and Metal: https://github.com/openxla/iree/blob/f3b008c6db310f787ad76f151c21a30f72b14794/tests/e2e/stablehlo_models/CMakeLists.txt#L32-L35 https://github.com/openxla/iree/blob/f3b008c6db310f787ad76f151c21a30f72b14794/tests/e2e/stablehlo_models/edge_detection.mlir#L2-L4

I don't think we should use `RUN: [[ $IREE_*_DISABLE == 1 ]]` any longer. These tests should have separate targets for each HAL driver ('check' test suites do this automatically).

ScottTodd avatar Jan 26 '24 16:01 ScottTodd

I was considering disallowing "lit" integration tests in tests/ altogether, but many are legitimate uses (image omitted).

So I think we should still have the 'lit' / FileCheck tooling available on host platforms that run those tests.

ScottTodd avatar Jan 26 '24 16:01 ScottTodd

Yeah, we can fix that double compiler binary thing with the proper flow.

Looks like the main variance in timing is coming from build test deps. Is that mostly coming down to CMake configure or something? Other than that, it is the same work done in a different place.

stellaraccident avatar Jan 26 '24 18:01 stellaraccident

> Looks like the main variance in timing is coming from build test deps. Is that mostly coming down to CMake configure or something? Other than that, it is the same work done in a different place.

Here is the timing on the GPU machine:

| Step | Timing |
| --- | --- |
| Configure | 20s |
| Build runtime | 60s |
| Build test deps (all) | 2m04s |
| Run ctest (GPU only) | 6m20s |

The "build test deps" step is running the compiler to generate .vmfb files for iree-check-module:

[1/1421] Generating check_vulkan-spirv_vulkan_conv2d.mlir_module.vmfb from conv2d.mlir
[2/1421] Generating check_winograd_vulkan-spirv_vulkan_conv2d.mlir_module.vmfb from conv2d.mlir
[3/1421] Generating check_large_linalg_matmul_cuda_f32_to_i4.mlir_module.vmfb from f32_to_i4.mlir
[4/1421] Generating check_large_linalg_matmul_cuda_conv2d.mlir_module.vmfb from conv2d.mlir
[5/1421] Generating check_vmvx_local-task_conv2d.mlir_module.vmfb from conv2d.mlir
[6/1421] Generating check_large_linalg_matmul_cuda_i4_to_f32.mlir_module.vmfb from i4_to_f32.mlir
[7/1421] Generating check_vulkan-spirv_vulkan_i4_to_f32.mlir_module.vmfb from i4_to_f32.mlir
[8/1421] Generating check_winograd_llvm-cpu_local-task_conv2d.mlir_module.vmfb from conv2d.mlir
[9/1421] Generating check_llvm-cpu_local-task_sort.mlir_module.vmfb from sort.mlir

If "build test deps" was taking 5+ minutes, I'd be more inclined to move it to a CPU machine and pass the test files to the GPU runner, as we have the Android / RISC-V cross-compilation test jobs currently configured. We might still end up with that sort of setup eventually, but I don't think it is needed for a "v0" of package-based CI.

ScottTodd avatar Jan 26 '24 18:01 ScottTodd

Yeah, and I'd rather make the compiler faster than build exotic infra unless it becomes needed...

stellaraccident avatar Jan 26 '24 18:01 stellaraccident

Made some pretty good progress on prerequisite tasks this week.


The latest thing I'm trying to enable is the use of iree_lit_test without needing to build the full compiler. That would let us run various tests under samples/ and tests/ that use lit instead of check (some for good reasons, others just by convention) with an iree-dist package*.

I have a WIP commit here that gets close: https://github.com/ScottTodd/iree/commit/47bb19ab3ef761118627c87a1a455ad2eb4ee2eb

* another way to enable that is to allow test jobs to set IREE_BUILD_COMPILER and then be very careful about which targets those jobs choose to build before running tests (ctest for now, but also pytest in the future?). Some tests require building googletest binaries like hal/drivers/vulkan/cts/vulkan_driver_test.exe, so I've been leaning on just building the all target, but we could instead have more utility targets like iree-test-deps. Actually, something like that may be a more robust idea... @stellaraccident have any suggestions? Something like iree-run-tests from https://github.com/openxla/iree/pull/12156 ?

ScottTodd avatar Jan 27 '24 00:01 ScottTodd

would really like to not overload IREE_BUILD_COMPILER - KISS - if we can't make iree_lit_test use FileCheck from the package for some reason then we should convert all those tests to something else (check tests, cmake tests, etc)

benvanik avatar Jan 27 '24 00:01 benvanik

Simple is what I'm aiming for... just figuring out how to get there still.

I want test jobs to run any tests that either use special hardware (GPUs) or use both the compiler and the runtime.

I'd like them to follow this pattern:

cmake -B ../iree-build-tests/ . -DIREE_HOST_BIN_DIR={PATH_TO_IREE_DIST} {CONFIGURE_OPTIONS}
cmake --build ../iree-build-tests/ {--target SOME_TARGET?}
ctest --test-dir ../iree-build-tests/
  • If {CONFIGURE_OPTIONS} can be empty or at least leave off -DIREE_BUILD_COMPILER=OFF then great
  • If the build step can build the default all, great. Otherwise, I can build iree-test-deps or some similar utility target

ScottTodd avatar Jan 27 '24 00:01 ScottTodd

> would really like to not overload IREE_BUILD_COMPILER - KISS - if we can't make iree_lit_test use FileCheck from the package for some reason then we should convert all those tests to something else (check tests, cmake tests, etc)

What Ben says. I've been down the path of mixing this stuff up and don't want to see anyone else fall in and then have to dig out of the pit.

We can package some of the test utilities. No big deal.

stellaraccident avatar Jan 27 '24 20:01 stellaraccident

I think enough of the tests are segmented now (or will be once I land a few PRs).

Next I was planning on

  • switching a few existing jobs from using the "build" archive to using the "install" archive in-place. That will let us closely check for coverage gaps
  • switching from "install" to "iree-dist", try using github artifacts instead of GCS
  • taking a closer look at which build jobs are actually needed and how they are configured... I have a feeling that we don't need asan/tsan/tracing/debug/etc. as currently implemented (we still want that coverage... but the current jobs are unwieldy)

ScottTodd avatar Jan 31 '24 00:01 ScottTodd

> • switching a few existing jobs from using the "build" archive to using the "install" archive in-place

Or I could just fork the jobs over to pkgci... that might be better. 🤔 (would want iree-dist over there, and maybe a few extra tools included like lit.py)

ScottTodd avatar Jan 31 '24 14:01 ScottTodd

Okay, I have a script that "tests a package": https://github.com/ScottTodd/iree/commit/43b3922191e6df895752ef5ac472169329802a50 . Going to try wiring that up to GitHub Actions, first pointed at the iree-dist-*.tar.xz files from a nightly release like https://github.com/openxla/iree/releases/tag/candidate-20240131.787.

The script is basically:

cmake -B build-tests/ -DIREE_BUILD_COMPILER=OFF -DIREE_HOST_BIN_DIR={PACKAGE_DIR}/bin
cmake --build build-tests/
cmake --build build-tests/ --target iree-test-deps
ctest --test-dir build-tests

with all the filtering goo from https://github.com/openxla/iree/blob/main/build_tools/cmake/ctest_all.sh (now that I say that I realize I could also call that script... but it might be worth keeping this self-contained before it gets too entangled again)
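For reference, the core of that filtering is just assembling a label-exclude regex and handing it to ctest; a trimmed sketch (only the `^driver=metal$` entry is taken from the real script, the Vulkan line is illustrative):

```bash
# Collect label patterns for configurations that are disabled on this runner.
label_exclude_args=()
if (( IREE_METAL_DISABLE == 1 )); then
  label_exclude_args+=("^driver=metal$")
fi
if (( IREE_VULKAN_DISABLE == 1 )); then
  label_exclude_args+=("^driver=vulkan$")  # illustrative; see ctest_all.sh for the real list
fi

# Join the patterns into one alternation and pass it as ctest's label exclude.
exclude_flags=()
if (( ${#label_exclude_args[@]} > 0 )); then
  exclude_flags=(--label-exclude "($(IFS="|"; echo "${label_exclude_args[*]}"))")
fi
ctest --test-dir build-tests/ --output-on-failure "${exclude_flags[@]}"
```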

The GitHub Action will need to

  • Clone the repo, with submodules
    • I'd like to limit it to only runtime submodules, but need at least lit.py installed for that
  • Download the release and unzip the iree-dist files (OR pull them from another action, like pkgci_build_packages.yml) - see the sketch after this list
  • Run the script (under Docker?)
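For the release-download path, one possible shape using the GitHub CLI (the tag is the nightly release linked above; the asset glob and directories are assumptions):

```bash
# Fetch the Linux iree-dist archive from a nightly candidate release.
gh release download candidate-20240131.787 \
    --repo openxla/iree \
    --pattern 'iree-dist*linux*' \
    --dir /tmp/iree-dist-download

# Unpack it; the package's bin/ directory is what -DIREE_HOST_BIN_DIR points at.
mkdir -p /tmp/iree-dist
tar -xf /tmp/iree-dist-download/iree-dist*.tar.xz -C /tmp/iree-dist
```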

We should then be able to use that for "test_cpu", "test_gpu_nvidia_a100", "test_gpu_nvidia_t4" and "test_gpu_amd_???" jobs (all Linux, but Windows/macOS could also work)

Could then include Python packages in the tests too and fold the "test_tf_integrations_gpu" job and other Python tests in as well.

The remaining jobs will need something else - probably just a different pile of GitHub Actions yaml...

build_benchmark_tools
build_e2e_test_artifacts
cross_compile_and_test
build_and_test_android
test_benchmark_suites

ScottTodd avatar Jan 31 '24 23:01 ScottTodd

Nice, that test script is working with iree-dist from a release.

Here's the workflow file: https://github.com/ScottTodd/iree/blob/infra-test-pkg/.github/workflows/test_package.yml

A few sample runs on a standard GitHub Actions Linux runner (CPU tests only):

  • https://github.com/ScottTodd/iree/actions/runs/7733994097/job/21087164304
  • https://github.com/ScottTodd/iree/actions/runs/7734158188/job/21087643265

(100% tests passed, 0 tests failed out of 647 🥳 -- though that does filter a few preexisting and new test failures)

Next I'll give that a coat of paint (organize the steps, prefetch Docker, trim the build, make the input package source configurable, etc.).

ScottTodd avatar Feb 01 '24 00:02 ScottTodd

On the topic of CI optimization, I've been wondering what to do with the CI jobs that are required to build from source:

| Job name | Approx timing | Details |
| --- | --- | --- |
| build_all | 7m20s | |
| build_test_all_bazel | 2m30s | |
| build_test_all_arm64 | 6m50s | (postsubmit) |
| build_test_all_windows | 14m30s | (postsubmit) |
| build_test_all_macos_arm64 | 6m50s | (postsubmit) |
| build_test_all_macos_x86_64 | 28m60s | (postsubmit) |
| build_test_runtime | 1m30s | |
| build_test_runtime_arm64 | 1m50s | |
| build_test_runtime_windows | 2m50s | |
| python_release_packages | 17m40s | |
| asan | 19m00s | (two configs) |
| tsan | 14m00s | (two configs) |
| small_runtime | 1m10s | |
| gcc | 4m50s | |
| tracing | 2m10s | |
| debug | 15m20s | |
| byo_llvm | 6m40s | |

(timings taken with a sample size of 1 from https://github.com/openxla/iree/actions/runs/7750585730)

  • We should be able to organize the build_ jobs into a matrix
  • The various configs (asan/gcc/debug/byo_llvm) are pretty slow to build, even with 99% cache hit rates. Sanitizers are really useful, but I want the critical path for most PRs to be much lighter :/

ScottTodd avatar Feb 02 '24 19:02 ScottTodd

(would also love to fix the names at some point)

benvanik avatar Feb 02 '24 19:02 benvanik

> • Sanitizers are really useful, but I want the critical path for most PRs to be much lighter :/

Here's an idea... what if we moved a chunk of those jobs to postsubmit (as a trial?) and watched for how often they fail or developers ask for them on presubmit? 🤔

ScottTodd avatar Feb 02 '24 19:02 ScottTodd

I feel like the asan one runs more tests than others and often catches things. if we had better hygiene around this stuff and didn't just tack on tests to random bots then I'd feel better about that, but as it is asan often catches things for me that nothing else does. it's also the only one we have leak checks for. tsan is much less important.

if you remove the extra tests that are in asan and put them someplace else I bet the time goes down.

benvanik avatar Feb 02 '24 19:02 benvanik

True, the asan job is the only one running the tests in https://github.com/openxla/iree/tree/main/llvm-external-projects/iree-dialects 😛 (that directory will eventually be folded into the main tree though)

ScottTodd avatar Feb 02 '24 19:02 ScottTodd

I think I have line of sight to completing this, or at least getting it close to the finish line.

Already done:

  • Structured CMake tests such that integration tests can be run using a runtime build (-DIREE_BUILD_COMPILER=OFF) and a release package (install dir, or iree-dist)
  • Proof of concept workflows for running integration tests with packages
  • Migrated benchmark and cross-compile jobs to using install-dir instead of build-dir

Ideas for next steps:

  • Drop remaining uses of build-dir
    • Run [compiler] unit tests as part of build_all
    • Migrate test_all, test_gpu, and test_a100 to use install-dir
    • Move test_tf_integrations and test_tf_integrations_gpu into pkgci.yml using Python releases (or fold into pkgci_regression_test_*.yml?)
    • Stop creating and uploading build-dir
  • Migrate jobs to pkgci.yml
    • Build iree-dist in pkgci.yml at the head of the pipeline
    • Land workflow for running tests given a package path (either release url or previous run artifact)
    • Move test_all, test_gpu, and test_a100 from ci.yml to pkgci.yml (with new names, using different Dockerfiles, etc.)
  • Make the CI faster
    • Optimize hal/cts/ build time so runtime builds with tests enabled are faster on machines without many CPU cores
  • Figure out what to do with the assorted jobs remaining in ci.yml
  • Create a CMake subproject for integration tests that includes the HAL CTS (and Vulkan/CUDA "dynamic_symbols_test"s?), some lit tests, and check tests. That would better separate test configuration code from core project code

ScottTodd avatar Feb 02 '24 23:02 ScottTodd

Made more progress today. After a few more PRs land I can remove the build-dir archive/upload steps.

After that I want to try

  • using workflow artifacts instead of GCS for the install dir. https://docs.github.com/en/actions/using-workflows/storing-workflow-data-as-artifacts
  • switching some jobs to using GitHub-hosted runners (lower disk space on those, so need to be careful - see https://github.com/openxla/iree/pull/16200)

ScottTodd avatar Feb 06 '24 00:02 ScottTodd

Leaving myself some notes before I context switch for a bit...

I want to move the "integration test" jobs from ci.yml to pkgci.yml. That will require switching them from using the "install dir" that is currently passed via GCS (gcloud storage ...) to using "iree-dist" (or the Python packages) passed via either GCS or GitHub Workflow Artifacts.

I have (very early) progress on my infra-pkgci-fork, infra-pkgci-venv, and infra-linux-dist branches. Ideas:

  • Add a build_linux_dist.sh script next to build_tools/pkgci/build_linux_packages.sh. That can build iree-install-dist-stripped by default, or iree-install-tools-runtime for faster iteration on standard runners in forks
    • This could share toolchain files and hopefully caches with the package build, or run in parallel on a separate runner
  • Make workflows like .github/workflows/pkgci_regression_test_cpu.yml able to run from either artifacts from a prior job OR artifacts from a github release in a given repo. That would let me iterate on workflows from a fork by pointing at the artifacts from a nightly release in the main repo
  • Make build_tools/pkgci/setup_venv.py portable to Windows / using windows_AMD64 instead of linux_x86_64 from a release so I can iterate on Python workflows locally

For the switch from GCS to workflow artifacts, I want to check how large the install/dist directories are, whether we can make them smaller, what limits GitHub imposes, and how fast uploads/downloads are. We can also set shorter artifact retention policies if storage starts incurring costs.
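A rough way to gather those numbers locally (paths are assumptions matching the directories discussed earlier):

```bash
# Uncompressed sizes of the candidate artifact and its biggest pieces.
du -sh install/ install/bin install/lib install/python_packages

# Approximate compressed size, i.e. what would actually be uploaded/downloaded
# as a workflow artifact or GCS object.
tar -cf - install/ | zstd -T0 -19 | wc -c | numfmt --to=iec
```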

ScottTodd avatar Feb 07 '24 18:02 ScottTodd

https://github.blog/2024-02-12-get-started-with-v4-of-github-actions-artifacts/ looks very promising for switching from GCS storage to workflow artifacts. Uploads and downloads of large files used to be pretty slow, but they seem to be much faster now.

ScottTodd avatar Feb 14 '24 00:02 ScottTodd

Passing around the "install" directory instead of release packages is still a bottleneck for GPU test jobs. Would be nice to complete this refactoring.

  • This recent test_amd_mi250 run took 1m30s to download the 3.2GB install dir then another 41s to extract it: https://github.com/iree-org/iree/actions/runs/9910771644/job/27389176075#step:4:1
  • This other Regression Test / test_onnx :: nvidiagpu_cuda run took 4s to download the 72MB release (no debug symbols though): https://github.com/iree-org/iree/actions/runs/9911468532/job/27390310065#step:5:22

ScottTodd avatar Jul 12 '24 20:07 ScottTodd