Add new framework coverage in system-level test suites
Overview
Let's add more test coverage for full programs, individual ops/layers, and other system-level behavior coming from frameworks like PyTorch, JAX, TF/TFLite, ONNX, etc.
"SHARK-TestSuite/e2eshark"
https://github.com/nod-ai/SHARK-TestSuite/tree/main/e2eshark has a new test suite that looks like a good fit for this.
Open questions
Brainstorming some things to consider:
- [ ] How many test cases does the suite contain?
- [ ] How long does the test suite take to run?
- [ ] Can the tests be divided into categories such as "ops", "small models", "large models"?
- [ ] What dependencies does the test suite have? (Python versions, pip package versions, etc.)
- [ ] Can the tests run using only release packages, or do they have source code or build directory dependencies?
- [ ] Do the tests require large downloads? If so, where are the necessary files hosted?
- [ ] Can individual tests be easily marked XFAIL and/or disabled?
- [ ] When a test fails, does it output enough logs and other artifacts for developers to triage the nature and cause of the failure?
- [ ] How should the test suite be included in the project? Added as a git submodule but maintained in a separate repo? Forked into IREE? Kept in a separate repo entirely (like https://github.com/iree-org/iree-samples/blob/main/.github/workflows/regression_tests.yml)?
Existing test suites
We have a few e2e (compiler -> runtime -> correctness checking) test suites in the project today. Those are run on our CI using these workflows:
- .github/workflows/pkgci_regression_test_cpu.yml
- .github/workflows/pkgci_test_tensorflow_cpu.yml
- .github/workflows/ci.yml (e.g. test_gpu)
Related discussions
- Discord: SHARK-TestSuite/e2eshark mention, feedback, kicking off planning
Well, https://github.com/iree-org/iree-samples/actions/workflows/regression_tests.yml / https://github.com/iree-org/iree-samples/blob/main/.github/workflows/regression_tests.yml (out-of-tree, unmonitored) is probably not a great reference - that's been failing for 7 months without much attention.
I started looking through https://github.com/nod-ai/SHARK-TestSuite/tree/main/e2eshark in more detail today.
- I'm not sure how much is reusable from https://github.com/nod-ai/SHARK-TestSuite/blob/main/e2eshark/run.py or the setup instructions in the README. That's all pretty bespoke and anchored on using developer tools from source builds. Some of that can be worked through step by step (e.g. https://github.com/nod-ai/SHARK-TestSuite/issues/49), but the script is only 1000 lines of Python and a rewrite with a focus on project-specific (or at least release-package-specific) infrastructure/usage feels easier.
- We could wrap a new test runner around the Python test case sources like those in https://github.com/nod-ai/SHARK-TestSuite/tree/main/e2eshark/onnx and https://github.com/nod-ai/SHARK-TestSuite/tree/main/e2eshark/pytorch.
We could also lift some test cases from https://github.com/llvm/torch-mlir/tree/main/projects/pt1/python/torch_mlir_e2e_test/test_suite like what @rsuderman is doing with https://github.com/llvm/torch-mlir/pull/2795.
experimental/regression_suite has approximately the shape of what I'd like to see in a test runner + suite configuration. A few things are potentially missing:
- `build_tools/pkgci/setup_venv.py` and pkgci are currently only configured for Linux, and I'd like to iterate from Windows
- Workflows like https://github.com/openxla/iree/blob/main/.github/workflows/pkgci_regression_test_cpu.yml can run from workflow artifacts in the same repository, but I'd like the ability to iterate from a fork or using release artifacts
- The suite operates at the level of "compiler input", so we'll want to choose a path (or several paths) from frontends to that input MLIR, or let the test suite also use tools like `iree-import-onnx` or packages like `iree-turbine`
- Existing files (source, input, output) are hosted like this: https://github.com/openxla/iree/blob/c02b89e3c7e22eff009fc318132b5ed3fe9a2d97/experimental/regression_suite/tests/pregenerated/test_llama2.py#L21-L24 https://github.com/openxla/iree/blob/c02b89e3c7e22eff009fc318132b5ed3fe9a2d97/experimental/regression_suite/tests/pregenerated/test_llama2.py#L249-L257
- Hosting in cloud storage seems excessive for individual ops like those in https://github.com/llvm/torch-mlir/tree/main/projects/pt1/python/torch_mlir_e2e_test/test_suite - seems like we could generate those in-process using PyTorch + `iree-import-onnx` (or Turbine, or torch-mlir); see the sketch after this list
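A minimal sketch of that in-process generation idea, assuming the torch package and an installed iree-import-onnx tool (the op choice and filenames are illustrative):

```python
# Sketch only: export a tiny op/layer-level test case from PyTorch to ONNX
# in-process, then import it to MLIR that iree-compile can consume.
import subprocess
import torch

# Generate model.onnx from a small PyTorch module (illustrative op choice).
torch.onnx.export(torch.nn.Linear(4, 8), (torch.randn(2, 4),), "model.onnx")

# Convert the ONNX file to MLIR using the iree-import-onnx tool.
subprocess.run(["iree-import-onnx", "model.onnx", "-o", "model.mlir"], check=True)
```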
Seems like pkgci is running 'presubmit' tests on presubmit and postsubmit, and never runs 'postsubmit' tests? https://github.com/openxla/iree/blob/c02b89e3c7e22eff009fc318132b5ed3fe9a2d97/.github/workflows/pkgci_regression_test_cpu.yml#L51-L56
The ONNX test suite looks promising:
- https://onnx.ai/onnx/repo-docs/ImplementingAnOnnxBackend.html
- https://onnx.ai/onnx/repo-docs/OnnxBackendTest.html
- (sources) https://github.com/onnx/onnx/blob/main/onnx/backend/test/case/node/matmul.py
- (generated) https://github.com/onnx/onnx/tree/main/onnx/backend/test/data/node
- (generated) https://github.com/onnx/onnx/tree/main/onnx/backend/test/data/node/test_matmul_2d
That style of having Python scripts generate model.onnx and input/output protobuf files matches what we have in experimental/regression_suite pretty closely.
There are plenty of test cases that are missing (large shapes, combinations of ops, constructs not represented in ONNX, etc.), but leveraging some part of that test suite would help with overall confidence in end to end compilation and execution support.
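For reference, the upstream generation style is roughly as follows; this is a hedged sketch (shapes and filenames are illustrative), not one of the actual case scripts under onnx/backend/test/case/node/:

```python
# Sketch of the ONNX backend test style: a Python script builds a small graph,
# saves model.onnx, and writes input/output tensors as protobuf files.
import numpy as np
import onnx
from onnx import TensorProto, helper, numpy_helper

node = helper.make_node("MatMul", inputs=["a", "b"], outputs=["c"])
graph = helper.make_graph(
    [node],
    "test_matmul_2d",
    inputs=[
        helper.make_tensor_value_info("a", TensorProto.FLOAT, [3, 4]),
        helper.make_tensor_value_info("b", TensorProto.FLOAT, [4, 3]),
    ],
    outputs=[helper.make_tensor_value_info("c", TensorProto.FLOAT, [3, 3])],
)
onnx.save(helper.make_model(graph), "model.onnx")

# Reference inputs/outputs computed with numpy, serialized like test_data_set_0/.
a = np.random.rand(3, 4).astype(np.float32)
b = np.random.rand(4, 3).astype(np.float32)
for name, array in [("input_0", a), ("input_1", b), ("output_0", a @ b)]:
    with open(f"{name}.pb", "wb") as f:
        f.write(numpy_helper.from_array(array).SerializeToString())
```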
I've made progress on this using this script: https://github.com/ScottTodd/iree/blob/tests-regression-onnx/experimental/regression_suite/scripts/run_onnx_tests.py
That uses a checkout of the ONNX repo to convert their generated test cases from this format:
onnx/backend/test/data/node/...
  test_foo/
    model.onnx
    test_data_set_0/
      input_0.pb
      output_0.pb
  test_bar/
    model.onnx
    test_data_set_0/
      input_0.pb
      output_0.pb
to IREE test cases with this format:
converted_dir_path/...
  test_foo/
    model.mlir  (torch-mlir)
    input_0.npy
    output_0.npy
    test_data_flags.txt  (flagfile with --input= --expected_output=)
  test_bar/
    model.mlir
    input_0.npy
    output_0.npy
    test_data_flags.txt
and then runs those test cases on a backend with these extra files:
converted_dir_path/...
  test_foo/
    model.mlir
    input_0.npy
    output_0.npy
    test_data_flags.txt
    + module_cpu.vmfb
    + config_cpu_flags.txt  (flagfile with --device= --module=)
using commands like this:
$ cd converted_dir_path/test_foo
$ iree-run-module --flagfile=config_cpu_flags.txt --flagfile=test_data_flags.txt
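The conversion step boils down to something like this sketch (hedged; the actual run_onnx_tests.py script linked above also handles importing to MLIR, multiple data sets, and error handling):

```python
# Sketch of the per-test conversion: read the ONNX test protobufs, write .npy
# files, and emit a flagfile that iree-run-module can consume. Illustrative only.
from pathlib import Path
import numpy as np
import onnx
from onnx import numpy_helper

def convert_test_data(src_dir: Path, dst_dir: Path) -> None:
    dst_dir.mkdir(parents=True, exist_ok=True)
    flags = []
    data_dir = src_dir / "test_data_set_0"
    for flag, pattern in [("--input", "input_*.pb"), ("--expected_output", "output_*.pb")]:
        for pb_path in sorted(data_dir.glob(pattern)):
            tensor = onnx.TensorProto()
            tensor.ParseFromString(pb_path.read_bytes())
            npy_path = dst_dir / (pb_path.stem + ".npy")
            np.save(npy_path, numpy_helper.to_array(tensor))
            flags.append(f"{flag}=@{npy_path.name}")
    (dst_dir / "test_data_flags.txt").write_text("\n".join(flags) + "\n")
```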
Early results running the full suite are here: https://gist.github.com/ScottTodd/dca2b46fdc9a0d2a2ab39509bf5478c4 (some errors are configuration errors and some are legitimate errors during compilation or execution - I haven't triaged these much yet)
Next steps:
- [ ] Push the generated .mlir, .npy, and .txt files to a git repo (possibly with LFS)
- [ ] Split the script into stages ("generate", "compile", and "run")
- [ ] Plug the "compile" and "run" steps into a CI job
- install IREE artifacts from pip / GitHub artifacts
- clone the test suite repo
- execute compile/run steps and report results
- [ ] Note which tests are passing/failing in some format (ctest / pytest XFAIL, text file, Python file, etc. etc.)
Once we have that framework, we can plug other test suites in by having them generate similar folders filled with test cases. For larger programs, we could have the files be fetchable from cloud storage or generated by Python scripts during the test, but I think we can get reasonably far with direct file inclusions (perhaps with splat parameters for larger programs).
- [ ] Push the generated .mlir, .npy, and .txt files to a git repo (possibly with LFS)
Regarding this, I'm also toying with the idea of taking a smaller number of tests and putting them directly in IREE for easier developer use. That could replace the existing e2e test folders like https://github.com/openxla/iree/tree/main/tests/e2e/stablehlo_ops, or just live alongside them.
I also tried running the https://github.com/onnx/onnx/tree/main/onnx/backend/test/data/simple tests, which are a step higher level than the "node" tests. 2/23 of those passed, but I don't think the models themselves there are particularly interesting. https://github.com/onnx/onnx/tree/main/onnx/backend/test/data/real and https://github.com/onnx/onnx/tree/main/onnx/backend/test/data/light may be useful but they would require a bit more scripting.
I have an initial sketch of a new test suite and supporting scripts up at https://github.com/nod-ai/SHARK-TestSuite/pull/68 . That will go through a few iterations of refinement, then I'd like to have some workflows / jobs in IREE that run slices of the test suite (nightly, on postsubmit, on presubmit, etc.).
More progress 😄! I have a GitHub Actions workflow and rough sketch of pytest support now. Stacking changes at https://github.com/ScottTodd/SHARK-TestSuite/tree/iree-pytest for now.
Here's a sample workflow run with a known failure to show the output style: https://github.com/ScottTodd/SHARK-TestSuite/actions/runs/8026248257/job/21928387546?pr=1 . That uses a conftest.py file (here: https://github.com/ScottTodd/SHARK-TestSuite/blob/iree-pytest/iree_tests/onnx/basic/conftest.py) to collect test cases from the test suite directory, teach pytest how to run them, and format error messages as needed.
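For illustration, the collection side can be wired up along these lines; this is a minimal sketch under the directory layout described earlier, not the actual conftest.py linked above:

```python
# conftest.py sketch: discover test case directories (each containing a
# model.mlir and test_data_flags.txt) and parametrize a test function over
# them. The test_case_dir fixture name is illustrative.
from pathlib import Path

TEST_SUITE_ROOT = Path(__file__).parent

def pytest_generate_tests(metafunc):
    # Any test that requests a `test_case_dir` argument gets one parametrized
    # instance per discovered test case directory.
    if "test_case_dir" in metafunc.fixturenames:
        case_dirs = sorted(p.parent for p in TEST_SUITE_ROOT.glob("**/model.mlir"))
        metafunc.parametrize("test_case_dir", case_dirs, ids=lambda p: p.name)
```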
Next steps
Parameterize tests across backends / devices
I'm thinking about writing a config file (JSON/YAML/Python/txt format TBD... protobuf? 👀 hah) and then loading config files during test collection.
You could then say things in a config file like:
- "Compiler flags for this config are
--iree-hal-target-backends=llvm-cpu" - "Runtime flags for this config are
--device=local-task" - "Include these tests" (or all)
- "Exclude these other tests"
- "Expect these tests to fail"
- https://pytest.org/en/7.4.x/how-to/skipping.html -- may also want to support conditional xfail, e.g. based on operating system
- "Only run these tests on hardware with these features" (or maybe this is handled up a level - don't pass a config file that isn't supported)
Compile then run
I only have the pytest code using iree-compile right now. Also need iree-run-module (and later iree-benchmark-module?)
I have two approaches in mind and will need to learn more about pytest / fixtures to proceed:
- If a test is expected to compile successfully, create a single test case that compiles then runs. Otherwise, create a single test case that only tries to compile.
- Create two test cases, having the second test case depend on the first. If compilation fails as expected then the runtime test could be marked intentionally skipped?
- See how `@pytest.fixture` is used in files like `experimental/regression_suite/tests/pregenerated/test_llama2.py`
I'd like for test reports to show accurate numbers for pass/fail/skipped even across stages, if possible. Need to think a bit about how totals are shown (e.g. 800/1000 import, 600/800 compile, 400/600 run correctly -- that would be 400/1000 "working e2e")
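Here's a rough sketch of approach (1), using the hypothetical test_case_dir parametrization from the conftest sketch above and illustrative flags:

```python
# Sketch of the "compile then run in one test case" approach. Flags and fixture
# names are illustrative; expected-failure handling is omitted.
import subprocess

def test_compile_and_run(test_case_dir, tmp_path):
    vmfb_path = tmp_path / "module.vmfb"
    subprocess.run(
        [
            "iree-compile",
            str(test_case_dir / "model.mlir"),
            "--iree-hal-target-backends=llvm-cpu",
            "-o",
            str(vmfb_path),
        ],
        check=True,
    )
    # Run from the test case directory so relative paths in the flagfile resolve.
    subprocess.run(
        [
            "iree-run-module",
            f"--module={vmfb_path}",
            "--device=local-task",
            "--flagfile=test_data_flags.txt",
        ],
        check=True,
        cwd=test_case_dir,
    )
```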
Got a simple config file (written in JSON for now, can reach for something more complex as needed): https://gist.github.com/ScottTodd/b93c81aa95fdc1d644596b931ceac05e Then ran with https://pypi.org/project/pytest-xdist/:
D:\dev\projects\SHARK-TestSuite (iree-pytest)
(.venv) λ pytest iree_tests -n auto
============================= test session starts =============================
platform win32 -- Python 3.11.2, pytest-8.0.2, pluggy-1.4.0
rootdir: D:\dev\projects\SHARK-TestSuite
plugins: xdist-3.5.0
64 workers [1077 items]
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx..xxxx.x. [ 6%]
xxxxxxxx.xxxxxxxxxx.xxxxxxxxxxxxxxxxxxxxx..x.xxx.xxxx.x......x.x.xxxxx.x [ 13%]
x...x..xx.xx.x......x..xxxxxx.x....xx.xx.x.xx...xxxxxx.xx.xx.xxxxxxxxxxx [ 20%]
xxxx.xxxxxx..xx.xxxx.....xxxxxxxxxx....x....x...x..........xxxxxx.xxx.x. [ 26%]
..x.......x.x......xx...xx.xxxx.xxxx.xxxx..x...x..x....x.x.x..xxxxxxxxxx [ 33%]
xx.xxxxx..x..x..xx..xx.....x.x..x.x....xxxxx..x.x...xx.x....x..x.xxxxxxx [ 40%]
x..xx.x.x...xx..xxx.x.xx.x.x...xx.xxx.xxxx.x.x.x..xxx..x.....xxxxxxxxxxx [ 46%]
xxx.xxx......x...xxxxxxxxx...........xxxxxxxxxxxxxxxxxx.xx.x.x.xxxxxx... [ 53%]
..xxxxx...xxxx..xx......x.xxx.xxxxx.xxx.x.xxxxxxxxxxxxxxx.xxxxxxxxxx.... [ 60%]
....x.xxxxxxxxxxxxxxx.xxxxxx.xx.xxx.xxx.xxxxx.xx....xxxx.x.xxxxxxxx..xxx [ 66%]
....xxxxxx.xx...x.x...xxxxx.xx.xxx.xxxxx....xxxxxx.xxxxxxxx.x.xxxxxxxxx.xxx [ 73%]
xxxxxx.xxxx..xxxx..x..xxxxxxx.xxxxx.xx...xxxxxxxxxxxxxxx.x.xxxxxxxxxxxxx [ 79%]
xxxxxxxxxxxx.xxx.xxxxxxxxxxxxxxxxxxxxxxxxx.xxxx.xxxxxxxxx..x.xxxxxxxxxxxx [ 87%]
xxxxxxxxxx..xx.xx.xxxxxxxxxxxxxxxxxxxxxxxx.xxxxxx..xxx.....x.xxxxx.....x [ 94%]
xxx.xxx.................xxxxxxxx...xxx..x.xxx................... [100%]
====================== 360 passed, 717 xfailed in 20.57s ======================
🚀
Added iree-run-module tests and a whole bunch of quality of life tweaks to the pytest configuration. Now works on GitHub Actions too: https://github.com/ScottTodd/SHARK-TestSuite/actions/runs/8056608699/job/22006010154
============================= test session starts ==============================
platform linux -- Python 3.11.8, pytest-8.0.2, pluggy-1.4.0
rootdir: /home/runner/work/SHARK-TestSuite/SHARK-TestSuite
plugins: xdist-3.5.0
created: 4/4 workers
4 workers [1047 items]
xx.x..x.xxxxxxxx.xxxxxxxxxx.x.xx.x.xxx.xx.xxx.xxx..xxxxxxx.x.xx..x.x.x.. [ 6%]
..xxxx.xxxx..xxxx..xx.xxxxxxxxxxxxxx.xxxxx.xx.xx.xxxxxxxxxxxx.xxxxxx.xxx [ 13%]
xxxx.xxxxxxxxxxxxxxxxxx.xx.xxxxxxx.xxxxxx.xxxxx.xx..xx.x..xxxx...xx.xxx. [ 20%]
xxxxxxx.xxxxxxxx.xxxxxx..xxxxxx..xxxxxxxxxxx.xxxxxxxxxxx.xxx.xxx.xxxx.x. [ 27%]
xxxxx.x.xxxxx.xxxxxx.xx.xx.xxxxx.xxx.xx.xx.x.x.x.xxxx.xx.xxxxxx..xxx.xxx [ 34%]
..xxxxxx..xxxxxxxxxx.xxxxxxx.xx.xxxx.xxx.xxxxxxx.xxxxxx..x.xxx.xxx.xxxxx [ 41%]
xx.xxxxxxxxxx.xxxxxx.xxxxx.xxxxxxx.xxxx.xxxx.xxxxxx.xxxxx.xxxxx.xx.xx... [ 48%]
x...x.xx.x.x.xx.x..x...x...x..x.xx.......xx.x.x.xx.....x....xx..xx.xxxxx [ 55%]
xxx.xxxx.x.xxxx..xxxx..xxxxxx.x.xxxxxxxxx.xxxxx.xxx.xxxxxx.x.xxxxxxxxxxx [ 61%]
.xxxxxxxxxxx.xxxxxxxxxxxx.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.xxxxxxxxx [ 68%]
x.xxxxxxxxxxx.xxxxxxxxxx.xxxxxxxxxxx.xxxxx.xxxxxxxxxxx.xx.xxx.xxxxxxx.xx [ 75%]
xx.xxx.x.xxxxx.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.xxxxxxxxxxxxxxxxxxxxxxxxx [ 82%]
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.x.xxx.xxx.xxx..xxx..xxxx....xx...xx...xx.x [ 89%]
.xxxxxx.xxx.xx.xxxx.x.xxxxxxx.xxxxx.xx.x.xx...xxxxxx.x.x.x....xx..xxxxxx [ 96%]
.x.xxxxxx.x.xxxxxxxxxxxxxxxx..x....... [100%]
================= 238 passed, 809 xfailed in 80.27s (0:01:20) ==================
Total CI time 1m45s on a standard GitHub-hosted runner (ubuntu-latest).
The StableHLO tests from here: https://github.com/openxla/stablehlo/tree/main/stablehlo/tests/interpret could be used if we wrote a converter script / (MLIR pass?) to replace the check ops with inputs and expected outputs (or just 0 inputs + return and check expected outputs)
e.g. https://github.com/openxla/stablehlo/blob/main/stablehlo/tests/interpret/add.mlir
// RUN: stablehlo-translate --interpret -split-input-file %s
func.func @add_op_test_si4() {
%0 = stablehlo.constant dense<[0, 1, 2, -3, 0]> : tensor<5xi4>
%1 = stablehlo.constant dense<[-8, -1, 2, -3, 7]> : tensor<5xi4>
%2 = stablehlo.add %0, %1 : tensor<5xi4>
check.expect_eq_const %2, dense<[-8, 0, 4, -6, 7]> : tensor<5xi4>
func.return
}
// -----
func.func @add_op_test_ui4() {
%0 = stablehlo.constant dense<[0, 2]> : tensor<2xui4>
%1 = stablehlo.constant dense<[15, 3]> : tensor<2xui4>
%2 = stablehlo.add %0, %1 : tensor<2xui4>
check.expect_eq_const %2, dense<[15, 5]> : tensor<2xui4>
func.return
}
(could also do the same for IREE's existing 'check' tests)
Good idea - a pass converting out of that form to the same kind of thing you're doing with ONNX seems like the minimal change set, and a pattern that even more test suites could follow.
As these new tests prove themselves, I'm actually keeping an eye on the check dialect in IREE (https://iree.dev/reference/mlir-dialects/Check/) and whether it and the associated infrastructure are worth keeping around. We've long wanted to refactor it in various ways. I think the main thing the 'check' dialect provides that iree-run-module --expected_output can't is the ability to check values at any point within a function. I don't think we ever actually use that full power though (and I'd be a bit concerned if we did - it's not a pattern that real programs would likely use). The 'check' tests in CMake do provide wide coverage across platforms and configurations, but we can get that with a (hopefully) simpler setup using iree-run-module and pytest.
Initial work has landed and tests are running. Spotted one apparent flake already: https://github.com/openxla/iree/actions/runs/8086418421/job/22096503710?pr=16603#step:9:41 (compiler crash on FAILED SHARK-TestSuite/iree_tests/onnx/node/generated/test_layer_normalization_2d_axis1/model.mlir::gpu_rocm_rdna3)
Current status:
- Most ONNX ops were imported successfully from the upstream test suite. There are 48 import failures: https://github.com/nod-ai/SHARK-TestSuite/blob/main/iree_tests/onnx/node/import_failures.txt (a mix of sequence types, optional types, and a few edge cases), which are not included in these CI runs (as importing is currently an offline step)
- ONNX op tests are running on presubmit, now with working XFAIL/XPASS behavior
- Config files with XFAIL lists are a bit of a chore to update, as they are ~900 lines (example file) and moving test cases between categories can require re-running the test suite a few times to verify that every case settled into its proper place
Next, I'll start pulling more "full program" tests into the test suite. See also a recent discussion here on Discord
- I'll likely start with https://github.com/nod-ai/SHARK-TestSuite/tree/main/e2eshark/pytorch/models, but I could also pull from the models that IREE already runs in-tree benchmarks with.
- I'm going to try using .irpa parameter files (https://iree.dev/guides/parameters/) with splats to keep the file sizes small enough for including directly in the SHARK-TestSuite git repository (using git LFS). For testing execution correctness, that will need some extra validation, since we won't be able to directly compare with a reference backend/framework after converting weights into splats
Going to call this fixed now. We're continuing to add:
- test cases under https://github.com/nod-ai/SHARK-TestSuite/tree/main/iree_tests
- configs at https://github.com/openxla/iree/tree/main/build_tools/pkgci/external_test_suite
- CI jobs at e.g. https://github.com/openxla/iree/blob/main/.github/workflows/pkgci_regression_test_cpu.yml