Need E2E ONNX op tests in CI
Problem: We don't test ONNX ops in our CI
- sometimes bad ONNX lowerings slip past code review
- SHARK-TestSuite catches some of them, but it's easy to improperly write test cases that don't actually run
- ONNX lowerings regress and get missed because SHARK-TestSuite isn't run as part of the CI
- while IREE, a downstream project, does test ONNX nodes numerically, it's sometimes hard to tell whether an ONNX failure is caused by something in IREE or in torch-mlir
For example, we have these ONNX ops that have made their way into torch-mlir but ultimately don't run in IREE's test suites:
- LSTM (I wrote this and thought it worked based on SHARK-TestSuite!)
- STFT
- HardMax (and many, many more!)
If we have some ONNX node tests in the torch-mlir CI:
- if an op works here, we know that any remaining failure is a downstream problem in IREE
- if an op doesn't work, we know exactly why, because the error messages and failures will be right there in the CI
- if an op regresses, we know exactly who & what is responsible
Problems with existing solutions
Our existing test-suite
We have an existing test-suite in projects/pt1 that imports a lot of PyTorch ops and performs numerical comparison against native PyTorch via a variety of paths, including ONNX.
There are two main problems with this:
- some ONNX ops don't have PyTorch analogues and cannot be tested here
- for ops that represent layers and carry weights, the existing testing infrastructure generates the weights separately for PyTorch and for torch-mlir, so those test cases always fail numerically.
Testing downstream in IREE
Our downstream project IREE does run good torch node tests, but it reports many of the ONNX ops that we've lowered as failing. I haven't found a way to view the error messages, and it's hard to tell whether these failures are due to IREE or to torch-mlir.
Proposed solution:
We should add a CI script and some testing scripts to torch-mlir that:
- download models, test inputs, and reference outputs from the official ONNX op test-suite
  - @scotttodd has it converted to MLIR and stored in SHARK-TestSuite here
- run these test cases
- report the results on CI (see the sketch below for what such a runner could look like)
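For concreteness, here is a minimal sketch of the per-test comparison. It only assumes the directory layout of the official ONNX node tests (model.onnx plus test_data_set_*/input_*.pb and output_*.pb); `run_with_torch_mlir` is a hypothetical placeholder for whichever compile-and-run path we end up using, not an existing API:

```python
import glob
import os

import numpy as np
import onnx
from onnx import numpy_helper


def load_tensors(paths):
    """Read serialized TensorProto files (input_*.pb / output_*.pb) as numpy arrays."""
    tensors = []
    for path in sorted(paths):
        proto = onnx.TensorProto()
        with open(path, "rb") as f:
            proto.ParseFromString(f.read())
        tensors.append(numpy_helper.to_array(proto))
    return tensors


def run_node_test(test_dir, run_with_torch_mlir):
    """Run one official ONNX node test case against a torch-mlir execution path.

    `run_with_torch_mlir(model, inputs)` is a placeholder: it should import the
    ONNX model through torch-mlir, execute it, and return a list of numpy outputs.
    """
    model = onnx.load(os.path.join(test_dir, "model.onnx"))
    for data_set in sorted(glob.glob(os.path.join(test_dir, "test_data_set_*"))):
        inputs = load_tensors(glob.glob(os.path.join(data_set, "input_*.pb")))
        expected = load_tensors(glob.glob(os.path.join(data_set, "output_*.pb")))
        actual = run_with_torch_mlir(model, inputs)
        for exp, act in zip(expected, actual):
            np.testing.assert_allclose(act, exp, rtol=1e-3, atol=1e-5)
```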
Maybe we can use onnxruntime to directly plug into ONNX's own op tests and not have to write additional data / model preprocessing scripts:
https://github.com/nod-ai/onnxruntime/tree/iree_ep/onnxruntime/core/providers/iree
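As a rough illustration of what "plugging into ONNX's tests" looks like, ONNX's own `onnx.backend.test.BackendTest` runner generates test cases for any backend that implements the ONNX backend API. The sketch below uses the stock onnxruntime backend as a stand-in; the assumption is that a backend routed through torch-mlir would expose the same interface:

```python
# Sketch: expose the official ONNX node tests as pytest/unittest cases for a backend.
import onnx.backend.test
import onnxruntime.backend as ort_backend  # stand-in; a torch-mlir-based backend would go here

# BackendTest discovers the official ONNX op tests and generates one test case
# per op / test data set for the given backend.
backend_test = onnx.backend.test.BackendTest(ort_backend, __name__)

# Make the generated cases collectable by pytest / unittest.
globals().update(backend_test.test_cases)
```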
> Our downstream project IREE does run good torch node tests, but it reports many of the ONNX ops that we've lowered as failing. I haven't found a way to view the error messages, and it's hard to tell whether these failures are due to IREE or to torch-mlir.
I archived some historical logs here:
- https://gist.github.com/ScottTodd/1a02531cc76a3b8566428207e39d1870
- https://gist.github.com/ScottTodd/ecc9c57c01bfc5e996a15cdd38df6a9c
At the time I decided that the full output would be too noisy to include on all CI runs. The list of failures may be small enough now to revise that decision. Generally, you can run pytest with -rA (https://docs.pytest.org/en/stable/how-to/output.html) to see output from XFAIL'd tests, or run with --ignore-xfails (see other custom flags in the conftest.py file).
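For reference, a flag like that only needs a few lines of pytest hooks. This is a generic sketch (not necessarily how SHARK-TestSuite's conftest.py actually implements --ignore-xfails), reusing pytest's built-in runxfail behavior:

```python
# conftest.py (sketch)
def pytest_addoption(parser):
    parser.addoption(
        "--ignore-xfails",
        action="store_true",
        default=False,
        help="Run tests marked xfail as if they were unmarked, reporting real pass/fail.",
    )


def pytest_configure(config):
    if config.getoption("--ignore-xfails"):
        # Same effect as pytest's built-in --runxfail flag.
        config.option.runxfail = True
```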
@rsuderman do you have some references I could look at on how to run torch-mlir and get numerical results without using IREE?
We had some good experience with the onnx.reference evaluator in cases where onnxruntime lacked support for some ops or dtypes (e.g. bfloat16).
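For context, the reference evaluator ships with onnx itself and runs a model purely in numpy, which makes it handy for generating golden outputs when onnxruntime can't. A minimal usage sketch, where "model.onnx" and the input name "X" are placeholders for an actual test case:

```python
import numpy as np
from onnx.reference import ReferenceEvaluator  # available in onnx >= 1.13

# "model.onnx" and the input name "X" are placeholders for a real test case.
sess = ReferenceEvaluator("model.onnx")
x = np.random.rand(2, 3).astype(np.float32)
outputs = sess.run(None, {"X": x})  # list of numpy arrays, one per model output
print(outputs[0])
```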
@renxida Hi! When you say that these ops fail, do you expect them to have linalg lowerings?
@vinayakdsci yup! I'm expecting them to work e2e.
In an ideal world, instead of pushing many ops through one layer at a time and then coming back later to push them through the next layer while trying to remember how our old implementations work, I'd like us to push each op all the way through before moving on to the next one.
@renxida I agree :) But I just wanted to point out that many ops could be failing because of missing torch-to-linalg lowerings. And don't worry, I am sure we will be able to push them through!