[MHLO] Init end-to-end unit tests
See RFC https://github.com/llvm/torch-mlir/issues/999
Co-authored-by: Bairen Yi
Co-authored-by: Jiawei Wu
Co-authored-by: Tianyou Guo
Co-authored-by: Xu Yan
Co-authored-by: Ziheng Jiang
As @silvasean mentioned earlier (https://github.com/llvm/torch-mlir/pull/1025#issuecomment-1178465875), this PR adds the MHLO end-to-end unit tests to CI. It lowers MHLO to Linalg and runs the result on the Linalg-on-Tensors backend. Please review, @silvasean @ZihengJiang @Vremold.
@silvasean I can't reproduce the CI failure locally with the following environment:
Collecting environment information...
PyTorch version: 1.13.0.dev20220814+cpu
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A
OS: Ubuntu Kinetic Kudu (development branch) (x86_64)
GCC version: (Ubuntu 11.3.0-5ubuntu1) 11.3.0
Clang version: 14.0.6-2
CMake version: version 3.24.0
Libc version: glibc-2.35
Python version: 3.10.5 (main, Jun 8 2022, 09:26:22) [GCC 11.3.0] (64-bit runtime)
Python platform: Linux-5.4.0-108-generic-x86_64-with-glibc2.35
Is CUDA available: False
CUDA runtime version: No CUDA
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Versions of relevant libraries:
[pip3] numpy==1.23.2
[pip3] torch==1.13.0.dev20220814+cpu
[pip3] torchvision==0.14.0.dev20220814+cpu
[conda] Could not collect
My testing script is:
cmake -GNinja -Bbuild \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_C_COMPILER=clang \
-DCMAKE_CXX_COMPILER=clang++ \
-DCMAKE_C_COMPILER_LAUNCHER=ccache \
-DCMAKE_CXX_COMPILER_LAUNCHER=ccache \
-DCMAKE_LINKER=lld \
-DLLVM_ENABLE_ASSERTIONS=ON \
-DLLVM_ENABLE_PROJECTS=mlir \
-DLLVM_EXTERNAL_PROJECTS="torch-mlir;torch-mlir-dialects" \
-DLLVM_EXTERNAL_TORCH_MLIR_SOURCE_DIR="$PWD" \
-DLLVM_EXTERNAL_TORCH_MLIR_DIALECTS_SOURCE_DIR="${PWD}/externals/llvm-external-projects/torch-mlir-dialects" \
-DLLVM_TARGETS_TO_BUILD=host \
-DMLIR_ENABLE_BINDINGS_PYTHON=ON \
-DTORCH_MLIR_ENABLE_LTC=ON \
-DTORCH_MLIR_USE_INSTALLED_PYTORCH="ON" \
-DPython3_EXECUTABLE="$(which python)" \
externals/llvm-project/llvm
cmake --build build
bash build_tools/write_env_file.sh
bash tools/torchscript_e2e_test.sh -c mhlo --verbose 2>&1 | tee test.log
@fortianyou I'm about to send a PR that dockerizes CI. This should help with local reproducers. Once that's out, could you try to rebase on that and then run these tests locally? It should hopefully eliminate any environmental issues and enable a robust reproducer.
@fortianyou Here it is: https://github.com/llvm/torch-mlir/pull/1225. Please LMK once you rebase if you are able to repro locally.
> Once that's out, could you try to rebase on that and then run these tests locally?
@sjain-stanford Thanks! I would love to do that.
> I can't reproduce the CI failure locally with the following environments
The CI is failing because one of the e2e tests is failing on an assertion. For some reason this causes a cascade of failures in the CI, which can make it hard to tell what is going wrong. (See the assertion error here: https://github.com/llvm/torch-mlir/runs/7840097389?check_suite_focus=true#step:12:9)
If you run the tests sequentially, you should also see the assertion error locally, and it should crash the entire program, allowing you to debug further.
When I run python -m e2e_testing.torchscript.main -v -c mhlo -s locally, I get the error:
python: /usr/local/google/home/ramiroleal/torch-mlir/externals/llvm-project/llvm/include/llvm/ADT/SmallVector.h:280: llvm::SmallVectorTemplateCommon::reference llvm::SmallVectorTemplateCommon<long>::operator[](llvm::SmallVectorTemplateCommon::size_type) [T = long]: Assertion `idx < size()' failed.
fish: Job 1, 'python -m e2e_testing.torchscri…' terminated by signal SIGABRT (Abort)
Here is the relevant backtrace:
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:49
#1 0x00007ffff7c35546 in __GI_abort () at abort.c:79
#2 0x00007ffff7c3542f in __assert_fail_base (fmt=0x7ffff7dabdf8 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x7fff3e0f9710 "idx < size()",
file=0x7fff3ea0e118 "/usr/local/google/home/ramiroleal/torch-mlir/externals/llvm-project/llvm/include/llvm/ADT/SmallVector.h", line=280, function=<optimized out>) at assert.c:92
#3 0x00007ffff7c44222 in __GI___assert_fail (assertion=0x7fff3e0f9710 "idx < size()",
file=0x7fff3ea0e118 "/usr/local/google/home/ramiroleal/torch-mlir/externals/llvm-project/llvm/include/llvm/ADT/SmallVector.h", line=280,
function=0x7fff3dd5c227 "llvm::SmallVectorTemplateCommon::reference llvm::SmallVectorTemplateCommon<long>::operator[](llvm::SmallVectorTemplateCommon::size_type) [T = long]")
at assert.c:101
#4 0x00007fff41419c59 in llvm::SmallVectorTemplateCommon<long, void>::operator[] (this=0x7fffffff91f0, idx=18446744073709551614)
at /usr/local/google/home/ramiroleal/torch-mlir/externals/llvm-project/llvm/include/llvm/ADT/SmallVector.h:280
#5 0x00007fff4712709d in mlir::mhlo::ConcatenateOp::inferReturnTypes (location=..., operands=..., attributes=..., regions=..., inferredReturnTypes=...)
at /usr/local/google/home/ramiroleal/torch-mlir/externals/mlir-hlo/lib/Dialect/mhlo/IR/hlo_ops.cc:3778
#6 0x00007fff4717e53a in mlir::mhlo::ConcatenateOp::build (odsBuilder=..., odsState=..., val=..., dimension=18446744073709551614)
at tools/torch-mlir/mlir-hlo/include/mlir-hlo/Dialect/mhlo/IR/hlo_ops.cc.inc:6846
#7 0x00007fff46ebd4e3 in mlir::OpBuilder::create<mlir::mhlo::ConcatenateOp, mlir::ValueRange, unsigned long> (this=0x7fffffffa378, location=...,
args=@0x7fffffff9698: 18446744073709551614, args=@0x7fffffff9698: 18446744073709551614)
at /usr/local/google/home/ramiroleal/torch-mlir/externals/llvm-project/llvm/../mlir/include/mlir/IR/Builders.h:455
#8 0x00007fff46ebd3ef in mlir::RewriterBase::replaceOpWithNewOp<mlir::mhlo::ConcatenateOp, mlir::ValueRange, unsigned long> (this=0x7fffffffa370, op=0x55555a1d6e30,
args=@0x7fffffff9698: 18446744073709551614, args=@0x7fffffff9698: 18446744073709551614)
at /usr/local/google/home/ramiroleal/torch-mlir/externals/llvm-project/llvm/../mlir/include/mlir/IR/PatternMatch.h:452
#9 0x00007fff46eae374 in (anonymous namespace)::ConvertAtenOp<mlir::torch::Torch::AtenCatOp>::matchAndRewrite (this=0x55555a1bb6d0, op=..., adaptor=..., rewriter=...)
at /usr/local/google/home/ramiroleal/torch-mlir/lib/Conversion/TorchToMhlo/Basic.cpp:999
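One observation on the backtrace: the value dimension=18446744073709551614 in frames #6 through #8 equals 2^64 - 2, which is what -2 becomes when it is reinterpreted as an unsigned 64-bit integer. That suggests, though the backtrace alone does not confirm it, that a negative concatenation dimension reached ConcatenateOp without first being normalized against the tensor rank. The sketch below only illustrates the arithmetic; the helper names and the rank of 3 are hypothetical and are not taken from the torch-mlir sources.

```python
def as_uint64(value: int) -> int:
    """Reinterpret a signed integer as an unsigned 64-bit value (two's complement)."""
    return value % (1 << 64)


def normalize_dim(dim: int, rank: int) -> int:
    """Map a possibly negative dimension index into the range [0, rank)."""
    if not -rank <= dim < rank:
        raise IndexError(f"dim {dim} is out of range for rank {rank}")
    return dim + rank if dim < 0 else dim


# A dim of -2 passed through an unsigned 64-bit parameter shows up as the
# value seen in the backtrace.
print(as_uint64(-2))         # 18446744073709551614

# Normalizing first (hypothetical rank of 3) gives a valid index instead.
print(normalize_dim(-2, 3))  # 1
```

If this reading is right, indexing a shape vector with the unnormalized value would trip the `idx < size()` assertion shown in frame #4.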
Note: Other tests also trigger assertion errors. If you run the tests in parallel, the assertion error messages should appear before the test results are printed.
Let me know if you're able to reproduce things.
> For some reason this causes a cascade of failures in the CI, which can make it hard to tell what is going wrong.
I've seen better failures when disabling multiprocessing (with the -s flag you use above)
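To illustrate why running cases one at a time makes an abort easier to attribute, here is a minimal sketch that runs each case in its own child process and reports the signal that killed it. It is not how the torch-mlir e2e harness works; the case names and snippets are hypothetical stand-ins, and the negative return code convention assumes a POSIX system.

```python
import signal
import subprocess
import sys

# Hypothetical stand-ins for test bodies. The second one aborts the
# interpreter, much like a failed C++ assertion inside a compiler pass would.
SNIPPETS = {
    "passing_case": "print('ok')",
    "aborting_case": "import os; os.abort()",
}


def run_isolated(code: str) -> str:
    """Run one snippet in a fresh interpreter and describe how it ended."""
    proc = subprocess.run(
        [sys.executable, "-c", code], capture_output=True, text=True
    )
    if proc.returncode < 0:
        # On POSIX, a negative return code is the negated signal number.
        return f"killed by {signal.Signals(-proc.returncode).name}"
    return "passed" if proc.returncode == 0 else f"exit code {proc.returncode}"


results = {name: run_isolated(code) for name, code in SNIPPETS.items()}
for name, outcome in results.items():
    print(f"{name}: {outcome}")
```

Because each case has its own process, the abort is tied to exactly one name instead of taking down unrelated cases that share a worker.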
(apologies for the accidental closing with a comment :P )