
[MHLO] Init end-to-end unit tests

tanyokwok opened this issue 3 years ago · 8 comments

See RFC https://github.com/llvm/torch-mlir/issues/999

Co-authored-by: Bairen Yi [email protected]
Co-authored-by: Jiawei Wu [email protected]
Co-authored-by: Tianyou Guo [email protected]
Co-authored-by: Xu Yan [email protected]
Co-authored-by: Ziheng Jiang [email protected]

tanyokwok · Aug 14 '22 15:08

As @silvasean mentioned earlier in https://github.com/llvm/torch-mlir/pull/1025#issuecomment-1178465875, this PR adds the MHLO end-to-end unit tests to CI. It lowers MHLO to Linalg and runs the result on the Linalg-on-Tensors backend. Please review: @silvasean @ZihengJiang @Vremold
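For reference, here is a minimal sketch of how an MHLO e2e backend can be wired on top of the existing reference backend. The class name, import paths, and the single-pass pipeline below are assumptions for illustration and not necessarily what this PR does; a real pipeline will likely need more passes than hlo-legalize-to-linalg alone.

# Hypothetical sketch: an e2e test backend that legalizes MHLO to
# Linalg-on-Tensors and then reuses the reference backend. Import paths,
# the class name, and the pass pipeline are assumptions, not this PR's code.
from torch_mlir.compiler_utils import run_pipeline_with_repro_report
from torch_mlir_e2e_test.linalg_on_tensors_backends.refbackend import (
    RefBackendLinalgOnTensorsBackend,
)


class LinalgOnTensorsMhloBackend:
    def __init__(self):
        self.refbackend = RefBackendLinalgOnTensorsBackend()

    def compile(self, imported_module):
        # Legalize MHLO ops to Linalg on tensors (a real pipeline likely
        # needs additional shape/control-flow passes), then hand the module
        # to the reference backend's compilation pipeline.
        run_pipeline_with_repro_report(
            imported_module,
            "func.func(hlo-legalize-to-linalg)",
            "Lowering MHLO to Linalg-on-Tensors",
        )
        return self.refbackend.compile(imported_module)

    def load(self, module):
        # Execution is delegated entirely to the reference backend.
        return self.refbackend.load(module)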

tanyokwok · Aug 15 '22 03:08

@silvasean I can't reproduce the CI failure locally with the following environment:

Collecting environment information...
PyTorch version: 1.13.0.dev20220814+cpu
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: Ubuntu Kinetic Kudu (development branch) (x86_64)
GCC version: (Ubuntu 11.3.0-5ubuntu1) 11.3.0
Clang version: 14.0.6-2
CMake version: version 3.24.0
Libc version: glibc-2.35

Python version: 3.10.5 (main, Jun  8 2022, 09:26:22) [GCC 11.3.0] (64-bit runtime)
Python platform: Linux-5.4.0-108-generic-x86_64-with-glibc2.35
Is CUDA available: False
CUDA runtime version: No CUDA
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.23.2
[pip3] torch==1.13.0.dev20220814+cpu
[pip3] torchvision==0.14.0.dev20220814+cpu
[conda] Could not collect

My testing script is:

 cmake -GNinja -Bbuild \
    -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_C_COMPILER=clang \
    -DCMAKE_CXX_COMPILER=clang++ \
    -DCMAKE_C_COMPILER_LAUNCHER=ccache \
    -DCMAKE_CXX_COMPILER_LAUNCHER=ccache \
    -DCMAKE_LINKER=lld \
    -DLLVM_ENABLE_ASSERTIONS=ON \
    -DLLVM_ENABLE_PROJECTS=mlir \
    -DLLVM_EXTERNAL_PROJECTS="torch-mlir;torch-mlir-dialects" \
    -DLLVM_EXTERNAL_TORCH_MLIR_SOURCE_DIR="$PWD" \
    -DLLVM_EXTERNAL_TORCH_MLIR_DIALECTS_SOURCE_DIR="${PWD}/externals/llvm-external-projects/torch-mlir-dialects" \
    -DLLVM_TARGETS_TO_BUILD=host \
    -DMLIR_ENABLE_BINDINGS_PYTHON=ON \
    -DTORCH_MLIR_ENABLE_LTC=ON \
    -DTORCH_MLIR_USE_INSTALLED_PYTORCH="ON" \
    -DPython3_EXECUTABLE="$(which python)" \
    externals/llvm-project/llvm

cmake --build build

bash build_tools/write_env_file.sh
bash tools/torchscript_e2e_test.sh -c mhlo --verbose 2>&1 | tee test.log

tanyokwok · Aug 15 '22 06:08

@fortianyou: I'm about to send a PR that dockerizes the CI, which should help with local reproduction. Once that's out, could you try to rebase on that and then run these tests locally? It should hopefully eliminate any environmental issues and give us a robust reproducer.

sjain-stanford · Aug 15 '22 15:08

@fortianyou Here it is: https://github.com/llvm/torch-mlir/pull/1225. Please let me know once you rebase whether you are able to repro locally.

sjain-stanford · Aug 15 '22 15:08

> Once that's out, could you try to rebase on that and then run these tests locally?

@sjain-stanford Thanks! I would love to do that.

tanyokwok · Aug 15 '22 16:08

> I can't reproduce the CI failure locally with the following environment

The CI is failing because one of the e2e tests hits an assertion. For some reason this causes a cascade of failures in the CI, which can make it hard to tell what is going wrong. (See the assertion error here: https://github.com/llvm/torch-mlir/runs/7840097389?check_suite_focus=true#step:12:9)

If you run the tests sequentially, you should also see the assertion error locally, and it should crash the entire program, allowing you to debug further.

When I run python -m e2e_testing.torchscript.main -v -c mhlo -s locally, I get the error:

python: /usr/local/google/home/ramiroleal/torch-mlir/externals/llvm-project/llvm/include/llvm/ADT/SmallVector.h:280: llvm::SmallVectorTemplateCommon::reference llvm::SmallVectorTemplateCommon<long>::operator[](llvm::SmallVectorTemplateCommon::size_type) [T = long]: Assertion `idx < size()' failed.
fish: Job 1, 'python -m e2e_testing.torchscri…' terminated by signal SIGABRT (Abort)

Here is the relevant backtrace

#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:49
#1  0x00007ffff7c35546 in __GI_abort () at abort.c:79
#2  0x00007ffff7c3542f in __assert_fail_base (fmt=0x7ffff7dabdf8 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x7fff3e0f9710 "idx < size()", 
    file=0x7fff3ea0e118 "/usr/local/google/home/ramiroleal/torch-mlir/externals/llvm-project/llvm/include/llvm/ADT/SmallVector.h", line=280, function=<optimized out>) at assert.c:92
#3  0x00007ffff7c44222 in __GI___assert_fail (assertion=0x7fff3e0f9710 "idx < size()", 
    file=0x7fff3ea0e118 "/usr/local/google/home/ramiroleal/torch-mlir/externals/llvm-project/llvm/include/llvm/ADT/SmallVector.h", line=280, 
    function=0x7fff3dd5c227 "llvm::SmallVectorTemplateCommon::reference llvm::SmallVectorTemplateCommon<long>::operator[](llvm::SmallVectorTemplateCommon::size_type) [T = long]")
    at assert.c:101
#4  0x00007fff41419c59 in llvm::SmallVectorTemplateCommon<long, void>::operator[] (this=0x7fffffff91f0, idx=18446744073709551614)
    at /usr/local/google/home/ramiroleal/torch-mlir/externals/llvm-project/llvm/include/llvm/ADT/SmallVector.h:280
#5  0x00007fff4712709d in mlir::mhlo::ConcatenateOp::inferReturnTypes (location=..., operands=..., attributes=..., regions=..., inferredReturnTypes=...)
    at /usr/local/google/home/ramiroleal/torch-mlir/externals/mlir-hlo/lib/Dialect/mhlo/IR/hlo_ops.cc:3778
#6  0x00007fff4717e53a in mlir::mhlo::ConcatenateOp::build (odsBuilder=..., odsState=..., val=..., dimension=18446744073709551614)
    at tools/torch-mlir/mlir-hlo/include/mlir-hlo/Dialect/mhlo/IR/hlo_ops.cc.inc:6846
#7  0x00007fff46ebd4e3 in mlir::OpBuilder::create<mlir::mhlo::ConcatenateOp, mlir::ValueRange, unsigned long> (this=0x7fffffffa378, location=..., 
    args=@0x7fffffff9698: 18446744073709551614, args=@0x7fffffff9698: 18446744073709551614)
    at /usr/local/google/home/ramiroleal/torch-mlir/externals/llvm-project/llvm/../mlir/include/mlir/IR/Builders.h:455
#8  0x00007fff46ebd3ef in mlir::RewriterBase::replaceOpWithNewOp<mlir::mhlo::ConcatenateOp, mlir::ValueRange, unsigned long> (this=0x7fffffffa370, op=0x55555a1d6e30, 
    args=@0x7fffffff9698: 18446744073709551614, args=@0x7fffffff9698: 18446744073709551614)
    at /usr/local/google/home/ramiroleal/torch-mlir/externals/llvm-project/llvm/../mlir/include/mlir/IR/PatternMatch.h:452
#9  0x00007fff46eae374 in (anonymous namespace)::ConvertAtenOp<mlir::torch::Torch::AtenCatOp>::matchAndRewrite (this=0x55555a1bb6d0, op=..., adaptor=..., rewriter=...)
    at /usr/local/google/home/ramiroleal/torch-mlir/lib/Conversion/TorchToMhlo/Basic.cpp:999
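For what it's worth, the dimension in frames #6-#8 is 18446744073709551614, which is -2 reinterpreted as an unsigned 64-bit value, so it looks like a negative cat dimension reaches mhlo::ConcatenateOp without being normalized. Below is a hedged sketch of the kind of normalization the AtenCatOp pattern would need; toPositiveDim/isValidDim are the helpers other TorchTo* conversions use, but the accessor spellings and surrounding code are assumptions, not the actual contents of Basic.cpp.

// Hypothetical sketch (not the exact code in this PR): inside
// ConvertAtenOp<AtenCatOp>::matchAndRewrite in lib/Conversion/TorchToMhlo/Basic.cpp,
// normalize a negative dim before building mhlo::ConcatenateOp.
int64_t dim;
if (!matchPattern(op.dim(), m_TorchConstantInt(&dim)))
  return rewriter.notifyMatchFailure(op, "only constant dim is supported");

auto outType =
    getTypeConverter()->convertType(op.getType()).cast<RankedTensorType>();

// A dim of -2 passed straight to the builder becomes 18446744073709551614
// once interpreted as uint64_t, which is what trips the `idx < size()`
// assertion inside ConcatenateOp::inferReturnTypes.
dim = toPositiveDim(dim, outType.getRank());
if (!isValidDim(dim, outType.getRank()))
  return rewriter.notifyMatchFailure(op, "dim is statically out of range");

// `tensors` stands for the type-converted elements of the input tensor list,
// gathered the same way the existing pattern already does.
rewriter.replaceOpWithNewOp<mhlo::ConcatenateOp>(
    op, ValueRange(tensors), static_cast<uint64_t>(dim));
return success();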

Note: there are other tests also causing assertion errors. If you run the tests in parallel, you should see the assertion error messages printed before the results.

Let me know if you're able to reproduce things.

ramiro050 · Aug 15 '22 16:08

> For some reason this causes a cascade of failures in the CI, which can make it hard to tell what is going wrong.

I've seen clearer failure output when disabling multiprocessing (with the -s flag you used above).

sjain-stanford · Aug 15 '22 20:08

(apologies for the accidental closing with a comment :P )

sjain-stanford · Aug 15 '22 20:08