tpp-run does not support ml_program dialect
Tried running a torch-mlir-exported ResNet in linalg-on-tensor via tpp-run and found a crash. tpp-opt works fine though.

Commands (install torch-mlir using pip):
$ python examples/torchscript_resnet18_all_output_types.py
$ tpp-opt rn18.mlir -o rn18.mlir.opt
$ tpp-run rn18.mlir.opt -e forward -entry-point-result=void
Error
$ ./tpp-run -e forward -entry-point-result=void rn18.mlir.opt
loc("rn18.mlir.opt":9:3): error: cannot be converted to LLVM IR: missing `LLVMTranslationDialectInterface` registration for dialect for op: ml_program.global
tpp-run: /nfs_home/nhasabni/other/tensor_compiler/nhasabni_tpp-sandbox/tools/tpp-run/tpp-run.cpp:199: std::unique_ptr<llvm::Module> lowerToLLVMIR(mlir::Operation *, llvm::LLVMContext &): Assertion `llvmModule' failed.
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace.
Stack dump:
ml_program seems to be a dead end in upstream MLIR. The basic dialect ops are defined (that's why tpp-opt is fine with it), but there are no conversion passes or any further integration. The dialect seems like a stub for frontend conversion but not much more.

IREE has custom lowering passes for ml_program (see iree/compiler/MHLO/MHLOToLinalgOnTensors.cpp) but I see nothing relevant available upstream.
Looking at this rn18 example from torch-mlir, it seems like the one ml_program.global variable isn't used anywhere. So I hope we can get away with some minor IR cleanup and run the rest as is.
@nhasabni do you think that's feasible, or might the lack of ml_program lowering become a blocker in the near future?
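For reference, the declaration the error points at looks roughly like the line below. This is a sketch from memory of typical torch-mlir output; the symbol name, attributes and initializer are assumptions, not copied from rn18.mlir.opt.

```mlir
// Hypothetical module-level global as torch-mlir tends to emit it;
// name, visibility and initializer are assumed, not taken from rn18.
ml_program.global private mutable @global_seed(dense<0> : tensor<i64>) : tensor<i64>
```

If it really is private and nothing loads it, the "minor IR cleanup" could be as small as deleting that one op (or letting a symbol-DCE-style cleanup drop it), though I haven't verified that.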
Could this be an upstream pass to simply bufferize ml_program.global to memref.global?
> Could this be an upstream pass to simply bufferize ml_program.global to memref.global?
I'm not familiar with ml_program use cases, but probably. Or to a dense tensor, as you usually enter at that abstraction level.

The question is whether it should be needed at all. Maybe it's just some torch-mlir artefact/leftover.
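To make that concrete, here is a rough sketch of the mapping such a pass could perform for the immutable case. The ops are the real ml_program and memref ops, but the symbol name is made up and the choice of memref.get_global is my assumption about how the bufferized form would look.

```mlir
// Before: tensor-level global as emitted by the frontend (@weights is a made-up name).
ml_program.global private @weights(dense<1.0> : tensor<4xf32>) : tensor<4xf32>
%0 = ml_program.global_load @weights : tensor<4xf32>

// After: one possible bufferized form using a constant memref global.
memref.global "private" constant @weights : memref<4xf32> = dense<1.0>
%1 = memref.get_global @weights : memref<4xf32>
```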
I see a few ways to work on this:
- Try to upstream some conversion/bufferization pass for ml_program. This is quick and should be uncontroversial, unless people are already preparing to kill that dialect.
- Work upstream (RFC on LLVM) to kill that dialect and get the other tools (RFC on torch-mlir) to stop generating it. This is slower, but if it's the path others are heading towards, it's the best outcome.
- If all else fails, add a local pass downstream. This is by far the worst solution, but lets us "worry about this problem" at a later date, and perhaps even use this as a PoC of what the problems really are.
I recommend we work in that order.
Just to update this conversation:
- I see that torch-dynamo support in torch-mlir also ends up generating MLIR for input ML models that contains ml_program. I found that the ml_program usage in these MLIR files is not dead code. See the attached MNIST example.
- I also found that there are no upstream conversion patterns for ml_program. IREE seems to contain that code: https://github.com/openxla/iree/blob/9c424c4f4b0ebbba8c47543efb168cadb6e1e07c/compiler/src/iree/compiler/InputConversion/Common/ImportMLProgram.cpp#L82
- It looks like the IREE folks were pushing for this dialect and contributed it to MLIR upstream: https://discourse.llvm.org/t/rfc-introduce-ml-program-dialect-and-top-level-ops-proposal-v2/60907
So the bottom line is: if we want to get PyTorch 2 models imported via torch-mlir to work with tpp-opt, we would need to get the ml_program dialect working correctly. Currently, I am facing a problem with the attached MNIST example.
$ tpp-opt -default-tpp-passes mnist_with_mlprog.mlir > /tmp/x
mnist_with_mlprog.mlir:95:11: error: op was not bufferized
%18 = ml_program.global_load @global_seed : tensor<i64>
^
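For the mutable @global_seed case, a bufferization would presumably need memref.get_global plus explicit loads/stores rather than a constant global. A rough sketch of the mapping I would expect (the store side and the exact memref ops are assumptions, since no such pass exists yet; %next_seed / %next stand for the updated seed computed elsewhere):

```mlir
// Tensor level, as in the MNIST file: the seed is both read and written back.
ml_program.global private mutable @global_seed(dense<0> : tensor<i64>) : tensor<i64>
%seed = ml_program.global_load @global_seed : tensor<i64>
ml_program.global_store @global_seed = %next_seed : tensor<i64>

// One possible memref-level form after bufferization.
memref.global "private" @global_seed : memref<i64> = dense<0>
%buf = memref.get_global @global_seed : memref<i64>
%s = memref.load %buf[] : memref<i64>
memref.store %next, %buf[] : memref<i64>
```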
https://github.com/llvm/llvm-project/pull/75103