torch-mlir
python: trim registration and loading of dialects and passes
In the interest of merging the upstream LLVM update quickly, a previous patch (7f08169) updated the torch-mlir build to register all dialects and passes through the Python bindings. This patch limits the dialects and passes to only those that torch-mlir actually uses.
Key to this change are the removal of `MLIRPythonExtension.RegisterEverything` and the introduction of a new Python module (`_mlir_libs/_site_initialize_0.py`), where we register the dialects and passes used by torch-mlir.
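For reference, the MLIR Python runtime discovers modules named `_mlir_libs/_site_initialize_<N>.py` and invokes hooks that they define. Below is a minimal, illustrative sketch of what such a module could look like; the hook names are assumptions based on how MLIR's site initialization generally works, and the `_torchMlirRegistration` helper with its `register_required_dialects` function is a hypothetical stand-in for the native extension that performs the trimmed-down registration, not the actual code in this patch.

```python
# _mlir_libs/_site_initialize_0.py -- illustrative sketch only.
# The MLIR Python runtime imports modules named _site_initialize_<N> and
# calls the hooks below if they are defined.


def register_dialects(registry):
    # Register only the dialects/extensions that torch-mlir needs, instead
    # of pulling in MLIRPythonExtension.RegisterEverything.
    from ._torchMlirRegistration import register_required_dialects  # hypothetical helper
    register_required_dialects(registry)


def context_init_hook(context):
    # Optional per-Context setup (e.g., eagerly loading common dialects);
    # left empty in this sketch.
    pass
```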
Release build now passes for this branch: https://github.com/llvm/torch-mlir/actions/runs/2843596925
Sorry, the usual CI run failed. Marking as draft while I diagnose and fix the problem.
Rebasing to see if PR https://github.com/llvm/torch-mlir/pull/1218 resolves the build failure.
I realized that hidden in the CI logs is an assertion failure:
python: /home/runner/work/torch-mlir/torch-mlir/externals/llvm-project/llvm/include/llvm/Support/Casting.h:578: decltype(auto) llvm::cast(From *) [To = mlir::ReifyRankedShapedTypeOpInterface, From = mlir::Operation]: Assertion `isa<To>(Val) && "cast<Ty>() argument of incompatible type!"' failed.
My local build did not enable assertions, so I couldn't reproduce the problem. Rebuilding with assertions set to ON now.
Looks like something broke the ref backend tests. The output from the failures is pretty slim and that limits any further insights I might have (I wish we biased towards verbose output on failure for such things -- much more CI friendly).
We actually have good errors here, but we squelch them when running our e2e tests in parallel, because it is really hard to capture that information for a crash with Python multiprocessing. I suppose for the CI we could run in single-threaded mode since we only have 2 cores anyway. But yes, this was a regression introduced when we parallelized the tests.
(the usual answer is to re-run single-threaded with -s, but that either doesn't help on the CI or costs an extra iteration step)
You get 4 vCPUs on Linux and 3 on macOS.
I think I may have found the culprit. The key change was to call `mlir::tensor::registerInferTypeOpInterfaceExternalModels(registry);` when registering dialects (and to add the corresponding library dependency). Fingers crossed that this passes CI!
Both the regular CI (https://github.com/llvm/torch-mlir/pull/1214/checks) and the Release build (https://github.com/llvm/torch-mlir/actions/runs/2988625814) pass for this branch. However, if I build the code in this branch in a directory that contains the build from any other branch, then the residual `_mlirRegisterEverything.cpython-*.so` from the previous build causes the end-to-end tests in this branch to fail, as follows:
$ python3 python/test/torchscript_e2e_test/basic.py
ImportError: build/tools/torch-mlir/python_packages/torch_mlir/torch_mlir/_mlir_libs/_mlirRegisterEverything.cpython-38-x86_64-linux-gnu.so: undefined symbol: mlirRegisterAllPasses
This happens because when the MLIR Python bindings try to import the `_mlirRegisterEverything` module, the dynamic importer fails, since this PR drops the reference to `MLIRPythonExtension.RegisterEverything`.
Short-term workaround: delete the `_mlirRegisterEverything.*.so` file in the build directory.
Perhaps a longer-term solution is to handle the `ImportError` exception just like we handle the `ModuleNotFoundError` exception (by making `process_initializer_module()` return `False`). However, doing so has the effect of masking genuine errors (such as syntax errors) in the (user-provided) `_site_initialize_*.py` modules.
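A minimal sketch of that idea follows, assuming a loader shaped roughly like MLIR's site-initialization helper; the structure, logging, and relative-import details are assumptions for illustration, not the upstream implementation of `process_initializer_module()`.

```python
# Illustrative sketch: treat a failed native-extension import (ImportError)
# like a missing module (ModuleNotFoundError). Loosely modeled on the MLIR
# site-initialization flow; not the actual upstream code.
import importlib
import logging


def process_initializer_module(module_name: str) -> bool:
    try:
        module = importlib.import_module(f".{module_name}", __package__)
    except ModuleNotFoundError:
        # The module simply is not present; skip it silently.
        return False
    except ImportError:
        # e.g. a stale _mlirRegisterEverything.*.so with unresolved symbols.
        # Returning False keeps initialization going, but it can also hide
        # genuine problems in user-provided _site_initialize_*.py modules.
        logging.warning("Could not import site initializer %s", module_name)
        return False
    # ... invoke the registration hooks defined by `module` here ...
    return True
```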