python: trim registration and loading of dialects and passes

ashay opened this issue 2 years ago • 6 comments

In the interest of merging upstream LLVM quickly, a previous patch (7f08169) updated the torch-mlir build to register all dialects and passes through Python bindings. This patch limits the dialects and passes to only those that are used in torch-mlir.

Key to this change are the removal of MLIRPythonExtension.RegisterEverything and the introduction of a new Python module (_mlir_libs/_site_initialize_0.py), where we register the dialects and passes used by torch-mlir.
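
For illustration, here is a minimal sketch of what such a site-initialize module can look like. The register_dialects() hook is the one the MLIR bindings' site-initialization machinery probes for; the native helper module and function names below are hypothetical stand-ins, not torch-mlir's actual code:

```python
# _mlir_libs/_site_initialize_0.py -- minimal sketch. The MLIR Python
# bindings look for modules named _site_initialize_<N> next to the core
# bindings and, if present, call their register_dialects() hook with a
# DialectRegistry during context setup.

def register_dialects(registry):
    # Hypothetical native extension that registers only the dialects and
    # passes torch-mlir actually uses, replacing _mlirRegisterEverything.
    from ._torchMlirRegistration import register_required_dialects_and_passes
    register_required_dialects_and_passes(registry)
```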

Release build now passes for this branch: https://github.com/llvm/torch-mlir/actions/runs/2843596925

ashay avatar Aug 12 '22 04:08 ashay

Sorry, the usual CI run failed. Marking as draft while I diagnose and fix the problem.

ashay avatar Aug 12 '22 05:08 ashay

Rebasing to see if PR https://github.com/llvm/torch-mlir/pull/1218 resolves the build failure.

ashay avatar Aug 12 '22 22:08 ashay

I realized that hidden in the CI logs is an assertion failure:

python: /home/runner/work/torch-mlir/torch-mlir/externals/llvm-project/llvm/include/llvm/Support/Casting.h:578: decltype(auto) llvm::cast(From *) [To = mlir::ReifyRankedShapedTypeOpInterface, From = mlir::Operation]: Assertion `isa<To>(Val) && "cast<Ty>() argument of incompatible type!"' failed.

My local build did not enable assertions, so I couldn't reproduce the problem. Rebuilding with assertions enabled (LLVM_ENABLE_ASSERTIONS=ON) now.

ashay avatar Aug 12 '22 23:08 ashay

Looks like something broke the ref backend tests. The output from the failures is pretty slim and that limits any further insights I might have (I wish we biased towards verbose output on failure for such things -- much more CI friendly).

We actually have good errors here, but we squelch them when running our e2e tests in parallel, because it is really hard to capture that information for a crash under Python multiprocessing. I suppose for the CI we could run in single-threaded mode, since we only have 2 cores anyway. But yes, this was a regression introduced when the e2e tests were parallelized.
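
To illustrate the capture problem, here is a minimal, self-contained sketch (not torch-mlir's test runner): when a worker dies from a native abort, the parent is left with only an exit code, and Python-level output capture never sees the assertion text.

```python
import multiprocessing
import os

def run_test():
    os.abort()  # stand-in for a native assertion failure inside a worker

if __name__ == "__main__":
    proc = multiprocessing.Process(target=run_test)
    proc.start()
    proc.join()
    # On Linux this prints -6 (SIGABRT); the assertion message itself never
    # reaches any Python-level capture in the parent process.
    print("worker exitcode:", proc.exitcode)
```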

silvasean avatar Aug 15 '22 22:08 silvasean

(The usual answer is to re-run single-threaded with -s, but that doesn't help on the CI and adds an extra iteration step.)

silvasean avatar Aug 15 '22 22:08 silvasean

You get 4 vCPUs on Linux and 3 on macOS.

powderluv avatar Aug 16 '22 03:08 powderluv

I think I may have found the culprit. The key change was to call mlir::tensor::registerInferTypeOpInterfaceExternalModels(registry); when registering dialects (and to add the corresponding library dependency). That would explain the assertion above: without those external models registered, tensor ops don't implement ReifyRankedShapedTypeOpInterface, so the cast fails. Fingers crossed that this passes CI!

ashay avatar Aug 30 '22 23:08 ashay

Both the regular CI (https://github.com/llvm/torch-mlir/pull/1214/checks) and the Release build (https://github.com/llvm/torch-mlir/actions/runs/2988625814) pass for this branch. However, if I build the code in this branch in a directory that contains a build from any other branch, then the residual _mlirRegisterEverything.cpython-*.so from the previous build causes the end-to-end tests in this branch to fail, as follows:

$ python3 python/test/torchscript_e2e_test/basic.py
ImportError: build/tools/torch-mlir/python_packages/torch_mlir/torch_mlir/_mlir_libs/_mlirRegisterEverything.cpython-38-x86_64-linux-gnu.so: undefined symbol: mlirRegisterAllPasses

This happens because the MLIR Python bindings still find and try to import the stale _mlirRegisterEverything module, and the dynamic loader cannot resolve mlirRegisterAllPasses, since this PR drops the reference to MLIRPythonExtension.RegisterEverything.

Short-term workaround: Delete the _mlirRegisterEverything.*.so file in the build directory.
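
For example, a one-off cleanup of the stale extension could look like this (the glob pattern simply mirrors the path from the error above):

```python
# Remove any leftover _mlirRegisterEverything extension from a previous build.
import glob
import os

pattern = ("build/tools/torch-mlir/python_packages/torch_mlir/torch_mlir/"
           "_mlir_libs/_mlirRegisterEverything.*.so")
for stale_so in glob.glob(pattern):
    os.remove(stale_so)
    print("removed", stale_so)
```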

Perhaps a longer-term solution is to handle the ImportError exception just like we handle the ModuleNotFoundError exception (by making process_initializer_module() return False). However, doing so has the side effect of masking syntax errors in the (user-provided) _site_initialize_*.py modules.
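
A sketch of that idea, modeled loosely on the shape of process_initializer_module() in the MLIR bindings' _mlir_libs/__init__.py (paraphrased, not the verbatim upstream code):

```python
import importlib
import logging

def process_initializer_module(module_name):
    # Runs inside the torch_mlir._mlir_libs package, so the relative import
    # below resolves against that package.
    try:
        module = importlib.import_module(f".{module_name}", __name__)
    except ModuleNotFoundError:
        # No such module: stop probing further _site_initialize_<N> names.
        return False
    except ImportError:
        # A stale native extension (e.g. a leftover _mlirRegisterEverything
        # .so with unresolved symbols) also lands here. Returning False keeps
        # incremental builds working, at the cost noted above: genuine
        # import-time errors in user-provided _site_initialize_*.py modules
        # get masked.
        logging.warning("skipping site initializer %s (ImportError)", module_name)
        return False
    # ... call the module's registration hooks (e.g. register_dialects) as
    # the upstream bindings do ...
    return True
```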

ashay avatar Sep 05 '22 10:09 ashay