
Implementation of torch-to-linalg lowering of AtenOuterOp

Open amemov opened this issue 9 months ago • 18 comments

An attempt to resolve #4093

Initial implementation:

  • Defined the op in Linear.cpp

amemov avatar Mar 19 '25 20:03 amemov

Hi, this is my first time contributing to the project - if you have any feedback or suggestions, I would really appreciate that.

amemov avatar Mar 19 '25 20:03 amemov

Thanks for picking this up.

There isn't any reason to include quantization logic for this op since it doesn't have any qdq fusion implemented in FuseQuantizedOps.cpp.

It would also be a bit better to implement this directly as a linalg.generic op, rather than unsqueezes and a matmul with a reduction dim size of 1. If you were to do the unsqueeze/matmul approach, it would be more appropriate to put this logic in DecomposeComplexOps.cpp.
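For intuition, both strategies compute the same result, out[i, j] = a[i] * b[j]. Here is a minimal NumPy sketch (a stand-in, not torch-mlir code) contrasting the unsqueeze+matmul decomposition with the direct elementwise form that a single linalg.generic with identity/broadcast indexing maps expresses:

```python
# NumPy stand-in contrasting the two lowering strategies discussed above.
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0])

# Strategy 1: unsqueeze both operands, then matmul with a reduction dim of 1.
via_matmul = a[:, None] @ b[None, :]  # (3, 1) @ (1, 2) -> (3, 2)

# Strategy 2: the direct form out[i, j] = a[i] * b[j], which is what a
# linalg.generic with broadcast indexing maps computes (no reduction at all).
direct = np.empty((a.shape[0], b.shape[0]))
for i in range(a.shape[0]):
    for j in range(b.shape[0]):
        direct[i, j] = a[i] * b[j]

assert np.allclose(via_matmul, direct)
assert np.allclose(direct, np.outer(a, b))
```

The direct form avoids materializing the two intermediate unsqueezed tensors, which is why it is the better fit for a one-shot torch-to-linalg lowering.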

Also, please do add e2e tests somewhere in ./projects/pt1/python/torch_mlir_e2e_test/test_suite/.

zjgarvey avatar Mar 20 '25 18:03 zjgarvey

Also, be sure to run either pre-commit run --all-files (you will need to install it with pip install pre-commit) or git clang-format to auto-format the files.

zjgarvey avatar Apr 02 '25 16:04 zjgarvey

I changed it to the init tensor and ran pre-commit - everything looks good on my end.

amemov avatar Apr 03 '25 14:04 amemov

Hi @amemov, can you please take a look at the CI failure?

Hi @vivekkhandelwal1, I skimmed it briefly before - I didn't see any failures specifically related to the torch.outer() lowering I wrote or to my test case.

I will take a better look at it today, but so far I'm not really sure what exactly I need to modify / add here.

amemov avatar Apr 08 '25 11:04 amemov

Hi @amemov, some test(s) is/are crashing for the fx_importer config. Most probably, it will be the one that you have added. In order to find out which test is crashing you need to run the tests serially. You may use the following command:

python -m projects.pt1.e2e_testing.main --config=fx_importer -s

The above command will run all the tests one by one. And, the last test run will be the one that's crashing. Then, you can figure out the fix for that.
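Why running the tests serially (each in its own process) pinpoints a crasher can be sketched in a few lines of plain Python. This is an illustrative toy runner, not the actual torch-mlir harness; the "tests" are hypothetical snippets run with `python -c`:

```python
# Toy illustration: a hard crash (abort/signal) in a child process cannot
# take down the runner, so the first test whose process dies identifies
# the culprit.
import subprocess
import sys

# Stand-in "tests". The last one aborts, mimicking an assertion failure
# inside native code.
TESTS = {
    "test_ok_1": "print('ok')",
    "test_ok_2": "print('ok')",
    "test_crasher": "import os; os.abort()",
}

def find_crasher(tests):
    for name, snippet in tests.items():
        proc = subprocess.run([sys.executable, "-c", snippet],
                              capture_output=True)
        if proc.returncode != 0:
            return name  # first test whose process died
    return None

if __name__ == "__main__":
    print(find_crasher(TESTS))  # -> test_crasher
```

Running tests in parallel, by contrast, interleaves output from many workers, so the crashing test's name is often lost.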

vivekkhandelwal1 avatar Apr 09 '25 12:04 vivekkhandelwal1

@vivekkhandelwal1 The problem was raised from the test file that I wrote:

torch-mlir/externals/llvm-project/llvm/include/llvm/Support/Casting.h:566: decltype(auto) llvm::cast(const From&) [with To = mlir::RankedTensorType; From = mlir::Type]: Assertion `isa<To>(Val) && "cast<Ty>() argument of incompatible type!"' failed

I resolved it by changing the casting and the operand dimensions. On my machine, the AtenOuter test now passes.
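The failure mode above is a general one: llvm::cast asserts on an incompatible type, while llvm::dyn_cast lets a conversion pattern bail out gracefully when the type isn't what it expects. A rough Python analog of the distinction (illustrative only - the real fix lives in the C++ pattern):

```python
# Rough Python analog of LLVM's cast<>/dyn_cast<> distinction (illustrative
# only). In the real C++ code, cast<RankedTensorType> asserts when the
# mlir::Type is actually something else; dyn_cast returns null so the
# pattern can emit a match failure instead of crashing.
class Type: ...
class RankedTensorType(Type): ...
class UnrankedTensorType(Type): ...

def cast(cls, val):
    # Like llvm::cast: hard assertion failure on an incompatible type.
    assert isinstance(val, cls), 'cast<Ty>() argument of incompatible type!'
    return val

def dyn_cast(cls, val):
    # Like llvm::dyn_cast: returns None instead of asserting.
    return val if isinstance(val, cls) else None

t = UnrankedTensorType()
assert dyn_cast(RankedTensorType, t) is None  # caller can bail out gracefully
```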

amemov avatar Apr 12 '25 15:04 amemov

@amemov, a contributor has added the lowering for this same op through decomposition here: https://github.com/llvm/torch-mlir/pull/4138.

Your PR is older, so in the case of a conflict you should get the chance to complete it, but their approach (via decomposition) is the better one. Can you and @ivanamitreski work out a solution between you?

vivekkhandelwal1 avatar Apr 22 '25 12:04 vivekkhandelwal1

@amemov Please resolve the remaining comments.

vivekkhandelwal1 avatar Jun 09 '25 14:06 vivekkhandelwal1

@zjgarvey could you take a look at the changes? Thank you!

amemov avatar Jun 16 '25 20:06 amemov

@zjgarvey

amemov avatar Jun 25 '25 03:06 amemov

The tests are failing, so I'd recommend running your new tests locally with each of the different configs.

zjgarvey avatar Jun 26 '25 14:06 zjgarvey

@amemov I am part of a group which has adapted this conversion pattern as part of our lowering process for a MoE model down to the Linalg MLIR dialect. To fully integrate this pattern into our codebase, we need to pull it from the production Torch-MLIR. If you can address the test failures as suggested by the reviewer so it can be merged, we would greatly appreciate it. Thanks!

michizhou avatar Jul 01 '25 23:07 michizhou

@zjgarvey , I looked into the test failures and they're all unrelated to my AtenOuterOp changes. The 6 failing tests are all FX importer issues - nothing to do with the decomposition patterns I added.

I verified my implementation works correctly by testing the decomposition manually as well, so I'm a little confused. I tried release and debug builds - both work on my end. Also, if you look at the failures in the CI, they don't suggest that the problem is due to the proposed decomposition either.

amemov avatar Sep 03 '25 21:09 amemov

Can you sync with main? Then we can re-run the CI to make sure it's not a failure specific to your changes.

We won't be able to merge this unless the CI passes.

zjgarvey avatar Sep 03 '25 22:09 zjgarvey

@zjgarvey , I re-ran the tests on my machine after syncing my branch with main (and fetching these changes) - the errors I see are not due to the AtenOuterOp lowering:

(mlir_venv) [ashepelev@thinkpadx1-carbon build]$ ninja check-torch-mlir
[0/1] Running the torch-mlir regression tests
Enabling sparsity propagation tests
Enabling Torch v2.3+ tests
Skipping onnx tests.. no onnx
FAIL: TORCH_MLIR :: python/fx_importer/sparsity/sparse_test.py (43 of 117)
******************** TEST 'TORCH_MLIR :: python/fx_importer/sparsity/sparse_test.py' FAILED ********************
Exit Code: 2

Command Output (stderr):
--
/home/ashepelev/LLVM-Stuff/torch-mlir/mlir_venv/bin/python3.13 /home/ashepelev/LLVM-Stuff/torch-mlir/test/python/fx_importer/sparsity/sparse_test.py | FileCheck /home/ashepelev/LLVM-Stuff/torch-mlir/test/python/fx_importer/sparsity/sparse_test.py # RUN: at line 6
+ /home/ashepelev/LLVM-Stuff/torch-mlir/mlir_venv/bin/python3.13 /home/ashepelev/LLVM-Stuff/torch-mlir/test/python/fx_importer/sparsity/sparse_test.py
+ FileCheck /home/ashepelev/LLVM-Stuff/torch-mlir/test/python/fx_importer/sparsity/sparse_test.py
Traceback (most recent call last):
  File "/home/ashepelev/LLVM-Stuff/torch-mlir/test/python/fx_importer/sparsity/sparse_test.py", line 19, in <module>
    from torch_mlir_e2e_test.linalg_on_tensors_backends.refbackend import (
        RefBackendLinalgOnTensorsBackend,
    )
ModuleNotFoundError: No module named 'torch_mlir_e2e_test.linalg_on_tensors_backends.refbackend'
FileCheck error: '<stdin>' is empty.
FileCheck command line:  FileCheck /home/ashepelev/LLVM-Stuff/torch-mlir/test/python/fx_importer/sparsity/sparse_test.py

--

********************
********************
Failed Tests (1):

  TORCH_MLIR :: python/fx_importer/sparsity/sparse_test.py


Testing Time: 14.54s

Total Discovered Tests: 117
  Unsupported:   9 (7.69%)
  Passed     : 107 (91.45%)
  Failed     :   1 (0.85%)
FAILED: [code=1] tools/torch-mlir/test/CMakeFiles/check-torch-mlir /home/ashepelev/LLVM-Stuff/torch-mlir/build/tools/torch-mlir/test/CMakeFiles/check-torch-mlir 
cd /home/ashepelev/LLVM-Stuff/torch-mlir/build/tools/torch-mlir/test && /home/ashepelev/LLVM-Stuff/torch-mlir/mlir_venv/bin/python3.13 /home/ashepelev/LLVM-Stuff/torch-mlir/build/./bin/llvm-lit -sv /home/ashepelev/LLVM-Stuff/torch-mlir/build/tools/torch-mlir/test
ninja: build stopped: subcommand failed.

amemov avatar Sep 04 '25 15:09 amemov

I think this is due to some missing build flags. I think you need to enable the jit IR importer and something else. The development.md doc should have some info about the cmake config for local testing.

Then I'd run projects/pt1/tools/e2e_test.sh -s -c <failing config> -v to see what is failing.

zjgarvey avatar Sep 04 '25 17:09 zjgarvey

@zjgarvey , I triple checked - all of my files are now in sync with the most recent changes. When I run ninja check-torch-mlir-all I get 0 errors. In the last CI build, all the errors I saw were related to ONNX bf16. I'm not sure if that's because, when the last CI ran, only my branch of the fork was synced with upstream (the main branch of the fork was not) - but in any case, with the command above I don't see any errors.

Also, for reference, here is the build command I used - it includes everything needed to enable e2e testing:

cmake -GNinja -Bbuild \
-DCMAKE_BUILD_TYPE=RelWithDebInfo \
-DLLVM_ENABLE_ASSERTIONS=ON \
-DPython3_FIND_VIRTUALENV=ONLY \
-DPython_FIND_VIRTUALENV=ONLY \
-DMLIR_ENABLE_BINDINGS_PYTHON=ON \
-DLLVM_TARGETS_TO_BUILD=host \
-DLLVM_ENABLE_PROJECTS=mlir \
-DLLVM_EXTERNAL_PROJECTS="torch-mlir" \
-DLLVM_EXTERNAL_TORCH_MLIR_SOURCE_DIR="$PWD" \
-DTORCH_MLIR_ENABLE_PYTORCH_EXTENSIONS=ON \
-DTORCH_MLIR_ENABLE_JIT_IR_IMPORTER=ON \
-DCMAKE_C_COMPILER=clang \
-DCMAKE_CXX_COMPILER=clang++ \
-DCMAKE_EXE_LINKER_FLAGS_INIT="--ld-path=ld.lld" \
-DCMAKE_MODULE_LINKER_FLAGS_INIT="--ld-path=ld.lld" \
-DCMAKE_SHARED_LINKER_FLAGS_INIT="--ld-path=ld.lld" \
externals/llvm-project/llvm

amemov avatar Sep 05 '25 04:09 amemov