onnx-mlir Compiling bidaf-9.onnx takes 170GB of memory

When I attempt to compile the bidaf-9 model from the onnx model zoo, compiling stops after about 7 minutes with no information.

Watching memory usage during compilation, I see the process use about 300 MB up to roughly the 5-minute mark. After that, usage starts to grow, reaching just short of 60 GB before the process gets killed at 7 minutes, presumably by the Linux OOM killer as my system runs out of memory.
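For anyone who wants a single peak-memory number rather than watching a monitor, here is a minimal sketch in C++ that runs a command as a child process and reads its peak resident set size from the kernel. It is only an illustration: the helper name runAndGetPeakRss is hypothetical, it assumes a POSIX system that provides wait4, and on Linux ru_maxrss is reported in kilobytes (other systems may use different units).

```cpp
#include <string>
#include <vector>

#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper: runs `args` as a child process and returns the child's
// peak resident set size as reported in ru_maxrss (kilobytes on Linux).
// Returns -1 if the command list is empty or the process cannot be started
// or waited for.
long runAndGetPeakRss(const std::vector<std::string> &args) {
  if (args.empty())
    return -1;

  // Build a NULL-terminated argv array for execvp.
  std::vector<char *> argv;
  for (const auto &a : args)
    argv.push_back(const_cast<char *>(a.c_str()));
  argv.push_back(nullptr);

  pid_t pid = fork();
  if (pid < 0)
    return -1;
  if (pid == 0) {
    // Child: replace this process with the command being measured.
    execvp(argv[0], argv.data());
    _exit(127); // Only reached if execvp failed.
  }

  // Parent: wait for the child and collect its resource usage. The usage is
  // still filled in when the child was terminated by a signal.
  int status = 0;
  struct rusage usage {};
  if (wait4(pid, &status, 0, &usage) < 0)
    return -1;
  return usage.ru_maxrss;
}
```

Calling runAndGetPeakRss({"onnx-mlir", "bidaf-9.onnx"}) would report the peak for one compile attempt, assuming onnx-mlir is on the PATH and the model is in the current directory.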

Feb 20 '24 16:02 cjvolzka

@negiyas reported he was able to successfully compile the model, but it took 170 GB of memory.

@imaihal reported that if LLVM patch https://reviews.llvm.org/D148487 is applied, memory usage caps at 1 GB and compilation takes about 5 min.

@tungld do you have bandwidth to see if we can get your llvm patch merged into llvm to fix the issue?

Feb 20 '24 16:02 cjvolzka

I thought we fixed this problem, originally observed in https://github.com/onnx/onnx-mlir/issues/2084, last year, but I guess @tungld's LLVM patch was reverted due to some problem?

Feb 21 '24 05:02 gongsu832

@gongsu832 yes, I did the LLVM patch, but it somehow caused flang in LLVM to fail, so it was reverted.

Feb 21 '24 07:02 tungld

Hi all, I modified @tungld's LLVM patch so it doesn't crash the repro in https://github.com/llvm/llvm-project/issues/62802, which caused the patch to be reverted.

Could anyone help test whether it still helps the bidaf-9 model? (I don't know how to set up onnx-mlir to run an onnx model :( )

Here's the modified LLVM patch: dangling-const.patch. I'll create a pull request to the LLVM repo once we can confirm the patch helps.

Feb 22 '24 05:02 python3kgae

@python3kgae great, thanks for your patch! It looks like your patch is for old LLVM code. Do you have a patch for recent LLVM code?

Feb 22 '24 05:02 tungld

@python3kgae I checked bidaf-9 with your patch, and memory consumption peaked at around 1.7 GB. So it does help bidaf-9. Thank you very much @python3kgae!

Feb 22 '24 06:02 tungld

> @python3kgae great, thanks for your patch! It looks like your patch is for old LLVM code. Do you have a patch for recent LLVM code?

I'm using old LLVM code to test the old repro. I'll switch to recent LLVM code when I create the pull request to the LLVM repo.

Feb 22 '24 13:02 python3kgae

Pull request created: https://github.com/llvm/llvm-project/pull/82708

Feb 22 '24 23:02 python3kgae

I tried to run onnx-mlir bidaf-9.onnx but hit an error at https://github.com/onnx/onnx-mlir/blob/main/src/Conversion/ONNXToKrnl/Math/Reduction.cpp#L709 because estimatedSimdLoopTripCount is not initialized.

Is this expected for a Windows build of onnx-mlir?

Feb 23 '24 04:02 python3kgae

> I tried to run onnx-mlir bidaf-9.onnx but hit an error at https://github.com/onnx/onnx-mlir/blob/main/src/Conversion/ONNXToKrnl/Math/Reduction.cpp#L709 because estimatedSimdLoopTripCount is not initialized.
>
> Is this expected for a Windows build of onnx-mlir?

SIMD-related code typically only works on s390x Linux, so failure on Windows isn't surprising. @AlexandreEichenberger should be able to provide a more definitive answer since he wrote most of the SIMD code.

Feb 24 '24 02:02 gongsu832

Created a PR to create only one globalOp for all strings in a string literal: https://github.com/onnx/onnx-mlir/pull/2727

This could save a lot of time when debugging the bidaf-9 model.
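To illustrate the general idea of replacing many per-string globals with a single one, here is a minimal sketch in plain C++. It is not the code from the PR: the PackedStrings struct and the packStrings and getString helpers are hypothetical names, and the layout (NUL-terminated strings concatenated into one buffer, plus a table of start offsets) is just one assumed way to do it.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical layout: every string lives in one contiguous buffer, so only a
// single global object is needed instead of one per string.
struct PackedStrings {
  std::string buffer;               // All strings, each followed by '\0'.
  std::vector<std::size_t> offsets; // Start position of each string in buffer.
};

// Concatenate all strings into one buffer and record where each one starts.
PackedStrings packStrings(const std::vector<std::string> &strs) {
  PackedStrings packed;
  for (const auto &s : strs) {
    packed.offsets.push_back(packed.buffer.size());
    packed.buffer.append(s);
    packed.buffer.push_back('\0');
  }
  return packed;
}

// Look up the i-th string as a C string pointing into the shared buffer.
// The caller must pass an index smaller than packed.offsets.size().
const char *getString(const PackedStrings &packed, std::size_t i) {
  return packed.buffer.data() + packed.offsets[i];
}
```

With this kind of layout, the number of global objects stays at one regardless of how many strings the literal contains.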

Feb 25 '24 04:02 python3kgae

> I tried to run onnx-mlir bidaf-9.onnx but hit an error at https://github.com/onnx/onnx-mlir/blob/main/src/Conversion/ONNXToKrnl/Math/Reduction.cpp#L709 because estimatedSimdLoopTripCount is not initialized.
>
> Is this expected for a Windows build of onnx-mlir?

I believe this happens because on Windows we build with warnings treated as errors. If you don't mind, probably just adding = 0, as in

 int64_t estimatedSimdLoopTripCount = 0;

here https://github.com/onnx/onnx-mlir/blob/01c5c9fb536a43cde36abccf562bb2f6cb594cb4/src/Conversion/ONNXToKrnl/Math/Reduction.cpp#L490 would fix the problem.

In general, SIMD works on x86 Linux, so I have to assume it does too on Windows.
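As a rough illustration of the pattern being described (not the actual Reduction.cpp code), the sketch below shows a variable that is assigned on only one path and read afterwards. The function estimateTripCount and its parameters are hypothetical. Without the = 0 initializer, a compiler may warn that the variable could be read uninitialized, and a build that treats warnings as errors would then fail.

```cpp
#include <cstdint>

// Hypothetical stand-in for the kind of logic involved: the trip count is
// assigned only when SIMD is usable, but it is read on every path.
int64_t estimateTripCount(bool simdEnabled, int64_t innerDim,
                          int64_t vectorLength) {
  // Initializing to 0 gives the variable a defined value on every path, so
  // there is nothing left for an "uninitialized variable" warning to flag.
  int64_t estimatedSimdLoopTripCount = 0;
  if (simdEnabled && vectorLength > 0)
    estimatedSimdLoopTripCount = innerDim / vectorLength;
  return estimatedSimdLoopTripCount;
}
```

With the initializer in place, the non-SIMD path simply returns 0 instead of an indeterminate value.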

Feb 26 '24 19:02 AlexandreEichenberger

The fix is in https://github.com/llvm/llvm-project/commit/c11627c2f4d550613a3cb360c89a0cf52d2eb720

Feb 27 '24 02:02 python3kgae

@python3kgae thanks so much!!!

Feb 27 '24 03:02 tungld

> @python3kgae thanks so much!!!

Thank you for creating this project :)

Feb 27 '24 04:02 python3kgae

Closing, as this was fixed by the recent LLVM uplift.

Apr 15 '24 14:04 cjvolzka