Compiling bidaf-9.onnx takes 170GB of memory

Open cjvolzka opened this issue 1 year ago • 15 comments

When I attempt to compile the bidaf-9 model from the ONNX model zoo, compilation stops after about 7 minutes with no diagnostic output.

Watching memory usage during compilation, it stays at about 300 MB for roughly the first 5 min. After that it starts to grow, reaching just short of 60 GB before the process gets killed at 7 min, presumably by the Linux OOM killer as my system runs out of memory.

cjvolzka avatar Feb 20 '24 16:02 cjvolzka

@negiyas reported he was able to successfully compile the model, but it took 170 GB of memory.

@imaihal reported that if LLVM patch https://reviews.llvm.org/D148487 is applied, memory usage caps at 1 GB and compilation takes about 5 min.

@tungld do you have bandwidth to see if we can get your llvm patch merged into llvm to fix the issue?

cjvolzka avatar Feb 20 '24 16:02 cjvolzka

I thought we fixed this problem originally observed in https://github.com/onnx/onnx-mlir/issues/2084 last year but I guess @tungld's LLVM patch was reverted due to some problem?

gongsu832 avatar Feb 21 '24 05:02 gongsu832

@gongsu832 yes, I wrote the LLVM patch, but it somehow caused flang in LLVM to fail, so it was reverted.

tungld avatar Feb 21 '24 07:02 tungld

Hi all, I modified @tungld's LLVM patch so it doesn't crash the repro in https://github.com/llvm/llvm-project/issues/62802, which caused the patch to be reverted.

Could anyone help test whether it still helps the bidaf-9 model? (I don't know how to set up onnx-mlir to run an ONNX model :( )

Here's the modified LLVM patch: dangling-const.patch. I'll create a pull request to the LLVM repo once we can confirm the patch helps.

python3kgae avatar Feb 22 '24 05:02 python3kgae

@python3kgae great, thanks for your patch! It looks like your patch is for old LLVM code. Do you have a patch for recent LLVM code?

tungld avatar Feb 22 '24 05:02 tungld

@python3kgae I checked bidaf-9 with your patch, and memory consumption peaked at around 1.7 GB. So it does help bidaf-9. Thank you very much @python3kgae!

tungld avatar Feb 22 '24 06:02 tungld

> @python3kgae great, thanks for your patch! It looks like your patch is for old LLVM code. Do you have a patch for recent LLVM code?

I'm using old LLVM code to test the old repro. I'll switch to recent LLVM code when I create the pull request to the LLVM repo.

python3kgae avatar Feb 22 '24 13:02 python3kgae

Pull request created https://github.com/llvm/llvm-project/pull/82708

python3kgae avatar Feb 22 '24 23:02 python3kgae

I tried to run onnx-mlir bidaf-9.onnx but hit an error in https://github.com/onnx/onnx-mlir/blob/main/src/Conversion/ONNXToKrnl/Math/Reduction.cpp#L709 because estimatedSimdLoopTripCount is not initialized.

Is this expected for a Windows build of onnx-mlir?

python3kgae avatar Feb 23 '24 04:02 python3kgae

> I tried to run onnx-mlir bidaf-9.onnx but hit an error in https://github.com/onnx/onnx-mlir/blob/main/src/Conversion/ONNXToKrnl/Math/Reduction.cpp#L709 because estimatedSimdLoopTripCount is not initialized.
>
> Is this expected for a Windows build of onnx-mlir?

SIMD-related code typically only works on s390x Linux, so a failure on Windows isn't surprising. @AlexandreEichenberger should be able to provide a more definitive answer since he wrote most of the SIMD code.

gongsu832 avatar Feb 24 '24 02:02 gongsu832

Created a PR to create only one globalOp for all strings in a string literal: https://github.com/onnx/onnx-mlir/pull/2727

This could save a lot of time when debugging the bidaf-9 model.
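For readers unfamiliar with the idea, here is a rough, conceptual sketch of "one global for all strings". It is not the code from the PR and does not use any MLIR or onnx-mlir API. The names PackedStrings, packStrings and stringAt are hypothetical and exist only for this illustration. The sketch assumes NUL-terminated strings with no embedded '\0' and requires C++17.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical container: instead of one object per string, keep a single
// shared character buffer plus the start offset of each string in it.
struct PackedStrings {
  std::string buffer;               // all strings, each followed by '\0'
  std::vector<std::size_t> offsets; // start index of each string in buffer
};

// Concatenate every string into one buffer and record where each one starts.
PackedStrings packStrings(const std::vector<std::string> &strings) {
  PackedStrings packed;
  packed.offsets.reserve(strings.size());
  for (const std::string &s : strings) {
    packed.offsets.push_back(packed.buffer.size());
    packed.buffer.append(s);
    packed.buffer.push_back('\0');
  }
  return packed;
}

// Recover the i-th string as a view into the shared buffer.
// Relies on the '\0' terminator written by packStrings.
std::string_view stringAt(const PackedStrings &packed, std::size_t index) {
  assert(index < packed.offsets.size() && "string index out of range");
  return std::string_view(packed.buffer.data() + packed.offsets[index]);
}
```

With this layout the number of top-level objects stays at one no matter how many strings the literal holds, which is the general reason such a change can shorten processing of a model with many string constants.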

python3kgae avatar Feb 25 '24 04:02 python3kgae

> I tried to run onnx-mlir bidaf-9.onnx but hit an error in https://github.com/onnx/onnx-mlir/blob/main/src/Conversion/ONNXToKrnl/Math/Reduction.cpp#L709 because estimatedSimdLoopTripCount is not initialized.
>
> Is this expected for a Windows build of onnx-mlir?

I believe this happens because on Windows we run with warnings treated as errors. If you don't mind, probably just adding = 0, as in

 int64_t estimatedSimdLoopTripCount = 0;

here https://github.com/onnx/onnx-mlir/blob/01c5c9fb536a43cde36abccf562bb2f6cb594cb4/src/Conversion/ONNXToKrnl/Math/Reduction.cpp#L490 would fix the problem.

In general, SIMD works on x86 Linux, so I have to assume it does too on Windows.
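To illustrate the suggestion above, here is a minimal sketch of the pattern, not the actual code in Reduction.cpp. The function pickTripCount and its parameters are hypothetical. Only the variable name estimatedSimdLoopTripCount and the = 0 initializer come from this thread. When a variable is assigned on only some paths and read later, compilers can emit a "maybe uninitialized" warning, and a warnings-as-errors build (for example /WX on MSVC or -Werror on GCC/Clang) turns that warning into a hard error.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the pattern discussed above: a value that is only
// assigned on the SIMD path but is read unconditionally afterwards.
int64_t pickTripCount(bool simdEnabled, int64_t innermostDim,
                      int64_t vectorLength) {
  // Initializing at the declaration gives every path a defined value, so
  // "maybe uninitialized" diagnostics have nothing to flag.
  int64_t estimatedSimdLoopTripCount = 0;
  if (simdEnabled) {
    assert(vectorLength > 0 && "vector length must be positive");
    estimatedSimdLoopTripCount = innermostDim / vectorLength;
  }
  return estimatedSimdLoopTripCount;
}
```

In this sketch, a call with simdEnabled set to false returns 0 rather than reading an indeterminate value, which is the behavior the one-line fix is meant to guarantee.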

AlexandreEichenberger avatar Feb 26 '24 19:02 AlexandreEichenberger

The fix is in https://github.com/llvm/llvm-project/commit/c11627c2f4d550613a3cb360c89a0cf52d2eb720

python3kgae avatar Feb 27 '24 02:02 python3kgae

@python3kgae thanks so much!!!

tungld avatar Feb 27 '24 03:02 tungld

> @python3kgae thanks so much!!!

Thank you for creating this project :)

python3kgae avatar Feb 27 '24 04:02 python3kgae

Closing, as this was fixed by the recent LLVM uplift.

cjvolzka avatar Apr 15 '24 14:04 cjvolzka