testcase failing due to commit 7b23a1f5d87064daded4b89487dc968fd933051e

Open amd-vivekag opened this issue 10 months ago • 11 comments

The following commit (https://github.com/llvm/torch-mlir/commit/7b23a1f5d87064daded4b89487dc968fd933051e) has caused the testcase failure: https://github.com/iree-org/iree-test-suites/tree/main/onnx_ops/onnx/node/generated/test_averagepool_2d_ceil/

This is blocking the torch-mlir bump in IREE.

Getting error:

EXEC @test_averagepool_2d_ceil
[FAILED] result[0]: element at index 1 (5) does not match the expected (7.5); expected that the view is equal to contents of a view of 1x1x2x2xf32
  expected:
1x1x2x2xf32=[[[6 7.5][12 13.5]]]
  actual:
1x1x2x2xf32=[[[6 5][8 6]]]

After removing this commit, the error is not there.
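
For readers trying to interpret the numbers above, here is a minimal sketch of one explanation that fits them. It is not the torch-mlir lowering. The input (a 4x4 grid holding 1..16), the 3x3 kernel, the stride of 2 and the absence of padding are assumptions about what test_averagepool_2d_ceil uses, and `avg_pool_2d_ceil` and `clamp_divisor` are hypothetical names. Under those assumptions, dividing each window sum by the number of elements actually inside the input gives the expected values, while always dividing by the full kernel size (9) gives the values reported as actual.

```python
import math


def avg_pool_2d_ceil(x, kernel, stride, clamp_divisor):
    """Average-pool a 2D list of numbers with ceil_mode=True and no padding.

    Simplified illustration only: square kernel, same stride in both
    dimensions, and every window is assumed to start inside the input.
    """
    rows, cols = len(x), len(x[0])
    # ceil_mode rounds the output size up, so the last window may hang
    # over the edge of the input.
    out_rows = math.ceil((rows - kernel) / stride) + 1
    out_cols = math.ceil((cols - kernel) / stride) + 1
    out = []
    for i in range(out_rows):
        out_row = []
        for j in range(out_cols):
            r0, c0 = i * stride, j * stride
            # Clip the window to the input bounds before summing.
            r1, c1 = min(r0 + kernel, rows), min(c0 + kernel, cols)
            total = sum(x[r][c] for r in range(r0, r1) for c in range(c0, c1))
            if clamp_divisor:
                divisor = (r1 - r0) * (c1 - c0)  # elements actually covered
            else:
                divisor = kernel * kernel        # full kernel size, always 9 here
            out_row.append(total / divisor)
        out.append(out_row)
    return out


# Assumed test input: 1..16 laid out as a 4x4 grid.
x = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
clamped = avg_pool_2d_ceil(x, kernel=3, stride=2, clamp_divisor=True)
unclamped = avg_pool_2d_ceil(x, kernel=3, stride=2, clamp_divisor=False)
print(clamped)    # [[6.0, 7.5], [12.0, 13.5]]
print(unclamped)  # [[6.0, 5.0], [8.0, 6.0]]
```

For example, the top-right window covers 3, 4, 7, 8, 11 and 12 (sum 45): 45 / 6 = 7.5 matches the expected value and 45 / 9 = 5 matches the actual one.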

amd-vivekag avatar Mar 07 '25 11:03 amd-vivekag

@amd-vivekag I can't reproduce this issue. The test passed:

PASSED onnx/node/generated/test_averagepool_2d_ceil/run_module_io_flags.txt::model.mlir::cpu_llvm_sync

I followed the instructions here: https://github.com/iree-org/iree-test-suites/tree/main/onnx_ops

The printout above came from this command:

pytest -n auto -rA --timeout=30 --durations=20 --config-files=configs/onnx_ops_cpu_llvm_sync.json --report-log=/tmp/onnx_ops_cpu_logs.json

I made sure that the torch-mlir-opt being picked up is the one I just synced and built, and I verified that it contains my average pooling change.

$ which torch-mlir-opt
/myfolder/torch-mlir/build/bin/torch-mlir-opt

Can you please let me know which steps you used to reproduce this issue?

ivangarcia44 avatar Mar 07 '25 20:03 ivangarcia44

Hi @amd-vivekag, I have not heard back from you in a month. Are things working now? If yes, can this issue be closed?

ivangarcia44 avatar Apr 02 '25 15:04 ivangarcia44

Hi @ivangarcia44, I'm sorry, I somehow missed your message from last month.

I was seeing the failure in CI with the CL mentioned above. I'll have to check if I was able to reproduce it locally (as far as I remember, I was able to reproduce it). I'll get back to you with more details.

Thanks for following up. Really appreciate it.

amd-vivekag avatar Apr 02 '25 16:04 amd-vivekag

Hi @ivangarcia44 ,

I'm able to reproduce the issue. I followed these steps:

  1. git clone git@github.com:iree-org/iree-test-suites.git
  2. cd iree-test-suites/onnx_ops
  3. created python environment as per the instructions mentioned in onnx_ops/README.md
  4. pip install -r requirements.txt
  5. pip install -r requirements-iree.txt
  6. Ran the following command: pytest -v -n auto -rA --timeout=30 --durations=20 --config-files=configs/onnx_ops_cpu_llvm_sync.json --report-log=/tmp/onnx_ops_cpu_logs.json -k test_averagepool_2d_ceil

You should see the following message:

Stdout diagnostics:
EXEC @test_averagepool_2d_ceil
[FAILED] result[0]: element at index 1 (5) does not match the expected (7.5); expected that the view is equal to contents of a view of 1x1x2x2xf32
  expected:
1x1x2x2xf32=[[[6 7.5][12 13.5]]]
  actual:
1x1x2x2xf32=[[[6 5][8 6]]]

Please let me know if you need any other information from my side in this regard.

amd-vivekag avatar Apr 03 '25 06:04 amd-vivekag

@amd-vivekag I am able to reproduce the failure. I am taking a look at it.

ivangarcia44 avatar Apr 03 '25 21:04 ivangarcia44

This issue is fixed here: https://github.com/llvm/torch-mlir/pull/4144

ivangarcia44 avatar Apr 28 '25 22:04 ivangarcia44

The fix for this issue was merged into torch-mlir: https://github.com/llvm/torch-mlir/pull/4144. Closing this issue.

ivangarcia44 avatar May 22 '25 16:05 ivangarcia44

Hi @amd-vivekag, the fix for this issue was merged in torch-mlir: https://github.com/llvm/torch-mlir/pull/4144. Can you please close this issue when you get a chance? I can't close it.

ivangarcia44 avatar May 22 '25 16:05 ivangarcia44

@ivangarcia44 I'm still seeing the issue. Your changes in torch-mlir are not part of the IREE build yet, so it might take some time. Can you please let me know once the IREE build has your changes?

amd-vivekag avatar May 26 '25 10:05 amd-vivekag

@amd-vivekag I will keep monitoring this and let you know when the change is integrated. Thanks, Ivan

ivangarcia44 avatar May 26 '25 13:05 ivangarcia44

Note on how to find out whether the issue is gone. As of today, June 23rd, 2025, the bug fix has not yet made it into IREE.

git clone https://github.com/iree-org/iree.git
cd iree
git submodule update --init
find . -name "Pooling.cpp" | xargs grep -nH "doesAvgPoolDivisorNeedsClamping"

If the grep above returns a hit, then IREE has picked up the https://github.com/llvm/torch-mlir/pull/4144 bug fix. In that case, run the reproduction steps that @amd-vivekag posted on April 3.
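
The same check can be sketched in Python for anyone who would rather script it than run find and grep. This is only an illustration: `find_symbol` is a hypothetical helper, and the defaults simply mirror the file name and identifier used in the command above. It reports every line in any Pooling.cpp under the given root that mentions the identifier.

```python
from pathlib import Path


def find_symbol(root, filename="Pooling.cpp",
                symbol="doesAvgPoolDivisorNeedsClamping"):
    """Return (path, line number, line) for each line mentioning `symbol`.

    Only files named `filename` under `root` are searched, which mirrors
    `find . -name Pooling.cpp | xargs grep -nH <symbol>`.
    """
    hits = []
    for path in sorted(Path(root).rglob(filename)):
        if not path.is_file():
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if symbol in line:
                hits.append((str(path), lineno, line.strip()))
    return hits
```

Calling `find_symbol(".")` from the root of the IREE checkout, after the submodules are initialized, returns an empty list when the identifier is absent and a non-empty list otherwise.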

ivangarcia44 avatar Jun 23 '25 18:06 ivangarcia44

Verified that the change in https://github.com/llvm/torch-mlir/pull/4144 made it to IREE and now the reproduction steps provided by @amd-vivekag pass with the message below. Closing this issue.

========================================= short test summary info ==========================================
PASSED onnx/node/generated/test_averagepool_2d_ceil/run_module_io_flags.txt::model.mlir::cpu_llvm_sync
============================================ 1 passed in 2.40s =============================================

ivangarcia44 avatar Oct 02 '25 00:10 ivangarcia44

Hi @amd-vivekag, I tried your reproduction steps (which previously failed) and they pass now. Verified that my fix made it to IREE. See the reproduction steps log in my previous message. Can you please close this issue since I don't have permission to do it?

ivangarcia44 avatar Oct 02 '25 00:10 ivangarcia44

Thanks @ivangarcia44 for following up on this.

amd-vivekag avatar Oct 07 '25 09:10 amd-vivekag