[regression-onnx] Numeric regression for Conv ops


What happened?

For the given IR:

    %994 = torch.operator "onnx.DequantizeLinear"(%676, %674, %675) : (!torch.vtensor<[1024,32,3,3],si8>, !torch.vtensor<[],f32>, !torch.vtensor<[],si8>) -> !torch.vtensor<[1024,32,3,3],f32>
    %995 = torch.operator "onnx.DequantizeLinear"(%679, %677, %678) : (!torch.vtensor<[1024],si8>, !torch.vtensor<[],f32>, !torch.vtensor<[],si8>) -> !torch.vtensor<[1024],f32>
    %996 = torch.operator "onnx.DequantizeLinear"(%686, %684, %685) : (!torch.vtensor<[2048,1024,1,1],si8>, !torch.vtensor<[],f32>, !torch.vtensor<[],si8>) -> !torch.vtensor<[2048,1024,1,1],f32>
    %997 = torch.operator "onnx.DequantizeLinear"(%689, %687, %688) : (!torch.vtensor<[2048],si8>, !torch.vtensor<[],f32>, !torch.vtensor<[],si8>) -> !torch.vtensor<[2048],f32>
    %1456 = torch.operator "onnx.QuantizeLinear"(%arg1, %657, %656) : (!torch.vtensor<[1,1024,14,14],f32>, !torch.vtensor<[],f32>, !torch.vtensor<[],si8>) -> !torch.vtensor<[1,1024,14,14],si8>
    %1457 = torch.operator "onnx.DequantizeLinear"(%1456, %657, %656) : (!torch.vtensor<[1,1024,14,14],si8>, !torch.vtensor<[],f32>, !torch.vtensor<[],si8>) -> !torch.vtensor<[1,1024,14,14],f32>
    %1465 = torch.operator "onnx.Conv"(%1457, %994, %995) {torch.onnx.dilations = [1 : si64, 1 : si64], torch.onnx.group = 32 : si64, torch.onnx.kernel_shape = [3 : si64, 3 : si64], torch.onnx.pads = [1 : si64, 1 : si64, 1 : si64, 1 : si64], torch.onnx.strides = [2 : si64, 2 : si64]} : (!torch.vtensor<[1,1024,14,14],f32>, !torch.vtensor<[1024,32,3,3],f32>, !torch.vtensor<[1024],f32>) -> !torch.vtensor<[1,1024,7,7],f32> 
    %1467 = torch.operator "onnx.QuantizeLinear"(%1465, %681, %680) : (!torch.vtensor<[1,1024,7,7],f32>, !torch.vtensor<[],f32>, !torch.vtensor<[],si8>) -> !torch.vtensor<[1,1024,7,7],si8>
    %1468 = torch.operator "onnx.DequantizeLinear"(%1467, %681, %680) : (!torch.vtensor<[1,1024,7,7],si8>, !torch.vtensor<[],f32>, !torch.vtensor<[],si8>) -> !torch.vtensor<[1,1024,7,7],f32>
    %1469 = torch.operator "onnx.Conv"(%1468, %996, %997) {torch.onnx.dilations = [1 : si64, 1 : si64], torch.onnx.group = 1 : si64, torch.onnx.kernel_shape = [1 : si64, 1 : si64], torch.onnx.pads = [0 : si64, 0 : si64, 0 : si64, 0 : si64], torch.onnx.strides = [1 : si64, 1 : si64]} : (!torch.vtensor<[1,1024,7,7],f32>, !torch.vtensor<[2048,1024,1,1],f32>, !torch.vtensor<[2048],f32>) -> !torch.vtensor<[1,2048,7,7],f32>
    return %1469 : !torch.vtensor<[1,2048,7,7],f32>

we are seeing a numeric regression.

If we pass %arg1 in place of %1457 as the input to the Conv operator, the outputs match.
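
For reference, a minimal sketch of that substitution (assuming the rest of the IR is unchanged): feeding %arg1 straight into the Conv bypasses the QuantizeLinear/DequantizeLinear round-trip at %1456/%1457, and the types still match since %arg1 and %1457 are both !torch.vtensor<[1,1024,14,14],f32>.

    %1465 = torch.operator "onnx.Conv"(%arg1, %994, %995) {torch.onnx.dilations = [1 : si64, 1 : si64], torch.onnx.group = 32 : si64, torch.onnx.kernel_shape = [3 : si64, 3 : si64], torch.onnx.pads = [1 : si64, 1 : si64, 1 : si64, 1 : si64], torch.onnx.strides = [2 : si64, 2 : si64]} : (!torch.vtensor<[1,1024,14,14],f32>, !torch.vtensor<[1024,32,3,3],f32>, !torch.vtensor<[1024],f32>) -> !torch.vtensor<[1,1024,7,7],f32>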

Steps to reproduce your issue

Command:

    iree-compile test.mlir --iree-hal-target-backends=rocm --iree-hip-target=gfx942 -o test.vmfb
    iree-run-module --module='test.vmfb' --device=hip --function='main_graph' --input='[email protected]

Attachments: expected_output.bin.txt, input.1.bin.txt, test.mlir.txt

What component(s) does this issue relate to?

No response

Version information

No response

Additional context

No response

pdhirajkumarprasad · Dec 11 '25

Surprisingly, the error came down to this review comment in the original PR.

I am very surprised by this, as it comes down to the overflow<nsw> flag on arith.addi. When I added the flag to all arith.addi ops in the upstream pipeline, the issue went away. What really confuses me is that we are talking about this addi:

    %45 = scf.for %arg3 = %c0 to %c8 step %c1 iter_args(%arg4 = %25) -> (vector<1x1x1x1x4x1xi32>) {
      %67 = arith.addi %arg3, %c1 overflow<nsw> : index

It is never going to overflow, so why does the flag matter? Are we perhaps mishandling the op when the flag is absent?
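
For context on why the flag can matter at all: overflow<nsw> ("no signed wrap") asserts that signed overflow cannot happen, and in arith, as in LLVM, an addi carrying it yields poison if it does, so the optimizer may reason under a no-wrap assumption. A minimal sketch, with hypothetical values:

    // With the flag: the optimizer may assume no signed wrap, e.g. that
    // %next is strictly greater than %i under signed comparison.
    %next = arith.addi %i, %c1 overflow<nsw> : index
    // Without the flag: two's-complement wraparound semantics must be
    // preserved, so that assumption is not available.
    %wrap = arith.addi %i, %c1 : index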

cc @krzysz00 to see if he can provide a reasonable explanation for this.

Also cc @jerryyin (to take a look when back from break).

Edit: Looking at the optimized.ll and .rocasm files, this change produces drastically different IR/code; I will go ahead and send an upstream patch to fix this. My best guess is that even though the flag is on a value that doesn't overflow, as we lower it the flag gets propagated to a situation where we actually need these semantics.

nirvedhmeshram · Dec 11 '25

@bjacob maybe this is related to the intermittent ResNet ONNX failures in CI?

zjgarvey · Dec 12 '25

@nirvedhmeshram, being on LLVM integration this week, I'm available for (and interested in) testing any work-in-progress patch that you might have, and/or stepping in to help as needed.

bjacob · Dec 15 '25

@zjgarvey It might be, but this issue is consistent rather than intermittent; it happens in a very specific case of strided convs.

@bjacob Thanks. The issue is triaged to an optimization after scf-to-cf that folds duplicate blocks; this folding was previously blocked because CSE did not fire, due to an overflow flag on an addition op. The current theory is that the backend is not able to handle the optimized IR. I will update here if there is a fix to try.
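
As a minimal sketch (hypothetical values) of the CSE interaction described above: CSE only merges operations that match on attributes as well as operands, so an addition carrying the overflow flag and one without stay distinct, which in turn kept the duplicate blocks from being recognized as identical and folded.

    // Not CSE'd: operands and result type match, but the overflow
    // attribute differs between the two ops.
    %a = arith.addi %i, %c1 overflow<nsw> : index
    %b = arith.addi %i, %c1 : index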

nirvedhmeshram · Dec 15 '25