onnx-mlir icon indicating copy to clipboard operation
onnx-mlir copied to clipboard

Handle a big value for the maximum trip count in ONNX.loop

Open tungld opened this issue 2 years ago • 5 comments

There is segfault when running a model whose ONNX Loop’s trip count is a big value. Below are snippets of generated code when compiling the model:

At ONNX level (onnx-mlir --EmitONNXIR):

%123 = "onnx.Constant"() {value = dense<9223372036854775807> : tensor<i64>} : () -> tensor<i64>
%129:8 = "onnx.Loop"(%123, %5, %124, %122, %122, %125, %126, %127, %128) ({

At MLIR level (onnx-mlir --EmitMLIR):

%190 = "krnl.global"() {name = "constant_63", shape = [], value = dense<9223372036854775807> : tensor<i64>} : () -> memref<i64>

%203 = affine.load %190[] : memref<i64>

%204 = arith.index_cast %203 : i64 to index

%205 = memref.alloc(%204) {alignment = 16 : i64} : memref<?xmemref<?x30xf32>

It looks like the segfault happened because an invalid value was passed to alloc. In this case the invalid value is a big value that can be negative when being casted.

This is a simple example:

func @test_loop_max_trip_count(%arg0: tensor<?x30xf32>) -> tensor<?x?x30xf32> {
  %0 = "onnx.Constant"() {value = dense<9223372036854775807> : tensor<i64>} : () -> tensor<i64>
  %1 = "onnx.Constant"() {value = dense<true> : tensor<i1>} : () -> tensor<i1>
  %2 = "onnx.Constant"() {value = dense<0> : tensor<i32>} : () -> tensor<i32>
  %3 = "onnx.Constant"() {value = dense<30> : tensor<i32>} : () -> tensor<i32>
  %4:4 = "onnx.Loop"(%0, %1, %2, %3, %arg0) ({
  ^bb0(%arg1: tensor<i64>, %arg2: tensor<i1>, %arg3: tensor<i32>, %arg4: tensor<i32>, %arg5: tensor<?x30xf32>):
    %5 = "onnx.Constant"() {value = dense<1> : tensor<i32>} : () -> tensor<i32>
    %6 = "onnx.Add"(%arg3, %5) : (tensor<i32>, tensor<i32>) -> tensor<i32>
    %7 = "onnx.Relu"(%arg5) : (tensor<?x30xf32>) -> tensor<?x30xf32>
    %8 = "onnx.Less"(%6, %arg4) : (tensor<i32>, tensor<i32>) -> tensor<i1>
    onnx.Return %8, %6, %arg4, %7 : tensor<i1>, tensor<i32>, tensor<i32>, tensor<?x30xf32>
  }) : (tensor<i64>, tensor<i1>, tensor<i32>, tensor<i32>, tensor<?x30xf32>) -> (tensor<i32>, tensor<i32>, tensor<?x30xf32>, tensor<?x?x30xf32>)
  return %4#3 : tensor<?x?x30xf32>
}

tungld avatar May 11 '22 03:05 tungld

For what it's worth, 9223372036854775807 is LONG_MAX (0x7FFFFFFFFFFFFFFF), the largest positive number representable by a 64-bit signed long.

gongsu832 avatar May 11 '22 03:05 gongsu832

@tungld do you have a reduced test case that can be used to reproduce the issue. The information provided in this issue is not sufficient to reproduce the problem.

etiotto avatar May 12 '22 13:05 etiotto

@etiotto I haven't yet had it. Will try to prepare one example.

tungld avatar May 13 '22 05:05 tungld

As discussed in our meetings, a naive solution would simply force down inf to, say 30. We won't crash anymore, and the one between 30-inf will return wrong values as opposed to crash.

Then, we can do a smarter approach with dyn data structures and/or inspect/execute if the trip count is not dependent on data (which I think may be the case if we have "compute until error is smaller than xxx"

AlexandreEichenberger avatar May 15 '22 21:05 AlexandreEichenberger

Some PRs contributes to solving this issue: #1449 #1462

tungld avatar May 26 '22 13:05 tungld