onnx-mlir
onnx-mlir copied to clipboard
Handle a big value for the maximum trip count in ONNX.loop
There is segfault when running a model whose ONNX Loop’s trip count is a big value. Below are snippets of generated code when compiling the model:
At ONNX level (onnx-mlir --EmitONNXIR
):
%123 = "onnx.Constant"() {value = dense<9223372036854775807> : tensor<i64>} : () -> tensor<i64>
%129:8 = "onnx.Loop"(%123, %5, %124, %122, %122, %125, %126, %127, %128) ({
At MLIR level (onnx-mlir --EmitMLIR
):
%190 = "krnl.global"() {name = "constant_63", shape = [], value = dense<9223372036854775807> : tensor<i64>} : () -> memref<i64>
%203 = affine.load %190[] : memref<i64>
%204 = arith.index_cast %203 : i64 to index
%205 = memref.alloc(%204) {alignment = 16 : i64} : memref<?xmemref<?x30xf32>
It looks like the segfault happened because an invalid value was passed to alloc
. In this case the invalid value is a big value that can be negative when being casted.
This is a simple example:
func @test_loop_max_trip_count(%arg0: tensor<?x30xf32>) -> tensor<?x?x30xf32> {
%0 = "onnx.Constant"() {value = dense<9223372036854775807> : tensor<i64>} : () -> tensor<i64>
%1 = "onnx.Constant"() {value = dense<true> : tensor<i1>} : () -> tensor<i1>
%2 = "onnx.Constant"() {value = dense<0> : tensor<i32>} : () -> tensor<i32>
%3 = "onnx.Constant"() {value = dense<30> : tensor<i32>} : () -> tensor<i32>
%4:4 = "onnx.Loop"(%0, %1, %2, %3, %arg0) ({
^bb0(%arg1: tensor<i64>, %arg2: tensor<i1>, %arg3: tensor<i32>, %arg4: tensor<i32>, %arg5: tensor<?x30xf32>):
%5 = "onnx.Constant"() {value = dense<1> : tensor<i32>} : () -> tensor<i32>
%6 = "onnx.Add"(%arg3, %5) : (tensor<i32>, tensor<i32>) -> tensor<i32>
%7 = "onnx.Relu"(%arg5) : (tensor<?x30xf32>) -> tensor<?x30xf32>
%8 = "onnx.Less"(%6, %arg4) : (tensor<i32>, tensor<i32>) -> tensor<i1>
onnx.Return %8, %6, %arg4, %7 : tensor<i1>, tensor<i32>, tensor<i32>, tensor<?x30xf32>
}) : (tensor<i64>, tensor<i1>, tensor<i32>, tensor<i32>, tensor<?x30xf32>) -> (tensor<i32>, tensor<i32>, tensor<?x30xf32>, tensor<?x?x30xf32>)
return %4#3 : tensor<?x?x30xf32>
}
For what it's worth, 9223372036854775807 is LONG_MAX (0x7FFFFFFFFFFFFFFF), the largest positive number representable by a 64-bit signed long.
@tungld do you have a reduced test case that can be used to reproduce the issue. The information provided in this issue is not sufficient to reproduce the problem.
@etiotto I haven't yet had it. Will try to prepare one example.
As discussed in our meetings, a naive solution would simply force down inf to, say 30. We won't crash anymore, and the one between 30-inf will return wrong values as opposed to crash.
Then, we can do a smarter approach with dyn data structures and/or inspect/execute if the trip count is not dependent on data (which I think may be the case if we have "compute until error is smaller than xxx"
Some PRs contributes to solving this issue: #1449 #1462