
Incorrect output_signature() for models with dynamic dimensions

Open cjvolzka opened this issue 2 years ago • 4 comments

Description

The output signature from dynamic models has changed. As a specific example, the model from https://github.ibm.com/ai-on-z/in-house-onnx-models/blob/main/TF-ONNX/ccf_gru_dyn_tf2onnx_new.onnx, as of 09965003137f82d8891676c986ec6403faa9c3cb, used to return (note the 1 for the final dim):

output signature: [   { "type" : "f32" , "dims" : [-1 , -1 , 1] , "name" : "dense" }

]

But as of 91ee7a2ed38e91cf480b06d0361899d2d63ce5ca it returns (note the -1 for the final dim):

output signature: [   { "type" : "f32" , "dims" : [-1 , -1 , -1] , "name" : "dense" }

]

Running the model through netron.app shows the final dimension in the .onnx file is indeed supposed to be 1.
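
Independently of the compiled library, the output shape declared in the .onnx file itself (what Netron displays) can be read with the onnx Python package. A minimal sketch, assuming that package is installed and the model file is in the current directory:

import onnx

model = onnx.load("ccf_gru_dyn_tf2onnx_new.onnx")
for out in model.graph.output:
    # Dimensions with a concrete dim_value are static; anything else is dynamic.
    dims = [d.dim_value if d.HasField("dim_value") else -1
            for d in out.type.tensor_type.shape.dim]
    print(out.name, dims)  # per this issue, the 'dense' output should end in 1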

Reproduction

Model was compiled using flags

--O3 --EmitLib --mtriple=s390x-ibm-loz --mcpu=z16 --maccel=NNPA ccf_lstm_dyn_tf2onnx_new.onnx

sig_client.py code:

import os
import sys

from PyRuntime import ExecutionSession

# Path to the compiled model library (.so) produced by onnx-mlir.
model_lib = sys.argv[1]
if not os.path.exists(model_lib):
    raise FileNotFoundError('model lib "' + model_lib + '" was not found.')

# Load the compiled model and print the signatures reported by the runtime.
session = ExecutionSession(model_lib)
print("input signature: " + session.input_signature())
print("output signature: " + session.output_signature())

which was called using Python 3.8:

 PYTHONPATH=/code/build/lib python3 sig_client.py NNPA/ccf_lstm_dyn_tf2onnx_new.so

cjvolzka avatar Jun 02 '22 21:06 cjvolzka

This commit, https://github.com/onnx/onnx-mlir/commit/a18b5447b3afdea9ad91745f442664f7858f7f3e, may have changed the output signature.

Before that commit, signatures were taken directly from the input ONNX model, so they were the same as what we see in Netron.

After that commit, signatures are generated automatically after shape inference, so that they correctly reflect the input and output tensors of the entry point in the generated code.

I ran onnx-mlir --EmitONNXIR to see the output shape, and it is return %47 : tensor<?x?x?xf32>. So the signature is generated correctly from the inferred shape.

tungld@53315fb59bc7:~/dl/in-house-onnx-models/TF-ONNX$ /home/tungld/dl/onnx-mlir/build/Debug/bin/onnx-mlir --EmitONNXIR ccf_gru_dyn_tf2onnx_new.onnx
tungld@53315fb59bc7:~/dl/in-house-onnx-models/TF-ONNX$ tail ccf_gru_dyn_tf2onnx_new.tmp
    %42 = "onnx.Constant"() : () -> tensor<200x1xf32>
    %43 = "onnx.MatMul"(%41, %42) {onnx_node_name = "sequential/dense/Tensordot/MatMul"} : (tensor<?x200xf32>, tensor<200x1xf32>) -> tensor<?x1xf32>
    %44 = "onnx.Reshape"(%43, %39) {onnx_node_name = "sequential/dense/Tensordot"} : (tensor<?x1xf32>, tensor<3xi64>) -> tensor<?x?x?xf32>
    %45 = "onnx.Constant"() : () -> tensor<1xf32>
    %46 = "onnx.Add"(%44, %45) {onnx_node_name = "sequential/dense/BiasAdd"} : (tensor<?x?x?xf32>, tensor<1xf32>) -> tensor<?x?x?xf32>
    %47 = "onnx.Sigmoid"(%46) {onnx_node_name = "sequential/dense/Sigmoid"} : (tensor<?x?x?xf32>) -> tensor<?x?x?xf32>
    return %47 : tensor<?x?x?xf32>
  }
  "onnx.EntryPoint"() {func = @main_graph} : () -> ()
}

So this issue perhaps shows that we should improve our shape inference.

tungld avatar Jun 03 '22 00:06 tungld

So our shape inference lost track that the last dim was 1? Or did the input to the model function not specify 1? What was it, @tungld?

AlexandreEichenberger avatar Jun 03 '22 18:06 AlexandreEichenberger

So our shape inference lost track that the last dim was 1? Or did the input to the model function not specify 1? What was it, @tungld?

I think our shape inference lost track of the 1 from onnx.Reshape near the end of the function. Please look at %44 = "onnx.Reshape" in the following code, where the new shape is not a compile-time constant:

%32 = "onnx.SqueezeV11"(%Y_0) {axes = [1], onnx_node_name = "Squeeze__51"} : (tensor<?x1x?x200xf32>) -> tensor<?x?x200xf32>
%33 = "onnx.Shape"(%32) {onnx_node_name = "sequential/dense/Tensordot/Shape"} : (tensor<?x?x200xf32>) -> tensor<3xi64>
%34 = "onnx.Cast"(%33) {onnx_node_name = "sequential/dense/Tensordot/Shape__68", to = i32} : (tensor<3xi64>) -> tensor<3xi32>
%35 = "onnx.Constant"() {value = dense<[0, 1]> : tensor<2xi32>} : () -> tensor<2xi32>
%36 = "onnx.Gather"(%34, %35) {axis = 0 : si64, onnx_node_name = "sequential/dense/Tensordot/GatherV2"} : (tensor<3xi32>, tensor<2xi32>) -> tensor<2xi32>
%37 = "onnx.Constant"() {value = dense<1> : tensor<1xi32>} : () -> tensor<1xi32>
%38 = "onnx.Concat"(%36, %37) {axis = 0 : si64, onnx_node_name = "sequential/dense/Tensordot/concat_1"} : (tensor<2xi32>, tensor<1xi32>) -> tensor<3xi32>
%39 = "onnx.Cast"(%38) {onnx_node_name = "sequential/dense/Tensordot__73", to = i64} : (tensor<3xi32>) -> tensor<3xi64>


%44 = "onnx.Reshape"(%43, %39) {onnx_node_name = "sequential/dense/Tensordot"} : (tensor<?x1xf32>, tensor<3xi64>) -> tensor<?x?x?xf32>

It seems that we could derive that the last element of the new shape %39 is 1 by looking at onnx.Concat(%36, %37), where %37 is the constant 1.
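
For illustration only, here is a minimal Python sketch of that idea (hypothetical helpers, not onnx-mlir's actual C++ shape-inference code): any element of the Reshape shape operand that comes from an onnx.Constant is known at compile time, so the corresponding result dimension can be made static.

UNKNOWN = None  # element of the shape tensor not known at compile time

def concat_elements(*operands):
    # onnx.Concat on 1-D shape tensors simply appends the operands' elements.
    out = []
    for elems in operands:
        out.extend(elems)
    return out

def reshape_result_dims(shape_elems):
    # Known elements become static dims; unknown ones stay dynamic (-1).
    return [e if e is not UNKNOWN else -1 for e in shape_elems]

# %36 = Gather(Cast(Shape(%32)), [0, 1]) -> two values only known at runtime
# %37 = Constant([1])                    -> known: the trailing 1
new_shape = concat_elements([UNKNOWN, UNKNOWN], [1])
print(reshape_result_dims(new_shape))    # [-1, -1, 1] instead of [-1, -1, -1]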

tungld avatar Jun 06 '22 02:06 tungld

Any update on this one or is investigation ongoing?

cjvolzka avatar Jun 23 '22 16:06 cjvolzka

Any ETA on when this one will be resolved?

cjvolzka avatar Aug 25 '22 13:08 cjvolzka

@tungld do you mind looking into this one? If you think this might be resolved by the newer shape-inference scheme, then let's work on that instead of a quick fix for this issue.

AlexandreEichenberger avatar Aug 25 '22 14:08 AlexandreEichenberger

I tried the fix. It appears to work for CPU-compiled models, but when I tried it for NNPA, I still got the incorrect output shape (the final output dim was -1 instead of the expected 1).

CPU: --O3 --EmitLib --mtriple=s390x-ibm-loz --mcpu=z14 ccf_lstm_dyn_tf2onnx_new.onnx -o CPU/ccf_lstm_dyn_tf2onnx_new.zosdev.pr-551 -v

Model Signatures /Dreamer/cjvolzka/dlc-automation//build/cjv-models/CPU/ccf_lstm_dyn_tf2onnx_new.zosdev.pr-551.so
	Input: [    { "type" : "f32" , "dims" : [-1 , -1 , 204] , "name" : "input" }]
	Output: [   { "type" : "f32" , "dims" : [-1 , -1 , 1] , "name" : "dense" }]

NNPA: --O3 --EmitLib --mtriple=s390x-ibm-loz --mcpu=z16 --maccel=NNPA ccf_lstm_dyn_tf2onnx_new.onnx -o NNPA/ccf_lstm_dyn_tf2onnx_new.zosdev.pr-551 -v

Model Signatures /Dreamer/cjvolzka/dlc-automation//build/cjv-models/NNPA/ccf_lstm_dyn_tf2onnx_new.zosdev.pr-551.so
	Input: [    { "type" : "f32" , "dims" : [-1 , -1 , 204] , "name" : "input" }]
	Output: [   { "type" : "f32" , "dims" : [-1 , -1 , -1] , "name" : "dense" }]

cjvolzka avatar Aug 30 '22 14:08 cjvolzka

@tungld is it because the signature is only refined to a more precise shape for element-wise ops?

AlexandreEichenberger avatar Aug 31 '22 14:08 AlexandreEichenberger

@cjvolzka I think the issue was fixed for NNPA. Please check it again. Thanks!

tungld avatar Sep 15 '22 05:09 tungld

Yes it has. Thanks for reminding me!

cjvolzka avatar Sep 15 '22 12:09 cjvolzka