Incorrect output_signature() for models with dynamic dimensions
Description
The output signature for models with dynamic dimensions has changed. As a specific example, the model from https://github.ibm.com/ai-on-z/in-house-onnx-models/blob/main/TF-ONNX/ccf_gru_dyn_tf2onnx_new.onnx, as of commit 09965003137f82d8891676c986ec6403faa9c3cb, used to return (note the 1 for the final dim):
output signature: [ { "type" : "f32" , "dims" : [-1 , -1 , 1] , "name" : "dense" } ]
But as of commit 91ee7a2ed38e91cf480b06d0361899d2d63ce5ca it returns (note the -1 for the final dim):
output signature: [ { "type" : "f32" , "dims" : [-1 , -1 , -1] , "name" : "dense" } ]
Running the model through netron.app shows the final dimension in the .onnx file is indeed supposed to be 1.
Reproduction
The model was compiled using the flags:
--O3 --EmitLib --mtriple=s390x-ibm-loz --mcpu=z16 --maccel=NNPA ccf_lstm_dyn_tf2onnx_new.onnx
sig_client.py code:
import os
import sys
from PyRuntime import ExecutionSession
model_lib = sys.argv[1]
if not os.path.exists(model_lib):
    raise FileNotFoundError('model lib "' + model_lib + '" was not found.')
session = ExecutionSession(model_lib)
print("input signature: " + session.input_signature())
print("output signature: " + session.output_signature())
which was called using python 3.8:
PYTHONPATH=/code/build/lib python3 sig_client.py NNPA/ccf_lstm_dyn_tf2onnx_new.so
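Incidentally, the signature strings returned by PyRuntime are JSON, so the regression can be checked programmatically rather than by eye. A sketch, with the signature string pasted from the expected output above instead of a live session:

```python
import json

# Signature string copied from the expected output above;
# session.output_signature() returns the same JSON-formatted string.
sig = '[ { "type" : "f32" , "dims" : [-1 , -1 , 1] , "name" : "dense" } ]'

outputs = json.loads(sig)
final_dim = outputs[0]["dims"][-1]
print(final_dim)  # 1 in the expected signature; the regression yields -1 here
```

The same check can be dropped into sig_client.py against a real session to catch the bad signature automatically.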
This commit https://github.com/onnx/onnx-mlir/commit/a18b5447b3afdea9ad91745f442664f7858f7f3e may have changed the output signature.
Before that commit, signatures were taken directly from the input onnx model, so they matched what we see in netron.
After that commit, signatures are generated automatically after shape inference, so that they correctly reflect the input and output tensors of the entry point in the generated code.
I ran onnx-mlir --EmitONNXIR to see the output shape, and it is return %47 : tensor<?x?x?xf32>. So the signature itself is correctly generated from the inferred type.
tungld@53315fb59bc7:~/dl/in-house-onnx-models/TF-ONNX$ /home/tungld/dl/onnx-mlir/build/Debug/bin/onnx-mlir --EmitONNXIR ccf_gru_dyn_tf2onnx_new.onnx
tungld@53315fb59bc7:~/dl/in-house-onnx-models/TF-ONNX$ tail ccf_gru_dyn_tf2onnx_new.tmp
%42 = "onnx.Constant"() : () -> tensor<200x1xf32>
%43 = "onnx.MatMul"(%41, %42) {onnx_node_name = "sequential/dense/Tensordot/MatMul"} : (tensor<?x200xf32>, tensor<200x1xf32>) -> tensor<?x1xf32>
%44 = "onnx.Reshape"(%43, %39) {onnx_node_name = "sequential/dense/Tensordot"} : (tensor<?x1xf32>, tensor<3xi64>) -> tensor<?x?x?xf32>
%45 = "onnx.Constant"() : () -> tensor<1xf32>
%46 = "onnx.Add"(%44, %45) {onnx_node_name = "sequential/dense/BiasAdd"} : (tensor<?x?x?xf32>, tensor<1xf32>) -> tensor<?x?x?xf32>
%47 = "onnx.Sigmoid"(%46) {onnx_node_name = "sequential/dense/Sigmoid"} : (tensor<?x?x?xf32>) -> tensor<?x?x?xf32>
return %47 : tensor<?x?x?xf32>
}
"onnx.EntryPoint"() {func = @main_graph} : () -> ()
}
So this issue perhaps shows that we should improve our shape inference.
So our shape inference lost track that the input was 1 for the last dim? Or the input to the model function did not specify 1? What was it @tungld ?
I think our shape inference lost track of the 1 from the onnx.Reshape near the end of the function. Please look at %44 = "onnx.Reshape" in the following code, where the new shape is not a compile-time constant:
%32 = "onnx.SqueezeV11"(%Y_0) {axes = [1], onnx_node_name = "Squeeze__51"} : (tensor<?x1x?x200xf32>) -> tensor<?x?x200xf32>
%33 = "onnx.Shape"(%32) {onnx_node_name = "sequential/dense/Tensordot/Shape"} : (tensor<?x?x200xf32>) -> tensor<3xi64>
%34 = "onnx.Cast"(%33) {onnx_node_name = "sequential/dense/Tensordot/Shape__68", to = i32} : (tensor<3xi64>) -> tensor<3xi32>
%35 = "onnx.Constant"() {value = dense<[0, 1]> : tensor<2xi32>} : () -> tensor<2xi32>
%36 = "onnx.Gather"(%34, %35) {axis = 0 : si64, onnx_node_name = "sequential/dense/Tensordot/GatherV2"} : (tensor<3xi32>, tensor<2xi32>) -> tensor<2xi32>
%37 = "onnx.Constant"() {value = dense<1> : tensor<1xi32>} : () -> tensor<1xi32>
%38 = "onnx.Concat"(%36, %37) {axis = 0 : si64, onnx_node_name = "sequential/dense/Tensordot/concat_1"} : (tensor<2xi32>, tensor<1xi32>) -> tensor<3xi32>
%39 = "onnx.Cast"(%38) {onnx_node_name = "sequential/dense/Tensordot__73", to = i64} : (tensor<3xi32>) -> tensor<3xi64>
%44 = "onnx.Reshape"(%43, %39) {onnx_node_name = "sequential/dense/Tensordot"} : (tensor<?x1xf32>, tensor<3xi64>) -> tensor<?x?x?xf32>
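For intuition: whatever the dynamic dimensions turn out to be at runtime, the shape computation above always ends by concatenating a literal 1, so the reshape result necessarily has 1 as its last dim. A numpy sketch mirroring the subgraph (the concrete sizes 5 and 7 are made up):

```python
import numpy as np

# Mirror of the subgraph: Shape -> Gather([0, 1]) -> Concat(..., [1]) -> Reshape.
x = np.zeros((5, 7, 200), dtype=np.float32)  # stands in for %32: tensor<?x?x200xf32>
shape = np.array(x.shape, dtype=np.int64)    # onnx.Shape  -> [5, 7, 200]
prefix = shape[[0, 1]]                       # onnx.Gather -> [5, 7]
target = np.concatenate([prefix, np.array([1], dtype=np.int64)])  # onnx.Concat
y = np.zeros((5 * 7, 1), dtype=np.float32).reshape(target)        # the Reshape (%44)
print(y.shape)  # (5, 7, 1): the last dim is 1 regardless of the dynamic dims
```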
It seems that we could derive from onnx.Concat(%36, %37) that the last element of the new shape %39 is 1, since %37 is the constant 1.
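A minimal sketch of that refinement (the helper name and the (length, values) representation are hypothetical, not onnx-mlir APIs): when the target shape fed to Reshape comes from a Concat along axis 0, any constant input pins down the corresponding elements of the result, even though the other inputs are only known at runtime.

```python
from typing import List, Optional, Tuple

def infer_concat_values(
    inputs: List[Tuple[int, Optional[List[int]]]]
) -> List[Optional[int]]:
    """Each input is (length, values): the 1-D tensor's length (known from
    its type) and its element values if it is a compile-time constant, else
    None. Unknown elements of the concatenated result become None."""
    out: List[Optional[int]] = []
    for length, values in inputs:
        out.extend([None] * length if values is None else values)
    return out

# %36 is a dynamic length-2 slice of the shape; %37 is the constant [1].
target = infer_concat_values([(2, None), (1, [1])])
print(target)  # [None, None, 1]: the Reshape output's last dim is statically 1
```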
Any update on this one or is investigation ongoing?
Any ETA on when this one will be resolved?
@tungld do you mind looking into this one? If you think this might be resolved by the newer shape inference scheme, then let's work on that one instead of a quick fix for this issue.
I tried the fix. It appears to work for CPU-compiled models, but when I tried it for NNPA I still got the incorrect output shape (the final output dim was -1 instead of the expected 1).
CPU: --O3 --EmitLib --mtriple=s390x-ibm-loz --mcpu=z14 ccf_lstm_dyn_tf2onnx_new.onnx -o CPU/ccf_lstm_dyn_tf2onnx_new.zosdev.pr-551 -v
Model Signatures /Dreamer/cjvolzka/dlc-automation//build/cjv-models/CPU/ccf_lstm_dyn_tf2onnx_new.zosdev.pr-551.so
Input: [ { "type" : "f32" , "dims" : [-1 , -1 , 204] , "name" : "input" }]
Output: [ { "type" : "f32" , "dims" : [-1 , -1 , 1] , "name" : "dense" }]
NNPA: --O3 --EmitLib --mtriple=s390x-ibm-loz --mcpu=z16 --maccel=NNPA ccf_lstm_dyn_tf2onnx_new.onnx -o NNPA/ccf_lstm_dyn_tf2onnx_new.zosdev.pr-551 -v
Model Signatures /Dreamer/cjvolzka/dlc-automation//build/cjv-models/NNPA/ccf_lstm_dyn_tf2onnx_new.zosdev.pr-551.so
Input: [ { "type" : "f32" , "dims" : [-1 , -1 , 204] , "name" : "input" }]
Output: [ { "type" : "f32" , "dims" : [-1 , -1 , -1] , "name" : "dense" }]
@tungld is it because the signature is only refined to a more precise shape for element-wise ops?
@cjvolzka I think the issue was fixed for NNPA. Please check it again. Thanks!
Yes it has. Thanks for reminding me!