tensorflow-onnx
onnx output nan
Hi,
I am converting the model from a SavedModel to ONNX using python -m tf2onnx.convert --saved-model saved-model --output model.onnx. More details are below.
Model inference seems to work in Keras:
>>> import numpy as np
>>> from keras.models import load_model
>>> import tensorflow as tf
>>> keras_model = load_model('saved-model')
>>> x = np.zeros(shape=(1,60,18))
>>> y = np.zeros(shape=(1,5,6))
>>> keras_model.predict([x,y])
array([[9.994024e-01, 5.975381e-04]], dtype=float32)
However, the output is NaN in onnxruntime on the ONNX file produced with tf2onnx:
>>> import onnxruntime as rt
>>> sess = rt.InferenceSession('model.onnx')
>>> x = x.astype(np.float32)
>>> y = y.astype(np.float32)
>>> sess.run([sess.get_outputs()[0].name],{sess.get_inputs()[0].name:x, sess.get_inputs()[1].name:y})
[array([[nan, nan]], dtype=float32)]
There doesn't seem to be anything in the conversion step that is failing. What else could be going wrong? I can share the model if needed. Thank you.
>>> keras_model.summary()
Model: "model"
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
inputParticle (InputLayer) [(None, 60, 18)] 0
__________________________________________________________________________________________________
inputSV (InputLayer) [(None, 5, 6)] 0
__________________________________________________________________________________________________
inputNormParticle (BatchNormali (None, 60, 18) 72 inputParticle[0][0]
__________________________________________________________________________________________________
inputNormSV (BatchNormalization (None, 5, 6) 24 inputSV[0][0]
__________________________________________________________________________________________________
XdotRR (Lambda) (None, 3540, 18) 0 inputNormParticle[0][0]
__________________________________________________________________________________________________
XdotRS (Lambda) (None, 3540, 18) 0 inputNormParticle[0][0]
__________________________________________________________________________________________________
XdotRK (Lambda) (None, 300, 18) 0 inputParticle[0][0]
__________________________________________________________________________________________________
YdotRV (Lambda) (None, 300, 6) 0 inputNormSV[0][0]
__________________________________________________________________________________________________
Bpp (Lambda) (None, 3540, 36) 0 XdotRR[0][0]
XdotRS[0][0]
__________________________________________________________________________________________________
Bvp (Lambda) (None, 300, 24) 0 XdotRK[0][0]
YdotRV[0][0]
__________________________________________________________________________________________________
convOneParticle (Conv1D) (None, 3540, 60) 2220 Bpp[0][0]
__________________________________________________________________________________________________
convOneSV (Conv1D) (None, 300, 60) 1500 Bvp[0][0]
__________________________________________________________________________________________________
convTwoParticle (Conv1D) (None, 3540, 30) 1830 convOneParticle[0][0]
__________________________________________________________________________________________________
convTwoSV (Conv1D) (None, 300, 30) 1830 convOneSV[0][0]
__________________________________________________________________________________________________
convThreeParticle (Conv1D) (None, 3540, 20) 620 convTwoParticle[0][0]
__________________________________________________________________________________________________
convThreeSV (Conv1D) (None, 300, 20) 620 convTwoSV[0][0]
__________________________________________________________________________________________________
Epp (BatchNormalization) (None, 3540, 20) 80 convThreeParticle[0][0]
__________________________________________________________________________________________________
Evp (BatchNormalization) (None, 300, 20) 80 convThreeSV[0][0]
__________________________________________________________________________________________________
EppBar (Lambda) (None, 60, 20) 0 Epp[0][0]
__________________________________________________________________________________________________
EvpBar (Lambda) (None, 60, 20) 0 Evp[0][0]
__________________________________________________________________________________________________
C (Lambda) (None, 60, 58) 0 inputParticle[0][0]
EppBar[0][0]
EvpBar[0][0]
__________________________________________________________________________________________________
convPredictOne (Conv1D) (None, 60, 60) 3540 C[0][0]
__________________________________________________________________________________________________
convPredictTwo (Conv1D) (None, 60, 30) 1830 convPredictOne[0][0]
__________________________________________________________________________________________________
O (Conv1D) (None, 60, 24) 744 convPredictTwo[0][0]
__________________________________________________________________________________________________
OBar (Lambda) (None, 24) 0 O[0][0]
__________________________________________________________________________________________________
denseEndOne (Dense) (None, 50) 1250 OBar[0][0]
__________________________________________________________________________________________________
normEndOne (BatchNormalization) (None, 50) 200 denseEndOne[0][0]
__________________________________________________________________________________________________
denseEndTwo (Dense) (None, 20) 1020 normEndOne[0][0]
__________________________________________________________________________________________________
denseEndThree (Dense) (None, 10) 210 denseEndTwo[0][0]
__________________________________________________________________________________________________
denseEndFour (Dense) (None, 5) 55 denseEndThree[0][0]
__________________________________________________________________________________________________
output (Dense) (None, 2) 12 denseEndFour[0][0]
==================================================================================================
Total params: 17,737
Trainable params: 17,509
Non-trainable params: 228
__________________________________________________________________________________________________
python -m tf2onnx.convert --saved-model saved-model --output model.onnx
constant_folding: Graph size after: 318 nodes (-125), 417 edges (-159), time = 17.982ms.
function_optimizer: function_optimizer did nothing. time = 0.218ms.
constant_folding: Graph size after: 318 nodes (0), 417 edges (0), time = 6.774ms.
function_optimizer: function_optimizer did nothing. time = 0.169ms.
2021-05-25 13:26:13,213 - INFO - Using tensorflow=2.4.1, onnx=1.8.0, tf2onnx=1.8.4/cd55bf
2021-05-25 13:26:13,213 - INFO - Using opset <onnx, 9>
2021-05-25 13:26:13,235 - INFO - Computed 0 values for constant folding
2021-05-25 13:26:13,433 - INFO - Optimizing ONNX model
2021-05-25 13:26:14,032 - INFO - After optimization: Cast -3 (18->15), Concat -2 (15->13), Const -68 (117->49), Gather -2 (12->10), Identity -12 (12->0), ReduceProd -2 (12->10), Reshape -1 (12->11), Shape -1 (6->5), Transpose -1 (30->29), Unsqueeze -2 (21->19)
2021-05-25 13:26:14,043 - INFO -
2021-05-25 13:26:14,043 - INFO - Successfully converted TensorFlow model saved-model to ONNX
2021-05-25 13:26:14,043 - INFO - Model inputs: ['inputparticle:0', 'inputsv:0']
2021-05-25 13:26:14,043 - INFO - Model outputs: ['output']
2021-05-25 13:26:14,043 - INFO - ONNX model is saved at model.onnx
Versions:
- tf2onnx: 1.8.4
- tf: 2.4.1
- onnxruntime: 1.7.0
- keras: 2.4.0
- python: 3.7.10
Yes, please share a zip with the saved model and the onnx model. Normally the model produces correct results if it converts without error, but we do have some transformations that are technically incorrect for NaN values. Did you try any inputs other than np.zeros? You might want to try it on actual test data. I'm wondering if some normalization of the input is creating a division by 0 when all the inputs are 0.
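A quick way to compare the two runtimes on non-zero inputs could look like this (just a sketch: the input shapes come from the model summary above, and random values stand in for real test data):
import numpy as np
import onnxruntime as rt
from keras.models import load_model

keras_model = load_model('saved-model')
sess = rt.InferenceSession('model.onnx')

# Random non-zero inputs with the shapes from the model summary above
x = np.random.rand(1, 60, 18).astype(np.float32)
y = np.random.rand(1, 5, 6).astype(np.float32)

keras_out = keras_model.predict([x, y])
onnx_out = sess.run([sess.get_outputs()[0].name],
                    {sess.get_inputs()[0].name: x, sess.get_inputs()[1].name: y})[0]

print('keras:', keras_out)
print('onnx: ', onnx_out)
print('max abs diff:', np.max(np.abs(keras_out - onnx_out)))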
I assume this was resolved.
For @jeffkrupa and anybody else who encounters this: try downgrading tf2onnx. I had the same issue with my model and downgrading to 1.7.2 worked for me.
Hey @juliakorovsky, can you upload a saved model and onnx file so we can reproduce this? I'm thinking optimization of Select/Where ops may be coming up here. Also, double-check that you're using the latest tf2onnx. We did a fix for a somewhat similar issue recently.
pip uninstall tf2onnx
pip install git+https://github.com/onnx/tensorflow-onnx
@TomWildenhain-Microsoft No, unfortunately I can't share the model. But I can create a list of all the ops used in the model, do you want me to do that? I checked with the latest version an hour ago; it does not work. Nothing after version 1.7.2 works (even when installing directly from Git).
Can you generate and share a similar model with the same error? (you can probably skip the training step)
@TomWildenhain-Microsoft Actually I have no idea what could cause this, so I don't know what I should simulate. I know it's not much help, but here are at least all the ops from my TensorFlow graph: 'LoopCond', 'Sigmoid', 'TensorArrayV3', 'StopGradient', 'TensorArrayReadV3', 'TensorArrayGatherV3', 'Softmax', 'Max', 'PlaceholderWithDefault', 'GatherV2', 'Sum', 'TensorArrayWriteV3', 'Select', 'Assert', 'MatMul', 'ExpandDims', 'All', 'LessEqual', 'Tanh', 'Const', 'GreaterEqual', 'Enter', 'Add', 'TensorArrayScatterV3', 'Maximum', 'LogicalAnd', 'Switch', 'Shape', 'Split', 'NextIteration', 'Identity', 'Squeeze', 'GatherNd', 'Tile', 'ReverseSequence', 'Range', 'Fill', 'Mul', 'Minimum', 'Exit', 'Cast', 'Relu', 'Conv2D', 'Reshape', 'Prod', 'Placeholder', 'Rsqrt', 'BiasAdd', 'StridedSlice', 'Less', 'LogicalOr', 'Pack', 'Transpose', 'Sub', 'Equal', 'LogicalNot', 'BatchMatMul', 'Greater', 'TensorArraySizeV3', 'ConcatV2', 'Merge'.
I hope someone who can share their model will help. Also, I don't know if this is important, or whether this repo uses TensorFlow's fold_constants function in any way, but in the latest versions of TF fold_constants breaks the graph of my model. Maybe these issues are connected, maybe not, but I thought I should mention it just in case.
Interesting. That op list might help. Are you able to get an op list of the working and failing onnx models? I suspect you will find "Where" in the working model but not in the failing model. Also can you send a picture of the graph where the Select op occurs using Netron?
@TomWildenhain-Microsoft I know how to extract ops in TensorFlow; how should I do it with ONNX (I'm using Python with the onnx and onnxruntime libraries)?
Try this:
from collections import Counter
import onnx

model = onnx.load("path")
# Count how many times each op type appears in the graph
ops = Counter(node.op_type for node in model.graph.node)
print(ops)
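To compare two converted models directly, the same counter can be diffed (a small sketch; the file names are placeholders for the working and failing exports):
from collections import Counter
import onnx

def op_counts(path):
    # Count op types in one ONNX graph
    return Counter(node.op_type for node in onnx.load(path).graph.node)

good = op_counts('working.onnx')       # placeholder paths
bad = op_counts('non_working.onnx')
print('ops only in the working model:', set(good) - set(bad))
print('ops only in the failing model:', set(bad) - set(good))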
@TomWildenhain-Microsoft Working model:
'Transpose': 23, 'Add': 17, 'Cast': 16, 'Unsqueeze': 13, 'Squeeze': 10, 'Conv': 8, 'Mul': 8, 'Concat': 8, 'Shape': 5, 'Tanh': 5, 'Reshape': 5, 'Relu': 4, 'Slice': 4, 'Gather': 3, 'Scan': 2, 'Less': 2, 'ReverseSequence': 2, 'Gemm': 2, 'ReduceProd': 2, 'Sub': 1, 'ReduceMin': 1, 'Greater': 1, 'Not': 1, 'And': 1, 'GRU': 1, 'GatherND': 1, 'Tile': 1, 'Range': 1, 'MatMul': 1, 'Expand': 1, 'Loop': 1
Total 31 different ops
Non-working model:
'Add': 17, 'Cast': 13, 'Unsqueeze': 11, 'Squeeze': 10, 'Reshape': 10, 'Conv': 8, 'Mul': 8, 'Transpose': 6, 'Concat': 6, 'Tanh': 5, 'Relu': 4, 'Shape': 4, 'Slice': 4, 'Gather': 2, 'Less': 2, 'Scan': 2, 'ReverseSequence': 2, 'Gemm': 2, 'Sub': 1, 'ReduceMin': 1, 'Greater': 1, 'Not': 1, 'And': 1, 'GRU': 1, 'GatherND': 1, 'Tile': 1, 'Range': 1, 'MatMul': 1, 'Expand': 1, 'Loop': 1
Total 30 different ops.
@TomWildenhain-Microsoft Also, I forgot to mention that I use opset 11 and the graphdef format (I got a successful conversion with 1.7.2 and opsets 11 and 12).
Wow, those are very similar. The transpose difference is from the transpose optimizer. Can you edit your tf2onnx installation to disable the optimizers? Set this line to False: https://github.com/onnx/tensorflow-onnx/blob/da324a50953ce5ce9202e3e86f8404ed2fdc8dea/tf2onnx/optimizer/__init__.py#L52
Then see if the conversion works (on the latest tf2onnx).
@TomWildenhain-Microsoft I’ve also noticed there’s no ReduceProd in the non-working model. Is that expected? I’ll try disabling the optimizers tomorrow.
Good catch. If you can share pictures of the models in Netron around the ReduceProd area, maybe that will give a hint.
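One way to locate those nodes without hunting through Netron manually is to list them with their input/output tensor names (a quick sketch; the path is a placeholder, and the same can be repeated for the failing model):
import onnx

model = onnx.load('working.onnx')  # placeholder path
for node in model.graph.node:
    if node.op_type == 'ReduceProd':
        # The tensor names make the node easy to find in Netron's search box
        print(node.name, 'inputs:', list(node.input), 'outputs:', list(node.output))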
@TomWildenhain-Microsoft I set the optimization line to False and got the message "After optimization: no change" in the terminal, but the model still produces NaN. In the Netron graph it looks like the whole picture is mirrored compared to both the working and non-working models (I don't know if that's ok).
I'm attaching pictures of the ReduceProd nodes (or the places where they should be) for all three models: working, non-working, non-optimized.
Here's the working model:
Here's the non-working model:
Here's the non-optimized model:
Ah, those are just being used to compute dimensions for the Reshapes. Reshape won't introduce or remove NaN values, so this can't be the cause.
I can try to get you a script that finds where the NaN values originate from.
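Roughly, such a script could expose every intermediate tensor as a graph output and check each one for NaN. A minimal sketch (assuming ONNX shape inference can type most of the intermediates; file names and the zero inputs mirror the original repro and are placeholders):
import numpy as np
import onnx
import onnxruntime as rt

model = onnx.load('model.onnx')
# Shape inference fills graph.value_info with typed entries for (most) intermediate tensors
model = onnx.shape_inference.infer_shapes(model)

# Promote every inferred intermediate tensor to a graph output so onnxruntime returns it
existing = {o.name for o in model.graph.output}
model.graph.output.extend(vi for vi in model.graph.value_info if vi.name not in existing)
onnx.save(model, 'model_debug.onnx')

sess = rt.InferenceSession('model_debug.onnx')
# Same zero inputs as the original repro; assumes the input order matches
feeds = {sess.get_inputs()[0].name: np.zeros((1, 60, 18), dtype=np.float32),
         sess.get_inputs()[1].name: np.zeros((1, 5, 6), dtype=np.float32)}
names = [o.name for o in sess.get_outputs()]
results = sess.run(names, feeds)

nan_tensors = [name for name, value in zip(names, results)
               if np.issubdtype(value.dtype, np.floating) and np.isnan(value).any()]
print('tensors containing NaN:', nan_tensors)
The earliest tensors in that list (roughly in graph order) point at the node where the NaNs first appear.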
@TomWildenhain-Microsoft That would be cool, thanks.
Hi @TomWildenhain-Microsoft and @juliakorovsky, thanks for your help. I tried with and without optimizations, as well as going back to tf2onnx==1.7.2, but I am still encountering the same NaN issue where direct inference in TensorFlow works but ONNX does not. Here is the model along with the attempted conversion (with optimization):
2021-11-16 12:29:00,056 - INFO - tf2onnx: outputs: ['Identity:0']
2021-11-16 12:29:00,090 - INFO - tf2onnx.tfonnx: Using tensorflow=2.6.0, onnx=1.10.2, tf2onnx=1.7.2/995bd6
2021-11-16 12:29:00,090 - INFO - tf2onnx.tfonnx: Using opset <onnx, 8>
2021-11-16 12:29:00,111 - INFO - tf2onnx.tf_utils: Computed 0 values for constant folding
2021-11-16 12:29:00,248 - VERBOSE - tf2onnx.tfonnx: Mapping TF node to ONNX node(s)
2021-11-16 12:29:00,273 - VERBOSE - tf2onnx.tfonnx: Summay Stats:
tensorflow ops: Counter({'Const': 144, 'BiasAdd': 13, 'Transpose': 12, 'GatherV2': 12, 'Prod': 12, 'Reshape': 12, 'Relu': 12, 'MatMul': 10, 'ConcatV2': 9, 'ExpandDims': 9, 'Conv2D': 9, 'Squeeze': 9, 'Identity': 7, 'Shape': 6, 'Pack': 6, 'Mul': 3, 'AddV2': 3, 'Placeholder': 2, 'NoOp': 1, 'Sum': 1, 'Softmax': 1})
tensorflow attr: Counter({'dtype': 146, 'value': 144, 'T': 134, 'Tidx': 22, 'data_format': 22, 'N': 15, 'keep_dims': 13, 'Tperm': 12, 'batch_dims': 12, 'Taxis': 12, 'Tindices': 12, 'Tparams': 12, 'Tshape': 12, 'transpose_a': 10, 'transpose_b': 10, 'Tdim': 9, 'strides': 9, 'use_cudnn_on_gpu': 9, 'explicit_paddings': 9, 'padding': 9, 'dilations': 9, 'squeeze_dims': 9, 'out_type': 6, 'axis': 6, 'shape': 2, '_acd_function_control_output': 1})
onnx mapped: Counter({'Const': 111, 'BiasAdd': 13, 'Transpose': 12, 'GatherV2': 12, 'Prod': 12, 'Reshape': 12, 'Relu': 12, 'MatMul': 10, 'ConcatV2': 9, 'ExpandDims': 9, 'Conv2D': 9, 'Squeeze': 9, 'Identity': 7, 'Shape': 6, 'Pack': 6, 'Mul': 3, 'AddV2': 3, 'Placeholder': 2, 'Sum': 1, 'Softmax': 1})
onnx unmapped: Counter()
2021-11-16 12:29:00,273 - INFO - tf2onnx.optimizer: Optimizing ONNX model
2021-11-16 12:29:00,274 - VERBOSE - tf2onnx.optimizer: Apply optimize_transpose
2021-11-16 12:29:00,295 - VERBOSE - tf2onnx.optimizer.TransposeOptimizer: Const -55 (111->56), Transpose -2 (30->28)
2021-11-16 12:29:00,295 - VERBOSE - tf2onnx.optimizer: Apply remove_redundant_upsample
2021-11-16 12:29:00,390 - VERBOSE - tf2onnx.optimizer.UpsampleOptimizer: no change
2021-11-16 12:29:00,390 - VERBOSE - tf2onnx.optimizer: Apply fold_constants
2021-11-16 12:29:00,405 - VERBOSE - tf2onnx.optimizer.ConstFoldOptimizer: no change
2021-11-16 12:29:00,405 - VERBOSE - tf2onnx.optimizer: Apply loop_optimizer
2021-11-16 12:29:00,418 - VERBOSE - tf2onnx.optimizer.LoopOptimizer: no change
2021-11-16 12:29:00,418 - VERBOSE - tf2onnx.optimizer: Apply merge_duplication
2021-11-16 12:29:00,463 - VERBOSE - tf2onnx.optimizer.MergeDuplicatedNodesOptimizer: Cast -5 (18->13), Concat -3 (15->12), Const -13 (56->43), Gather -4 (12->8), ReduceProd -4 (12->8), Shape -2 (6->4), Unsqueeze -4 (21->17)
2021-11-16 12:29:00,463 - VERBOSE - tf2onnx.optimizer: Apply remove_identity
2021-11-16 12:29:00,477 - VERBOSE - tf2onnx.optimizer.IdentityOptimizer: Identity -8 (8->0)
2021-11-16 12:29:00,477 - VERBOSE - tf2onnx.optimizer: Apply remove_back_to_back
2021-11-16 12:29:00,488 - VERBOSE - tf2onnx.optimizer.BackToBackOptimizer: no change
2021-11-16 12:29:00,488 - VERBOSE - tf2onnx.optimizer: Apply optimize_transpose
2021-11-16 12:29:00,503 - VERBOSE - tf2onnx.optimizer.TransposeOptimizer: no change
2021-11-16 12:29:00,503 - VERBOSE - tf2onnx.optimizer: Apply remove_redundant_upsample
2021-11-16 12:29:00,514 - VERBOSE - tf2onnx.optimizer.UpsampleOptimizer: no change
2021-11-16 12:29:00,514 - VERBOSE - tf2onnx.optimizer: Apply fold_constants
2021-11-16 12:29:00,526 - VERBOSE - tf2onnx.optimizer.ConstFoldOptimizer: no change
2021-11-16 12:29:00,526 - VERBOSE - tf2onnx.optimizer: Apply loop_optimizer
2021-11-16 12:29:00,537 - VERBOSE - tf2onnx.optimizer.LoopOptimizer: no change
2021-11-16 12:29:00,537 - VERBOSE - tf2onnx.optimizer: Apply merge_duplication
2021-11-16 12:29:00,555 - VERBOSE - tf2onnx.optimizer.MergeDuplicatedNodesOptimizer: Reshape -2 (12->10)
2021-11-16 12:29:00,555 - VERBOSE - tf2onnx.optimizer: Apply remove_identity
2021-11-16 12:29:00,566 - VERBOSE - tf2onnx.optimizer.IdentityOptimizer: no change
2021-11-16 12:29:00,566 - VERBOSE - tf2onnx.optimizer: Apply remove_back_to_back
2021-11-16 12:29:00,577 - VERBOSE - tf2onnx.optimizer.BackToBackOptimizer: no change
2021-11-16 12:29:00,577 - VERBOSE - tf2onnx.optimizer: Apply optimize_transpose
2021-11-16 12:29:00,591 - VERBOSE - tf2onnx.optimizer.TransposeOptimizer: no change
2021-11-16 12:29:00,591 - VERBOSE - tf2onnx.optimizer: Apply remove_redundant_upsample
2021-11-16 12:29:00,604 - VERBOSE - tf2onnx.optimizer.UpsampleOptimizer: no change
2021-11-16 12:29:00,604 - VERBOSE - tf2onnx.optimizer: Apply fold_constants
2021-11-16 12:29:00,615 - VERBOSE - tf2onnx.optimizer.ConstFoldOptimizer: no change
2021-11-16 12:29:00,616 - VERBOSE - tf2onnx.optimizer: Apply loop_optimizer
2021-11-16 12:29:00,627 - VERBOSE - tf2onnx.optimizer.LoopOptimizer: no change
2021-11-16 12:29:00,627 - VERBOSE - tf2onnx.optimizer: Apply merge_duplication
2021-11-16 12:29:00,640 - VERBOSE - tf2onnx.optimizer.MergeDuplicatedNodesOptimizer: no change
2021-11-16 12:29:00,640 - VERBOSE - tf2onnx.optimizer: Apply remove_identity
2021-11-16 12:29:00,653 - VERBOSE - tf2onnx.optimizer.IdentityOptimizer: no change
2021-11-16 12:29:00,653 - VERBOSE - tf2onnx.optimizer: Apply remove_back_to_back
2021-11-16 12:29:00,664 - VERBOSE - tf2onnx.optimizer.BackToBackOptimizer: no change
2021-11-16 12:29:00,666 - INFO - tf2onnx.optimizer: After optimization: Cast -5 (18->13), Concat -3 (15->12), Const -68 (111->43), Gather -4 (12->8), Identity -8 (8->0), ReduceProd -4 (12->8), Reshape -2 (12->10), Shape -2 (6->4), Transpose -2 (30->28), Unsqueeze -4 (21->17)
2021-11-16 12:29:00,674 - INFO - tf2onnx:
2021-11-16 12:29:00,674 - INFO - tf2onnx: Successfully converted TensorFlow model saved-model-187-0.775-0.770-0.575-0.590.hdf5 to ONNX
2021-11-16 12:29:00,677 - INFO - tf2onnx: ONNX model is saved at saved-model-187-0.775-0.770-0.575-0.590.onnx