
onnx output nan

Open jeffkrupa opened this issue 3 years ago • 19 comments

Hi,

I am converting the model from SavedModel format to ONNX using python -m tf2onnx.convert --saved-model saved-model --output model.onnx. More details are below.

Model inference works as expected in Keras:

>>> from keras.models import load_model
>>> import tensorflow as tf
>>> import numpy as np
>>> keras_model = load_model('saved-model')
>>> x = np.zeros(shape=(1,60,18))
>>> y = np.zeros(shape=(1,5,6))
>>> keras_model.predict([x,y])
array([[9.994024e-01, 5.975381e-04]], dtype=float32)

However, the output is NaN in onnxruntime on the ONNX file produced by tf2onnx:

>>> import onnxruntime as rt
>>> sess = rt.InferenceSession('model.onnx')
>>> x = x.astype(np.float32)
>>> y = y.astype(np.float32)
>>> sess.run([sess.get_outputs()[0].name],{sess.get_inputs()[0].name:x, sess.get_inputs()[1].name:y})
[array([[nan, nan]], dtype=float32)]

Nothing in the conversion step appears to fail. What else could be going wrong? I can share the model if needed. Thank you.

>>> keras_model.summary()
Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
inputParticle (InputLayer)      [(None, 60, 18)]     0                                            
__________________________________________________________________________________________________
inputSV (InputLayer)            [(None, 5, 6)]       0                                            
__________________________________________________________________________________________________
inputNormParticle (BatchNormali (None, 60, 18)       72          inputParticle[0][0]              
__________________________________________________________________________________________________
inputNormSV (BatchNormalization (None, 5, 6)         24          inputSV[0][0]                    
__________________________________________________________________________________________________
XdotRR (Lambda)                 (None, 3540, 18)     0           inputNormParticle[0][0]          
__________________________________________________________________________________________________
XdotRS (Lambda)                 (None, 3540, 18)     0           inputNormParticle[0][0]          
__________________________________________________________________________________________________
XdotRK (Lambda)                 (None, 300, 18)      0           inputParticle[0][0]              
__________________________________________________________________________________________________
YdotRV (Lambda)                 (None, 300, 6)       0           inputNormSV[0][0]                
__________________________________________________________________________________________________
Bpp (Lambda)                    (None, 3540, 36)     0           XdotRR[0][0]                     
                                                                 XdotRS[0][0]                     
__________________________________________________________________________________________________
Bvp (Lambda)                    (None, 300, 24)      0           XdotRK[0][0]                     
                                                                 YdotRV[0][0]                     
__________________________________________________________________________________________________
convOneParticle (Conv1D)        (None, 3540, 60)     2220        Bpp[0][0]                        
__________________________________________________________________________________________________
convOneSV (Conv1D)              (None, 300, 60)      1500        Bvp[0][0]                        
__________________________________________________________________________________________________
convTwoParticle (Conv1D)        (None, 3540, 30)     1830        convOneParticle[0][0]            
__________________________________________________________________________________________________
convTwoSV (Conv1D)              (None, 300, 30)      1830        convOneSV[0][0]                  
__________________________________________________________________________________________________
convThreeParticle (Conv1D)      (None, 3540, 20)     620         convTwoParticle[0][0]            
__________________________________________________________________________________________________
convThreeSV (Conv1D)            (None, 300, 20)      620         convTwoSV[0][0]                  
__________________________________________________________________________________________________
Epp (BatchNormalization)        (None, 3540, 20)     80          convThreeParticle[0][0]          
__________________________________________________________________________________________________
Evp (BatchNormalization)        (None, 300, 20)      80          convThreeSV[0][0]                
__________________________________________________________________________________________________
EppBar (Lambda)                 (None, 60, 20)       0           Epp[0][0]                        
__________________________________________________________________________________________________
EvpBar (Lambda)                 (None, 60, 20)       0           Evp[0][0]                        
__________________________________________________________________________________________________
C (Lambda)                      (None, 60, 58)       0           inputParticle[0][0]              
                                                                 EppBar[0][0]                     
                                                                 EvpBar[0][0]                     
__________________________________________________________________________________________________
convPredictOne (Conv1D)         (None, 60, 60)       3540        C[0][0]                          
__________________________________________________________________________________________________
convPredictTwo (Conv1D)         (None, 60, 30)       1830        convPredictOne[0][0]             
__________________________________________________________________________________________________
O (Conv1D)                      (None, 60, 24)       744         convPredictTwo[0][0]             
__________________________________________________________________________________________________
OBar (Lambda)                   (None, 24)           0           O[0][0]                          
__________________________________________________________________________________________________
denseEndOne (Dense)             (None, 50)           1250        OBar[0][0]                       
__________________________________________________________________________________________________
normEndOne (BatchNormalization) (None, 50)           200         denseEndOne[0][0]                
__________________________________________________________________________________________________
denseEndTwo (Dense)             (None, 20)           1020        normEndOne[0][0]                 
__________________________________________________________________________________________________
denseEndThree (Dense)           (None, 10)           210         denseEndTwo[0][0]                
__________________________________________________________________________________________________
denseEndFour (Dense)            (None, 5)            55          denseEndThree[0][0]              
__________________________________________________________________________________________________
output (Dense)                  (None, 2)            12          denseEndFour[0][0]               
==================================================================================================
Total params: 17,737
Trainable params: 17,509
Non-trainable params: 228
__________________________________________________________________________________________________
python -m tf2onnx.convert --saved-model saved-model --output model.onnx
  constant_folding: Graph size after: 318 nodes (-125), 417 edges (-159), time = 17.982ms.
  function_optimizer: function_optimizer did nothing. time = 0.218ms.
  constant_folding: Graph size after: 318 nodes (0), 417 edges (0), time = 6.774ms.
  function_optimizer: function_optimizer did nothing. time = 0.169ms.

2021-05-25 13:26:13,213 - INFO - Using tensorflow=2.4.1, onnx=1.8.0, tf2onnx=1.8.4/cd55bf
2021-05-25 13:26:13,213 - INFO - Using opset <onnx, 9>
2021-05-25 13:26:13,235 - INFO - Computed 0 values for constant folding
2021-05-25 13:26:13,433 - INFO - Optimizing ONNX model
2021-05-25 13:26:14,032 - INFO - After optimization: Cast -3 (18->15), Concat -2 (15->13), Const -68 (117->49), Gather -2 (12->10), Identity -12 (12->0), ReduceProd -2 (12->10), Reshape -1 (12->11), Shape -1 (6->5), Transpose -1 (30->29), Unsqueeze -2 (21->19)
2021-05-25 13:26:14,043 - INFO - 
2021-05-25 13:26:14,043 - INFO - Successfully converted TensorFlow model saved-model to ONNX
2021-05-25 13:26:14,043 - INFO - Model inputs: ['inputparticle:0', 'inputsv:0']
2021-05-25 13:26:14,043 - INFO - Model outputs: ['output']
2021-05-25 13:26:14,043 - INFO - ONNX model is saved at model.onnx

Versions:

  • tf2onnx: 1.8.4
  • tf: 2.4.1
  • onnxruntime: 1.7.0
  • keras: 2.4.0
  • python: 3.7.10

jeffkrupa avatar May 25 '21 14:05 jeffkrupa

Yes, please share a zip with the saved model and the ONNX model. Normally the model produces correct results if it converts without error, but we do have some transformations that are technically incorrect for NaN values. Did you try any inputs other than np.zeros? You might want to try it on actual test data. I'm wondering if some normalization of the input is creating a division by 0 when all the inputs are 0.
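
For example, a quick check along these lines (a rough sketch; it assumes the file names and input shapes from the summary above, and that both inputs are float) is to feed the same random inputs to both runtimes and compare:

import numpy as np
from keras.models import load_model
import onnxruntime as rt

# Random (non-zero) inputs with the shapes from keras_model.summary() above,
# so an input normalization cannot divide by zero.
x = np.random.rand(1, 60, 18).astype(np.float32)
y = np.random.rand(1, 5, 6).astype(np.float32)

keras_model = load_model('saved-model')
keras_out = keras_model.predict([x, y])

sess = rt.InferenceSession('model.onnx')
onnx_out = sess.run([sess.get_outputs()[0].name],
                    {sess.get_inputs()[0].name: x, sess.get_inputs()[1].name: y})[0]

print(keras_out, onnx_out)
print("close:", np.allclose(keras_out, onnx_out, rtol=1e-4, atol=1e-5))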

TomWildenhain-Microsoft avatar May 25 '21 17:05 TomWildenhain-Microsoft

I assume this was resolved.

guschmue avatar Aug 02 '21 16:08 guschmue

For @jeffkrupa and anybody else who encounters this: try downgrading tf2onnx. I had the same issue with my model, and downgrading to 1.7.2 worked for me.
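
For reference, pinning that version with pip looks like this:

pip install tf2onnx==1.7.2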

juliakorovsky avatar Aug 24 '21 20:08 juliakorovsky

Hey @juliakorovsky, can you upload a saved model and ONNX file so we can reproduce this? I'm thinking optimization of Select/Where ops may be coming into play here. Also, double-check using the latest tf2onnx; we did a fix for a somewhat similar issue recently.

pip uninstall tf2onnx
pip install git+https://github.com/onnx/tensorflow-onnx

TomWildenhain-Microsoft avatar Aug 24 '21 20:08 TomWildenhain-Microsoft

@TomWildenhain-Microsoft No, unfortunately I can't share the model, but I can create a list of all the ops used in the model; do you want me to do that? I checked with the latest version an hour ago and it does not work. Every version after 1.7.2 fails (even when installing directly from Git).

juliakorovsky avatar Aug 24 '21 20:08 juliakorovsky

Can you generate and share a similar model with the same error? (you can probably skip the training step)

TomWildenhain-Microsoft avatar Aug 24 '21 20:08 TomWildenhain-Microsoft

@TomWildenhain-Microsoft Actually, I have no idea what could cause this, so I don't know what I should simulate. I know it's not much help, but here are all the ops from my TensorFlow graph: 'LoopCond', 'Sigmoid', 'TensorArrayV3', 'StopGradient', 'TensorArrayReadV3', 'TensorArrayGatherV3', 'Softmax', 'Max', 'PlaceholderWithDefault', 'GatherV2', 'Sum', 'TensorArrayWriteV3', 'Select', 'Assert', 'MatMul', 'ExpandDims', 'All', 'LessEqual', 'Tanh', 'Const', 'GreaterEqual', 'Enter', 'Add', 'TensorArrayScatterV3', 'Maximum', 'LogicalAnd', 'Switch', 'Shape', 'Split', 'NextIteration', 'Identity', 'Squeeze', 'GatherNd', 'Tile', 'ReverseSequence', 'Range', 'Fill', 'Mul', 'Minimum', 'Exit', 'Cast', 'Relu', 'Conv2D', 'Reshape', 'Prod', 'Placeholder', 'Rsqrt', 'BiasAdd', 'StridedSlice', 'Less', 'LogicalOr', 'Pack', 'Transpose', 'Sub', 'Equal', 'LogicalNot', 'BatchMatMul', 'Greater', 'TensorArraySizeV3', 'ConcatV2', 'Merge'.

I hope someone who can share their model will help. Also, I don't know if this is important, since I don't know whether this repo uses TensorFlow's fold_constants function in any way, but in the latest versions of TF, fold_constants breaks the graph of my model. Maybe these issues are connected, maybe not, but I thought I should mention it just in case.

juliakorovsky avatar Aug 24 '21 20:08 juliakorovsky

Interesting. That op list might help. Are you able to get an op list of the working and failing onnx models? I suspect you will find "Where" in the working model but not in the failing model. Also can you send a picture of the graph where the Select op occurs using Netron?

TomWildenhain-Microsoft avatar Aug 24 '21 21:08 TomWildenhain-Microsoft

@TomWildenhain-Microsoft I know how to extract ops in TensorFlow; how do I do it with ONNX (I'm using Python with the onnx and onnxruntime libraries)?

juliakorovsky avatar Aug 24 '21 21:08 juliakorovsky

Try this:

from collections import Counter
import onnx

model = onnx.load("path")  # path to your .onnx file
# Count how many times each op type appears in the main graph
ops = Counter(node.op_type for node in model.graph.node)
print(ops)

TomWildenhain-Microsoft avatar Aug 24 '21 21:08 TomWildenhain-Microsoft

@TomWildenhain-Microsoft Working model:

'Transpose': 23, 'Add': 17, 'Cast': 16, 'Unsqueeze': 13, 'Squeeze': 10, 'Conv': 8, 'Mul': 8, 'Concat': 8, 'Shape': 5, 'Tanh': 5, 'Reshape': 5, 'Relu': 4, 'Slice': 4, 'Gather': 3, 'Scan': 2, 'Less': 2, 'ReverseSequence': 2, 'Gemm': 2, 'ReduceProd': 2, 'Sub': 1, 'ReduceMin': 1, 'Greater': 1, 'Not': 1, 'And': 1, 'GRU': 1, 'GatherND': 1, 'Tile': 1, 'Range': 1, 'MatMul': 1, 'Expand': 1, 'Loop': 1

Total 31 different ops

Non-working model:

'Add': 17, 'Cast': 13, 'Unsqueeze': 11, 'Squeeze': 10, 'Reshape': 10, 'Conv': 8, 'Mul': 8, 'Transpose': 6, 'Concat': 6, 'Tanh': 5, 'Relu': 4, 'Shape': 4, 'Slice': 4, 'Gather': 2, 'Less': 2, 'Scan': 2, 'ReverseSequence': 2, 'Gemm': 2, 'Sub': 1, 'ReduceMin': 1, 'Greater': 1, 'Not': 1, 'And': 1, 'GRU': 1, 'GatherND': 1, 'Tile': 1, 'Range': 1, 'MatMul': 1, 'Expand': 1, 'Loop': 1

Total 30 different ops.

juliakorovsky avatar Aug 24 '21 21:08 juliakorovsky

@TomWildenhain-Microsoft Also, I forgot to mention that I use opset 11 and the graphdef format (I got a successful conversion with 1.7.2 and opsets 11 and 12).

juliakorovsky avatar Aug 24 '21 21:08 juliakorovsky

Wow, those are very similar. The transpose difference is from the transpose optimizer. Can you edit your tf2onnx installation to disable the optimizers? Set this line to False: https://github.com/onnx/tensorflow-onnx/blob/da324a50953ce5ce9202e3e86f8404ed2fdc8dea/tf2onnx/optimizer/__init__.py#L52

Then see if conversion works (on the latest tf2onnx)

TomWildenhain-Microsoft avatar Aug 24 '21 22:08 TomWildenhain-Microsoft

@TomWildenhain-Microsoft I've also noticed there's no ReduceProd in the non-working model. Is that expected? I'll try disabling the optimizers tomorrow.

juliakorovsky avatar Aug 24 '21 22:08 juliakorovsky

Good catch. If you can give pictures of the models in Netron around the ReduceProd area, maybe that will give a hint.

TomWildenhain-Microsoft avatar Aug 24 '21 23:08 TomWildenhain-Microsoft

@TomWildenhain-Microsoft I set the optimization line to False and got the message "After optimization: no change" in the terminal, but the model still produces NaN. In the Netron graph, it looks like the whole picture was mirrored compared to both the working and non-working models (I don't know if that's ok).

I attach pictures of ReduceProd nodes (or places where they should be) for all three models: working, non-working, non-optimized.

Here's the working model:

working_model_netron

Here's the non-working model:

nonworking_model_netron

Here's the non-optimized model:

wo_optimizers_model_netron

juliakorovsky avatar Aug 25 '21 08:08 juliakorovsky

Ah, those are just being used to compute dimensions for the shapes. Reshape won't introduce or remove NaN values, so this can't be the cause.

I can try to get you a script that finds where the NaN values originate from.
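
Roughly, a sketch of how such a script could work (assuming the file is model.onnx and all inputs are float): expose every intermediate tensor as an extra graph output, run once on random data, and report which tensors contain NaN. Tensors inside Loop/Scan subgraphs would not be covered.

import numpy as np
import onnx
import onnxruntime as rt

model = onnx.load("model.onnx")

# Shape inference populates value_info for intermediate tensors;
# append each one to the graph outputs so onnxruntime returns it.
inferred = onnx.shape_inference.infer_shapes(model)
existing = {o.name for o in inferred.graph.output}
for vi in inferred.graph.value_info:
    if vi.name not in existing:
        inferred.graph.output.append(vi)

sess = rt.InferenceSession(inferred.SerializeToString())

# Random feeds; symbolic/unknown dims are replaced with 1.
feeds = {}
for inp in sess.get_inputs():
    shape = [d if isinstance(d, int) else 1 for d in inp.shape]
    feeds[inp.name] = np.random.rand(*shape).astype(np.float32)

names = [o.name for o in sess.get_outputs()]
results = sess.run(names, feeds)

# Print every tensor that contains NaN; the one produced earliest in the
# graph is usually where the problem starts.
for name, value in zip(names, results):
    if isinstance(value, np.ndarray) and np.issubdtype(value.dtype, np.floating) and np.isnan(value).any():
        print("NaN in:", name)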

TomWildenhain-Microsoft avatar Aug 25 '21 16:08 TomWildenhain-Microsoft

@TomWildenhain-Microsoft That would be cool, thanks.

juliakorovsky avatar Aug 25 '21 17:08 juliakorovsky

Hi @TomWildenhain-Microsoft and @juliakorovsky, thanks for your help. I tried with and without optimizations, as well as going back to tf2onnx==1.7.2, but I am still encountering the same NaN issue: direct inference in TensorFlow works but ONNX does not. Here is the model along with the attempted conversion (with optimization):

2021-11-16 12:29:00,056 - INFO - tf2onnx: outputs: ['Identity:0']
2021-11-16 12:29:00,090 - INFO - tf2onnx.tfonnx: Using tensorflow=2.6.0, onnx=1.10.2, tf2onnx=1.7.2/995bd6
2021-11-16 12:29:00,090 - INFO - tf2onnx.tfonnx: Using opset <onnx, 8>
2021-11-16 12:29:00,111 - INFO - tf2onnx.tf_utils: Computed 0 values for constant folding
2021-11-16 12:29:00,248 - VERBOSE - tf2onnx.tfonnx: Mapping TF node to ONNX node(s)
2021-11-16 12:29:00,273 - VERBOSE - tf2onnx.tfonnx: Summay Stats:
	tensorflow ops: Counter({'Const': 144, 'BiasAdd': 13, 'Transpose': 12, 'GatherV2': 12, 'Prod': 12, 'Reshape': 12, 'Relu': 12, 'MatMul': 10, 'ConcatV2': 9, 'ExpandDims': 9, 'Conv2D': 9, 'Squeeze': 9, 'Identity': 7, 'Shape': 6, 'Pack': 6, 'Mul': 3, 'AddV2': 3, 'Placeholder': 2, 'NoOp': 1, 'Sum': 1, 'Softmax': 1})
	tensorflow attr: Counter({'dtype': 146, 'value': 144, 'T': 134, 'Tidx': 22, 'data_format': 22, 'N': 15, 'keep_dims': 13, 'Tperm': 12, 'batch_dims': 12, 'Taxis': 12, 'Tindices': 12, 'Tparams': 12, 'Tshape': 12, 'transpose_a': 10, 'transpose_b': 10, 'Tdim': 9, 'strides': 9, 'use_cudnn_on_gpu': 9, 'explicit_paddings': 9, 'padding': 9, 'dilations': 9, 'squeeze_dims': 9, 'out_type': 6, 'axis': 6, 'shape': 2, '_acd_function_control_output': 1})
	onnx mapped: Counter({'Const': 111, 'BiasAdd': 13, 'Transpose': 12, 'GatherV2': 12, 'Prod': 12, 'Reshape': 12, 'Relu': 12, 'MatMul': 10, 'ConcatV2': 9, 'ExpandDims': 9, 'Conv2D': 9, 'Squeeze': 9, 'Identity': 7, 'Shape': 6, 'Pack': 6, 'Mul': 3, 'AddV2': 3, 'Placeholder': 2, 'Sum': 1, 'Softmax': 1})
	onnx unmapped: Counter()
2021-11-16 12:29:00,273 - INFO - tf2onnx.optimizer: Optimizing ONNX model
2021-11-16 12:29:00,274 - VERBOSE - tf2onnx.optimizer: Apply optimize_transpose
2021-11-16 12:29:00,295 - VERBOSE - tf2onnx.optimizer.TransposeOptimizer: Const -55 (111->56), Transpose -2 (30->28)
2021-11-16 12:29:00,295 - VERBOSE - tf2onnx.optimizer: Apply remove_redundant_upsample
2021-11-16 12:29:00,390 - VERBOSE - tf2onnx.optimizer.UpsampleOptimizer: no change
2021-11-16 12:29:00,390 - VERBOSE - tf2onnx.optimizer: Apply fold_constants
2021-11-16 12:29:00,405 - VERBOSE - tf2onnx.optimizer.ConstFoldOptimizer: no change
2021-11-16 12:29:00,405 - VERBOSE - tf2onnx.optimizer: Apply loop_optimizer
2021-11-16 12:29:00,418 - VERBOSE - tf2onnx.optimizer.LoopOptimizer: no change
2021-11-16 12:29:00,418 - VERBOSE - tf2onnx.optimizer: Apply merge_duplication
2021-11-16 12:29:00,463 - VERBOSE - tf2onnx.optimizer.MergeDuplicatedNodesOptimizer: Cast -5 (18->13), Concat -3 (15->12), Const -13 (56->43), Gather -4 (12->8), ReduceProd -4 (12->8), Shape -2 (6->4), Unsqueeze -4 (21->17)
2021-11-16 12:29:00,463 - VERBOSE - tf2onnx.optimizer: Apply remove_identity
2021-11-16 12:29:00,477 - VERBOSE - tf2onnx.optimizer.IdentityOptimizer: Identity -8 (8->0)
2021-11-16 12:29:00,477 - VERBOSE - tf2onnx.optimizer: Apply remove_back_to_back
2021-11-16 12:29:00,488 - VERBOSE - tf2onnx.optimizer.BackToBackOptimizer: no change
2021-11-16 12:29:00,488 - VERBOSE - tf2onnx.optimizer: Apply optimize_transpose
2021-11-16 12:29:00,503 - VERBOSE - tf2onnx.optimizer.TransposeOptimizer: no change
2021-11-16 12:29:00,503 - VERBOSE - tf2onnx.optimizer: Apply remove_redundant_upsample
2021-11-16 12:29:00,514 - VERBOSE - tf2onnx.optimizer.UpsampleOptimizer: no change
2021-11-16 12:29:00,514 - VERBOSE - tf2onnx.optimizer: Apply fold_constants
2021-11-16 12:29:00,526 - VERBOSE - tf2onnx.optimizer.ConstFoldOptimizer: no change
2021-11-16 12:29:00,526 - VERBOSE - tf2onnx.optimizer: Apply loop_optimizer
2021-11-16 12:29:00,537 - VERBOSE - tf2onnx.optimizer.LoopOptimizer: no change
2021-11-16 12:29:00,537 - VERBOSE - tf2onnx.optimizer: Apply merge_duplication
2021-11-16 12:29:00,555 - VERBOSE - tf2onnx.optimizer.MergeDuplicatedNodesOptimizer: Reshape -2 (12->10)
2021-11-16 12:29:00,555 - VERBOSE - tf2onnx.optimizer: Apply remove_identity
2021-11-16 12:29:00,566 - VERBOSE - tf2onnx.optimizer.IdentityOptimizer: no change
2021-11-16 12:29:00,566 - VERBOSE - tf2onnx.optimizer: Apply remove_back_to_back
2021-11-16 12:29:00,577 - VERBOSE - tf2onnx.optimizer.BackToBackOptimizer: no change
2021-11-16 12:29:00,577 - VERBOSE - tf2onnx.optimizer: Apply optimize_transpose
2021-11-16 12:29:00,591 - VERBOSE - tf2onnx.optimizer.TransposeOptimizer: no change
2021-11-16 12:29:00,591 - VERBOSE - tf2onnx.optimizer: Apply remove_redundant_upsample
2021-11-16 12:29:00,604 - VERBOSE - tf2onnx.optimizer.UpsampleOptimizer: no change
2021-11-16 12:29:00,604 - VERBOSE - tf2onnx.optimizer: Apply fold_constants
2021-11-16 12:29:00,615 - VERBOSE - tf2onnx.optimizer.ConstFoldOptimizer: no change
2021-11-16 12:29:00,616 - VERBOSE - tf2onnx.optimizer: Apply loop_optimizer
2021-11-16 12:29:00,627 - VERBOSE - tf2onnx.optimizer.LoopOptimizer: no change
2021-11-16 12:29:00,627 - VERBOSE - tf2onnx.optimizer: Apply merge_duplication
2021-11-16 12:29:00,640 - VERBOSE - tf2onnx.optimizer.MergeDuplicatedNodesOptimizer: no change
2021-11-16 12:29:00,640 - VERBOSE - tf2onnx.optimizer: Apply remove_identity
2021-11-16 12:29:00,653 - VERBOSE - tf2onnx.optimizer.IdentityOptimizer: no change
2021-11-16 12:29:00,653 - VERBOSE - tf2onnx.optimizer: Apply remove_back_to_back
2021-11-16 12:29:00,664 - VERBOSE - tf2onnx.optimizer.BackToBackOptimizer: no change
2021-11-16 12:29:00,666 - INFO - tf2onnx.optimizer: After optimization: Cast -5 (18->13), Concat -3 (15->12), Const -68 (111->43), Gather -4 (12->8), Identity -8 (8->0), ReduceProd -4 (12->8), Reshape -2 (12->10), Shape -2 (6->4), Transpose -2 (30->28), Unsqueeze -4 (21->17)
2021-11-16 12:29:00,674 - INFO - tf2onnx: 
2021-11-16 12:29:00,674 - INFO - tf2onnx: Successfully converted TensorFlow model saved-model-187-0.775-0.770-0.575-0.590.hdf5 to ONNX
2021-11-16 12:29:00,677 - INFO - tf2onnx: ONNX model is saved at saved-model-187-0.775-0.770-0.575-0.590.onnx

jeffkrupa avatar Nov 16 '21 18:11 jeffkrupa