Cannot use converted ONNX -> TF graph independently [py_func issue]

Open nmakhotkin opened this issue 6 years ago • 67 comments

I am trying to convert an ONNX model to TensorFlow and then use it for inference (possibly in another environment). Here is an example that converts an MNIST model:

import numpy as np
import onnx
from onnx_tf.backend import prepare
import tensorflow as tf


print('loading onnx model')
onnx_model = onnx.load('train/model.onnx')

print('prepare tf model')
tf_rep = prepare(onnx_model)
print(tf_rep.predict_net)
print('-----')
print(tf_rep.predict_net.tensor_dict)

test = np.random.rand(1, 1, 28, 28)

out = tf_rep.run(test)._0
print(out)

with tf.Session() as persisted_sess:
    print("load graph")
    persisted_sess.graph.as_default()
    tf.import_graph_def(tf_rep.predict_net.graph.as_graph_def(), name='')
    # for op in persisted_sess.graph.get_operations():
    #    print(op)
    inp = persisted_sess.graph.get_tensor_by_name(
        tf_rep.predict_net.tensor_dict[tf_rep.predict_net.external_input[0]].name
    )
    out = persisted_sess.graph.get_tensor_by_name(
        tf_rep.predict_net.tensor_dict[tf_rep.predict_net.external_output[0]].name
    )
    res = persisted_sess.run(out, {inp: test})
    print(res)

tf_rep.export_graph('train/tf.pb')

The script above executes successfully, and the prediction also runs successfully (res == out here). Next, I import the saved model in TF:

import numpy as np
import tensorflow as tf
from tensorflow.python.platform import gfile


name = "train/tf.pb"

with tf.Session() as persisted_sess:
    print("load graph")
    with gfile.FastGFile(name, 'rb') as f:
        graph_def = tf.GraphDef()
        graph_def.ParseFromString(f.read())

    persisted_sess.graph.as_default()
    tf.import_graph_def(graph_def, name='')

    test = np.random.rand(1, 1, 28, 28).astype(np.float32)

    inp = persisted_sess.graph.get_tensor_by_name('0:0')
    out = persisted_sess.graph.get_tensor_by_name('LogSoftmax:0')
    feed_dict = {inp: test}

    classification = persisted_sess.run(out, feed_dict)

Now I get an error about a PyFunc that does not exist:

2018-05-14 15:28:36.780899: W tensorflow/core/framework/op_kernel.cc:1198] tensorflow.python.framework.errors_impl.UnknownError: exceptions.KeyError: 'pyfunc_0'
         [[Node: PyFunc = PyFunc[Tin=[DT_FLOAT, DT_INT32, DT_INT32, DT_INT32, DT_INT32, DT_INT32, DT_STRING], Tout=[DT_FLOAT], token="pyfunc_0", _device="/job:localhost/replica:0/task:0/device:CPU:0"](transpose_2, PyFunc/input_1, PyFunc/input_1, PyFunc/input_3, PyFunc/input_4, PyFunc/input_5, PyFunc/input_6)]]

Full log: https://pastebin.com/0bQeMPTG

However, at export time the model worked fine (see above). I investigated which functions the TF graph actually uses:

(Pdb) from tensorflow.python.ops import script_ops
(Pdb) script_ops._py_funcs._funcs
{'pyfunc_0': <function py_pool at 0x7f395db05500>, 'pyfunc_1': <function py_pool at 0x7f395dab99b0>}
(Pdb) funcs = script_ops._py_funcs._funcs.values()
(Pdb) func = funcs[0]
(Pdb) func.func_name
'py_pool'
(Pdb) func.func_code
<code object py_pool at 0x7f395fb9deb0, file "/usr/local/lib/python2.7/dist-packages/onnx_tf/backends/backend_v1.py", line 94>
(Pdb)

So, I have a question: is it intended that the TF graph uses an external function from the onnx_tf package, or is this simply a bug? Is there any way to make this model independent of the onnx and onnx-tf packages?
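
One way to catch this before deploying is to scan the exported GraphDef for Python-callback ops. The following is a minimal sketch, not part of onnx-tf: `find_py_func_nodes` is a hypothetical helper, and the `SimpleNamespace` objects only stand in for a real `tf.GraphDef`, whose `node` entries expose `name` and `op` fields.

```python
from types import SimpleNamespace

# Op types that call back into the Python process that built the graph.
PY_CALLBACK_OPS = {"PyFunc", "PyFuncStateless", "EagerPyFunc"}


def find_py_func_nodes(graph_def):
    """Return the names of nodes that depend on a registered Python function."""
    return [node.name for node in graph_def.node if node.op in PY_CALLBACK_OPS]


# Stand-in for a parsed GraphDef (assumption: a real one would come from
# graph_def.ParseFromString(...) as in the loading script above).
fake_graph_def = SimpleNamespace(node=[
    SimpleNamespace(name="0", op="Placeholder"),
    SimpleNamespace(name="PyFunc", op="PyFunc"),
    SimpleNamespace(name="LogSoftmax", op="LogSoftmax"),
])

offending = find_py_func_nodes(fake_graph_def)
print(offending)
```

An empty result would suggest that the graph no longer relies on functions registered in the exporting process.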

nmakhotkin avatar May 14 '18 12:05 nmakhotkin

I guess there are some pooling ops in your ONNX pb. Take a look at them. If a pooling op satisfies either of the following conditions,

  • auto_pad is not set to SAME_UPPER or VALID
  • count_include_pad is 1

we use py_func to run _compatibility_pool, because TensorFlow has no corresponding native pool op.

We didn't consider that a user would want to do what you did: onnx -> tensorflow -> tensorflow.
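
Read literally, the two conditions above amount to the following check. This is only a sketch of the rule as stated in this comment, not the library's actual code; the attribute dictionary and the name `falls_back_to_py_func` are assumptions made for illustration.

```python
def falls_back_to_py_func(attrs):
    """Apply the two conditions listed above to one pooling node.

    `attrs` is a plain dict of ONNX attribute names to values
    (a hypothetical representation, not an onnx NodeProto).
    """
    native_auto_pad = attrs.get("auto_pad") in ("SAME_UPPER", "VALID")
    counts_padding = attrs.get("count_include_pad", 0) == 1
    return (not native_auto_pad) or counts_padding


# A pooling node that carries explicit pads and no auto_pad attribute.
explicit_pads = {"kernel_shape": [2, 2], "strides": [2, 2], "pads": [0, 0, 0, 0]}
# A pooling node that declares VALID padding directly.
valid_pool = {"kernel_shape": [2, 2], "strides": [2, 2], "auto_pad": "VALID"}

print(falls_back_to_py_func(explicit_pads))
print(falls_back_to_py_func(valid_pool))
```

Under this reading, a node that carries explicit pads instead of auto_pad would trigger the fallback even when all of its pads are zero.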

fumihwh avatar May 14 '18 13:05 fumihwh

Basically, this model comes from PyTorch; the full net class is below:

import torch
import torch.nn as nn
import torch.nn.functional as F


class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
        self.conv2_drop = nn.Dropout2d()
        self.fc1 = nn.Linear(320, 50)
        self.fc2 = nn.Linear(50, 10)

    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), 2))
        x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))
        x = x.view(-1, 320)
        x = F.relu(self.fc1(x))
        x = F.dropout(x, training=self.training)
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)

So I do pytorch -> ONNX -> tensorflow and then try to run inference in TensorFlow (the end goal is to run TensorFlow Serving). By the way, the converted ONNX model works fine, and the pytorch -> onnx -> caffe2 conversion works fine too. The problem occurs only with TensorFlow.

nmakhotkin avatar May 14 '18 14:05 nmakhotkin

So I suppose these F.max_pool2d operations are converted to py_func.

nmakhotkin avatar May 14 '18 14:05 nmakhotkin

@fumihwh we did consider the onnx -> tensorflow -> tf serving path; that is why we have export_graph in our API.

@nmakhotkin unfortunately, as @fumihwh pointed out, max_pool is a very complicated issue, and we strive to strike a balance between logical clarity/conciseness, numerical precision, the need to pass all ONNX backend tests, and performance. The fix in your case might be simple since you are not padding your feature maps (thus "VALID" padding in TF terms), but please allow us some time to come up with a more systematic fix.

@fumihwh this essentially boils down to the issue I raised with you on this PR (https://github.com/onnx/onnx-tensorflow/pull/83). Specifically, and I quote:

And we should avoid using python function as much as possible because that would prevent us from serializing the graph (thus we can't pass the generated graph to tf_serving).

I think we should revert part of that PR to use native max pooling as much as possible. Your solution was more concise and easier to reason about, but my original implementation was there for a very practical reason.

tjingrant avatar May 14 '18 18:05 tjingrant

@nmakhotkin can you provide me with the onnx model generated by torch?

tjingrant avatar May 14 '18 20:05 tjingrant

@tjingrant yes, here it is (uploaded to GDrive):

onnx model (generated by pytorch) - https://drive.google.com/file/d/13yJYYgQiiqxP8Khm-PZ5Q6JwxLi2w_4A/view?usp=sharing

original pytorch model - https://drive.google.com/file/d/11BJOI5ucsSmM-9aZBYVIBcDvf9ILihnU/view?usp=sharing

nmakhotkin avatar May 14 '18 20:05 nmakhotkin

@nmakhotkin would you like to try again with this PR https://github.com/onnx/onnx-tensorflow/pull/171/files ?

You can check out a different branch as well (https://github.com/onnx/onnx-tensorflow/tree/fix-pool).

tjingrant avatar May 14 '18 20:05 tjingrant

@tjingrant thanks! I'll try today (it is morning for me now) and will write the results here.

nmakhotkin avatar May 15 '18 06:05 nmakhotkin

@tjingrant The fix works! I just tested onnx-tf on the fix-pool branch: I converted my ONNX model to TensorFlow again, and model inference works! It now successfully recognizes some examples from MNIST:

$ python tf_inference.py 
/usr/local/lib/python2.7/dist-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
2018-05-15 11:49:27.417070: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA

Prediction of file 5.png: 6
Prediction of file 2.png: 2
Prediction of file 9.png: 9
Prediction of file 1.png: 1
Prediction of file 4.png: 4
Prediction of file 0.png: 0
Prediction of file 7.png: 7

Now there are no PyFunc ops in the graph. Full set of ops is below:

(Pdb) set([op.type for op in persisted_sess.graph.get_operations()])
set([u'MatMul', u'NoOp', u'LogSoftmax', u'Const', u'Sub', u'ExpandDims', u'Reshape', u'MaxPool', u'Transpose', u'Rank', u'Relu', u'Add', u'Identity', u'Pad', u'Split', u'Range', u'Mul', u'Pack', u'Placeholder', u'Conv2D', u'StridedSlice'])

P.S. Now waiting for the PR to be merged :)

nmakhotkin avatar May 15 '18 08:05 nmakhotkin

I'm still getting the PyFunc error even when using the fix-pool branch.

The model I converted is PyTorch's ResNet.

I've been doing the same thing that @nmakhotkin is trying to do: PyTorch -> ONNX -> TensorFlow representation, and then to a pb file for running inference.

I was able to convert the MNIST example code from PyTorch to a pb file but could not do the same for the ResNet model.

kartk avatar May 18 '18 06:05 kartk

@kartk Could you upload your onnx pb?

fumihwh avatar May 18 '18 06:05 fumihwh

The model I use is a slightly modified ResNet called Hopenet.

Here is the IR representation: https://drive.google.com/file/d/1VRCHFq7lAIhQFEZYr2o0Ij-1xKjbgIf6/view?usp=sharing

Here is the converted pb: https://drive.google.com/file/d/1PK45MwNDXPg-tTMe-M0errojUnVSMAXb/view?usp=sharing

kartk avatar May 18 '18 06:05 kartk

@kartk You should get a warning message that says:

UserWarning: Using the pooling op in compatibility mode.This means your graph cannot be serialized.
Please configure your pooling operation to only use paddings that correspond to Tensorflow SAME or VALID padding.

One layer in your network cannot use a native TensorFlow op, so we have to use the compatibility pool. I checked, and it seems to be the following layer:

input [1, 64, 112, 112]
pads [1, 1, 1, 1]
output [56, 56]
kernel [3, 3]
strides [2, 2]

If you want to use pool with "SAME" in tensorflow, the pads should be [0, 1, 0, 1].
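
For reference, the padding that TensorFlow's SAME mode applies along one spatial axis can be computed by hand. The sketch below assumes the usual SAME rule (output size ceil(input / stride), with any odd padding element placed at the end); `tf_same_pads` and `native_padding_mode` are hypothetical helper names, not onnx-tf functions.

```python
import math


def tf_same_pads(in_size, kernel, stride):
    """Return (pad_before, pad_after) that SAME padding adds on one axis."""
    out_size = math.ceil(in_size / stride)
    total = max((out_size - 1) * stride + kernel - in_size, 0)
    before = total // 2
    return before, total - before


def native_padding_mode(in_size, kernel, stride, pad_before, pad_after):
    """Name the TF padding mode that reproduces the given pads, or None."""
    if (pad_before, pad_after) == (0, 0):
        return "VALID"
    if (pad_before, pad_after) == tf_same_pads(in_size, kernel, stride):
        return "SAME"
    return None


# The layer quoted above: input 112, kernel 3, stride 2 on each spatial axis.
print(tf_same_pads(112, 3, 2))
print(native_padding_mode(112, 3, 2, 1, 1))  # symmetric pads from the model
```

For the layer above, the SAME rule gives one padding element per axis, placed at the end, which agrees with the suggested pads; the symmetric pads (1, 1) match neither SAME nor VALID.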

fumihwh avatar May 19 '18 07:05 fumihwh

Thanks @fumihwh.

I'm very new to PyTorch and to neural networks as a whole. Where do I need to change the pads so that the model is compatible with TensorFlow?

kartk avatar May 21 '18 07:05 kartk

I just tried to convert a pretrained ResNet (resnet101) model to ONNX and then to TensorFlow. As @kartk said, there is still a py_func present in the graph.

Is there a way to get rid of it completely somehow?

nmakhotkin avatar May 21 '18 08:05 nmakhotkin

@nmakhotkin to put it briefly, PyTorch's ResNet implementation is incorrect or, more precisely, not faithful to the original paper. This might sound unbelievable, but let me point you to another thread where we discussed this topic extensively (https://github.com/tensorflow/benchmarks/issues/134).

Let me quote the relevant part. For the first max-pooling layer in ResNet, here are the paddings that the various frameworks add:

Pytorch: Left 1, right 1. In this case this is equivalent to Left 1, right 0.
Caffe: Left 0, right 1.
TensorFlow SAME: Left 0, right 1.

This stems from the fact that PyTorch only supports symmetric pads. It is not a problem caused by onnx-tensorflow or Tensorflow per se, but rather an unfortunate consequence of the limitation of PyTorch.
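
The difference can be reproduced on a tiny 1-D example. This is a plain-Python sketch written for illustration (kernel 3 and stride 2 as in the layer under discussion, but with a short made-up input); `max_pool_1d` is a hypothetical helper and does not use any framework.

```python
def max_pool_1d(values, kernel, stride, pad_left, pad_right):
    """Max-pool a 1-D sequence after padding it with -inf on each side."""
    neg_inf = float("-inf")
    padded = [neg_inf] * pad_left + list(values) + [neg_inf] * pad_right
    out = []
    start = 0
    # Slide the window while it still fits entirely inside the padded input.
    while start + kernel <= len(padded):
        out.append(max(padded[start:start + kernel]))
        start += stride
    return out


values = list(range(8))  # made-up input: 0, 1, ..., 7

symmetric = max_pool_1d(values, kernel=3, stride=2, pad_left=1, pad_right=1)
left_only = max_pool_1d(values, kernel=3, stride=2, pad_left=1, pad_right=0)
tf_same = max_pool_1d(values, kernel=3, stride=2, pad_left=0, pad_right=1)

print(symmetric, left_only, tf_same)
```

With these settings the right-hand pad of the symmetric case is never reached by any window, so it behaves like left-only padding, while the SAME-style padding shifts every window by one element and yields different maxima.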

tjingrant avatar May 21 '18 15:05 tjingrant

@nmakhotkin as a result, there is no workaround that is both semantics-preserving AND serializable. We can, however, try to give you an option that slightly alters the semantics of max pool so that you can serialize the incorrect version of ResNet exported from PyTorch; expect some accuracy degradation of your model as a result.

tjingrant avatar May 21 '18 15:05 tjingrant

Thanks for the answer! Yes, it would be nice to have an additional option flag which will control this behavior (either to export precisely or not).

nmakhotkin avatar May 21 '18 16:05 nmakhotkin

@tjingrant are you planning to implement this workaround for serializing the PyTorch ResNet? It would be great! Thanks

inakinavarro avatar Jun 01 '18 11:06 inakinavarro

Hi, absolutely, but we might have other priorities in the meantime, like supporting onnx v1.2. Sorry for the delay; my estimate is that it will be there before the end of next Wednesday.

tjingrant avatar Jun 01 '18 14:06 tjingrant

That sounds great!! Thanks for your effort!!!

inakinavarro avatar Jun 01 '18 14:06 inakinavarro

@inakinavarro @nmakhotkin hi, I have created a tentative PR to address this issue: https://github.com/onnx/onnx-tensorflow/pull/212.

@inakinavarro I've modified your original script to use non-strict mode:

tf_backend.prepare(model, strict=False)

It seems to work now. Let me know if anything breaks and I'll follow up.

tjingrant avatar Jun 11 '18 01:06 tjingrant

@tjingrant Great!! Thanks a lot. I will test it ASAP and let you know.

inakinavarro avatar Jun 11 '18 14:06 inakinavarro

After installing onnx 1.2.2 and converting the ResNet-50 model from https://github.com/onnx/models/tree/master/resnet50 to a TF pb file using tf_backend.prepare(model, strict=False), I tried to run the converted model and got a KeyError: 'pyfunc_0' error for the pool1_1 layer.

My understanding was that specifying strict=False may cause the network output to change since the semantics may change but that the network could be run (per PR #212). Has this change not been merged into v1.2.2?

asarah-github avatar Jun 22 '18 17:06 asarah-github

@asarah-github the PR has not made its way into any of our existing releases yet, so it won't be there if you installed a release version of onnx-tensorflow (I'm not sure whether you did, or whether you were confusing onnx with onnx-tf). In any case, can you do a master build of onnx-tensorflow and try again?
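
To see which packages are actually installed, the standard library can report distribution versions. This is a small sketch that assumes Python 3.8 or newer and assumes the distribution names `onnx`, `onnx-tf`, and `tensorflow`; `installed_versions` is a hypothetical helper name.

```python
from importlib import metadata


def installed_versions(distributions):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in distributions:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# onnx and onnx-tf are separate distributions with independent version numbers.
print(installed_versions(["onnx", "onnx-tf", "tensorflow"]))
```

Note that a source build may report the same version string as the latest release, so this only helps to tell the two packages apart, not to confirm that a given PR is included.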

tjingrant avatar Jun 22 '18 17:06 tjingrant

@tjingrant Sorry for the confusion on the version. Anyway, I built from master and ran again. Now the conversion fails with the following error.

...
  File "./lib/python3.5/site-packages/onnx_tf/backend.py", line 76, in prepare
    return cls.onnx_model_to_tensorflow_rep(model, strict)
  File "./lib/python3.5/site-packages/onnx_tf/backend.py", line 87, in onnx_model_to_tensorflow_rep
    return cls._onnx_graph_to_tensorflow_rep(model.graph, model.opset_import, strict)
  File "./lib/python3.5/site-packages/onnx_tf/backend.py", line 141, in _onnx_graph_to_tensorflow_rep
    onnx_node, tensor_dict, handlers, opset=opset, strict=strict)
  File "./lib/python3.5/site-packages/onnx_tf/backend.py", line 236, in _onnx_node_to_tensorflow_op
    return handler.handle(node, tensor_dict=tensor_dict, strict=strict)
  File "./lib/python3.5/site-packages/onnx_tf/handlers/handler.py", line 59, in handle
    return ver_handle(node, **kwargs)
  File "./lib/python3.5/site-packages/onnx_tf/handlers/backend/average_pool.py", line 17, in version_1
    kwargs.get("strict", True))
  File "./lib/python3.5/site-packages/onnx_tf/handlers/backend/pool_mixin.py", line 68, in pool
    x = PadMixin.get_padding_as_op(x, pads)
  File "./lib/python3.5/site-packages/onnx_tf/handlers/backend/pad_mixin.py", line 9, in get_padding_as_op
    num_dim = int(len(pads) / 2)
TypeError: object of type 'NoneType' has no len()

Any ideas?
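
The traceback suggests that `pads` is `None` when the pooling node carries no `pads` attribute. In the ONNX operator spec, absent pads default to zero at the start and end of every spatial axis, so one possible guard looks like the sketch below. `pads_or_default` is a hypothetical helper, the 7x7 kernel is a made-up example, and this is not necessarily the fix that the maintainers applied.

```python
def pads_or_default(pads, kernel_shape):
    """Return explicit ONNX-style pads, filling in zeros when none are given.

    ONNX stores pads as [begin_1, ..., begin_n, end_1, ..., end_n], so a
    pooling op with n spatial axes needs 2 * n values.
    """
    if pads is None:
        return [0] * (2 * len(kernel_shape))
    return list(pads)


# A pooling node without a pads attribute (kernel shape chosen for illustration).
pads = pads_or_default(None, kernel_shape=[7, 7])
num_dim = len(pads) // 2  # the computation that failed in the traceback
print(pads, num_dim)
```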

asarah-github avatar Jun 22 '18 19:06 asarah-github

I am also getting a similar error when I go from torch to onnx to tensorflow.

ValueError: callback pyfunc_0 is not found

 [[Node: prefix/PyFunc = PyFunc[Tin=[DT_FLOAT, DT_INT32, DT_INT32, DT_INT32, DT_INT32, DT_INT32, DT_INT32, DT_STRING], Tout=[DT_FLOAT], token="pyfunc_0", _device="/job:localhost/replica:0/task:0/device:CPU:0"](prefix/Relu, prefix/PyFunc/input_1, prefix/PyFunc/input_2, prefix/PyFunc/input_3, prefix/PyFunc/input_4, prefix/PyFunc/input_2, prefix/PyFunc/input_6, prefix/PyFunc/input_7)]]

Console error:

 UserWarning: Using the pooling op in compatibility mode.This means your graph cannot be serialized.Please configure your pooling operation to only use paddings that correspond to Tensorflow SAME or VALID padding.
  "correspond to Tensorflow SAME or VALID padding.", UserWarning)

PB model: https://drive.google.com/open?id=1gp1VF1lafDpxiqIUgVAgWvOeqVxoTAlh

@tjingrant @fumihwh Any ideas?

achalshah20 avatar Jul 25 '18 21:07 achalshah20

@asarah-github I tested the master version with resnet50 from https://github.com/onnx/models/tree/master/resnet50 and it works.

fumihwh avatar Jul 27 '18 00:07 fumihwh

@achalshah20 As the warning says: Please configure your pooling operation to only use paddings that correspond to Tensorflow SAME or VALID padding. For example, in PyTorch, an input of shape [1, 3, 5, 5] with kernel [3, 3] and pads [1, 1, 1, 1] corresponds to "SAME" in TF. But pads [2, 2, 2, 2] do not work with the default TF function. In that case we have to use compatibility mode and compute the pooling result manually, which is exactly what PyFunc does. PyFunc is also irreversible, meaning you cannot convert this pb back to ONNX.

fumihwh avatar Jul 27 '18 00:07 fumihwh

@fumihwh When you tested the master version of ResNet-50 did you do a master build of onnx-tf?

asarah-github avatar Jul 31 '18 22:07 asarah-github