onnx-tensorflow
Can not use converted ONNX -> TF graph independently [py_func issue]
I am trying to export an ONNX model to TensorFlow and then use it for inference (possibly in another environment). Here is an example of exporting an MNIST model:
import numpy as np
import onnx
from onnx_tf.backend import prepare
import tensorflow as tf
print('loading onnx model')
onnx_model = onnx.load('train/model.onnx')
print('prepare tf model')
tf_rep = prepare(onnx_model)
print(tf_rep.predict_net)
print('-----')
print(tf_rep.predict_net.tensor_dict)
test = np.random.rand(1, 1, 28, 28)
out = tf_rep.run(test)._0
print(out)
with tf.Session() as persisted_sess:
    print("load graph")
    persisted_sess.graph.as_default()
    tf.import_graph_def(tf_rep.predict_net.graph.as_graph_def(), name='')
    # for op in persisted_sess.graph.get_operations():
    #     print(op)
    inp = persisted_sess.graph.get_tensor_by_name(
        tf_rep.predict_net.tensor_dict[tf_rep.predict_net.external_input[0]].name
    )
    out = persisted_sess.graph.get_tensor_by_name(
        tf_rep.predict_net.tensor_dict[tf_rep.predict_net.external_output[0]].name
    )
    res = persisted_sess.run(out, {inp: test})
    print(res)
tf_rep.export_graph('train/tf.pb')
The script above executes successfully and the prediction also runs successfully (res == out here). Now I import the saved model in TF:
import numpy as np
import tensorflow as tf
from tensorflow.python.platform import gfile
name = "train/tf.pb"
with tf.Session() as persisted_sess:
    print("load graph")
    with gfile.FastGFile(name, 'rb') as f:
        graph_def = tf.GraphDef()
        graph_def.ParseFromString(f.read())
    persisted_sess.graph.as_default()
    tf.import_graph_def(graph_def, name='')
    test = np.random.rand(1, 1, 28, 28).astype(np.float32)
    inp = persisted_sess.graph.get_tensor_by_name('0:0')
    out = persisted_sess.graph.get_tensor_by_name('LogSoftmax:0')
    feed_dict = {inp: test}
    classification = persisted_sess.run(out, feed_dict)
And now I get an error about a nonexistent PyFunc:
2018-05-14 15:28:36.780899: W tensorflow/core/framework/op_kernel.cc:1198] tensorflow.python.framework.errors_impl.UnknownError: exceptions.KeyError: 'pyfunc_0'
[[Node: PyFunc = PyFunc[Tin=[DT_FLOAT, DT_INT32, DT_INT32, DT_INT32, DT_INT32, DT_INT32, DT_STRING], Tout=[DT_FLOAT], token="pyfunc_0", _device="/job:localhost/replica:0/task:0/device:CPU:0"](transpose_2, PyFunc/input_1, PyFunc/input_1, PyFunc/input_3, PyFunc/input_4, PyFunc/input_5, PyFunc/input_6)]]
Full log: https://pastebin.com/0bQeMPTG
At export time, however, the model worked fine (see above). I investigated which functions the TF graph actually uses:
(Pdb) from tensorflow.python.ops import script_ops
(Pdb) script_ops._py_funcs._funcs
{'pyfunc_0': <function py_pool at 0x7f395db05500>, 'pyfunc_1': <function py_pool at 0x7f395dab99b0>}
(Pdb) funcs = script_ops._py_funcs._funcs.values()
(Pdb) func = funcs[0]
(Pdb) func.func_name
'py_pool'
(Pdb) func.func_code
<code object py_pool at 0x7f395fb9deb0, file "/usr/local/lib/python2.7/dist-packages/onnx_tf/backends/backend_v1.py", line 94>
(Pdb)
So I have a question: is it intended that the TF graph uses an external function from the onnx_tf package, or is this simply a bug?
Is there any way to make this model independent of the onnx and onnx-tf packages?
I guess there are some pooling ops in your ONNX pb. You could take a look at them. If they satisfy one of the following conditions:
- auto_pad is not set to SAME_UPPER or VALID
- count_include_pad is 1
we use py_func to do _compatibility_pool, because TensorFlow has no corresponding pool op.
We didn't consider that a user would want to do what you did:
onnx -> tensorflow -> tensorflow
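For readers wondering why a graph containing py_func breaks in a fresh process, here is a toy model of the mechanism. It is not TensorFlow's actual implementation: PyFuncRegistry, export_node and run_node are hypothetical names, and JSON stands in for the serialized graph. The only point is that the serialized node carries a token such as pyfunc_0, while the callable itself lives in a table inside the exporting process.

```python
import json


class PyFuncRegistry:
    """Toy stand-in for an in-process table mapping tokens to Python callables."""

    def __init__(self):
        self._funcs = {}

    def register(self, func):
        # Tokens are handed out in registration order: pyfunc_0, pyfunc_1, ...
        token = "pyfunc_%d" % len(self._funcs)
        self._funcs[token] = func
        return token

    def call(self, token, *args):
        # Raises KeyError when the token was registered in another process.
        return self._funcs[token](*args)


def export_node(registry, func):
    """Serialize a node: only the token is written out, never the function body."""
    token = registry.register(func)
    return json.dumps({"op": "PyFunc", "token": token})


def run_node(registry, serialized, *args):
    """Look the token up in whatever registry the current process has."""
    node = json.loads(serialized)
    return registry.call(node["token"], *args)


# Exporting process: the callable is registered, so running the node works.
exporter = PyFuncRegistry()
blob = export_node(exporter, max)
same_process_result = run_node(exporter, blob, 3, 7)

# Fresh process (modelled by an empty registry): the token resolves to nothing.
fresh = PyFuncRegistry()
try:
    run_node(fresh, blob, 3, 7)
    missing_token = None
except KeyError as err:
    missing_token = err.args[0]
```

In this toy model the second lookup fails on the token pyfunc_0, which mirrors the KeyError in the log above.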
Basically, this model comes from PyTorch; the full net class is below:
import torch
import torch.nn as nn
import torch.nn.functional as F
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
        self.conv2_drop = nn.Dropout2d()
        self.fc1 = nn.Linear(320, 50)
        self.fc2 = nn.Linear(50, 10)

    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), 2))
        x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))
        x = x.view(-1, 320)
        x = F.relu(self.fc1(x))
        x = F.dropout(x, training=self.training)
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)
So I do pytorch -> ONNX -> tensorflow and then try to do inference in TensorFlow (the end goal is to run it under TensorFlow Serving).
Btw, the converted ONNX model works fine; moreover, the converted pytorch -> onnx -> caffe2 model works fine. The problem occurs only with TensorFlow.
So I suppose these F.max_pool2d operations are converted to py_func.
@fumihwh we did consider the onnx -> tensorflow -> tf serving path; that is why we have export_graph in our API.
@nmakhotkin unfortunately, as @fumihwh pointed out, max_pool is a very complicated issue, and we strive to strike a balance between logical clarity/conciseness, numerical precision, the need to pass all ONNX backend tests, and performance. The fix in your case might be simple since you are not padding your feature maps (thus "VALID" padding in TF terms), but please do allow us some time to come up with a more systematic fix.
@fumihwh this essentially boils down to the issue I raised to you on this PR (https://github.com/onnx/onnx-tensorflow/pull/83). Specifically and I quote:
And we should avoid using python function as much as possible because that would prevent us from serializing the graph (thus we can't pass the generated graph to tf_serving).
I think we should revert part of that PR to use native max pooling as much as possible. Your solution was easier to reason about and more concise, but my original implementation was there for a very practical reason.
@nmakhotkin can you provide me with the onnx model generated by torch?
@tjingrant yes, here it is (uploaded to GDrive): onnx model (generated by pytorch) - https://drive.google.com/file/d/13yJYYgQiiqxP8Khm-PZ5Q6JwxLi2w_4A/view?usp=sharing
original pytorch model - https://drive.google.com/file/d/11BJOI5ucsSmM-9aZBYVIBcDvf9ILihnU/view?usp=sharing
@nmakhotkin would you like to try again with this PR https://github.com/onnx/onnx-tensorflow/pull/171/files ?
You can check out a different branch as well (https://github.com/onnx/onnx-tensorflow/tree/fix-pool).
@tjingrant thanks! I'll try today (it is morning for me now) and will write the results here.
@tjingrant The fix works! I just tested onnx-tf on the fix-pool branch: I converted my onnx model to tensorflow again and model inference works! Now it is able to successfully recognize some examples from MNIST:
$ python tf_inference.py
/usr/local/lib/python2.7/dist-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
from ._conv import register_converters as _register_converters
2018-05-15 11:49:27.417070: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
Prediction of file 5.png: 6
Prediction of file 2.png: 2
Prediction of file 9.png: 9
Prediction of file 1.png: 1
Prediction of file 4.png: 4
Prediction of file 0.png: 0
Prediction of file 7.png: 7
Now there are no PyFunc ops in the graph. Full set of ops is below:
(Pdb) set([op.type for op in persisted_sess.graph.get_operations()])
set([u'MatMul', u'NoOp', u'LogSoftmax', u'Const', u'Sub', u'ExpandDims', u'Reshape', u'MaxPool', u'Transpose', u'Rank', u'Relu', u'Add', u'Identity', u'Pad', u'Split', u'Range', u'Mul', u'Pack', u'Placeholder', u'Conv2D', u'StridedSlice'])
P.S. Now waiting for the PR to be merged :)
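The op-set check above can be wrapped into a small helper that is run before shipping a pb file. This is only a sketch: find_pyfunc_nodes is a made-up name, and the set of op types (PyFunc, PyFuncStateless, EagerPyFunc) is my assumption about which names the py_func variants use. It only relies on the object having a .node list whose items expose .name and .op, as tf.GraphDef does, so a SimpleNamespace stand-in is used here to keep the example runnable without TensorFlow.

```python
from types import SimpleNamespace

# Assumed op type names for Python-callback ops; adjust if your TF version differs.
PY_FUNC_OP_TYPES = {"PyFunc", "PyFuncStateless", "EagerPyFunc"}


def find_pyfunc_nodes(graph_def):
    """Return the names of nodes that depend on an in-process Python callback."""
    return [node.name for node in graph_def.node if node.op in PY_FUNC_OP_TYPES]


# Stand-in for a parsed GraphDef; a real one would come from ParseFromString.
fake_graph_def = SimpleNamespace(node=[
    SimpleNamespace(name="Conv2D", op="Conv2D"),
    SimpleNamespace(name="PyFunc", op="PyFunc"),
    SimpleNamespace(name="LogSoftmax", op="LogSoftmax"),
])

offending = find_pyfunc_nodes(fake_graph_def)
if offending:
    print("graph is not self-contained, py_func nodes: %s" % offending)
```

An empty result means that no node of these types is present in the graph.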
I'm still getting the PyFunc error even when using the fix-pool branch.
The model I converted is PyTorch's ResNet.
I've been doing the same thing @nmakhotkin is trying to do: PyTorch -> ONNX -> TensorFlow representation, and then to a pb file for running inference.
I was able to convert the MNIST example code from PyTorch to a pb file, but could not do the same for the ResNet model.
@kartk Could you upload your onnx pb?
The model I use is a slightly modified ResNet called Hopenet.
Here is the IR representation : https://drive.google.com/file/d/1VRCHFq7lAIhQFEZYr2o0Ij-1xKjbgIf6/view?usp=sharing
Here is the Converted pb : https://drive.google.com/file/d/1PK45MwNDXPg-tTMe-M0errojUnVSMAXb/view?usp=sharing
@kartk You should get a warning message that says:
UserWarning: Using the pooling op in compatibility mode.This means your graph cannot be serialized.
Please configure your pooling operation to only use paddings that correspond to Tensorflow SAME or VALID padding.
One layer in your network cannot use a native TensorFlow op, so we have to use the compatibility pool. I checked, and it seems to be the following layer:
input [1, 64, 112, 112]
pads [1, 1, 1, 1]
output [56, 56]
kernel [3, 3]
strides [2, 2]
If you want to use pool with "SAME" in tensorflow, the pads should be [0, 1, 0, 1].
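As a rough way to see where [0, 1, 0, 1] comes from, the sketch below computes the padding that TensorFlow's SAME mode applies along each spatial axis. The function names are made up for illustration and are not part of onnx-tf. Results are (before, after) pairs per axis, so [(0, 1), (0, 1)] reads as [0, 1, 0, 1] when flattened axis by axis.

```python
def tf_same_padding_1d(input_size, kernel, stride):
    """Padding (before, after) that TF SAME uses along one axis."""
    out_size = -(-input_size // stride)  # ceil(input_size / stride)
    total = max((out_size - 1) * stride + kernel - input_size, 0)
    before = total // 2                  # any odd leftover goes to the end
    return before, total - before


def tf_same_padding(spatial_shape, kernel_shape, strides):
    """Per-axis SAME padding for a pooling window."""
    return [
        tf_same_padding_1d(size, k, s)
        for size, k, s in zip(spatial_shape, kernel_shape, strides)
    ]


# The layer reported above: 112x112 feature map, 3x3 kernel, stride 2.
resnet_pool_pads = tf_same_padding([112, 112], [3, 3], [2, 2])
print(resnet_pool_pads)
```

A symmetric pad of 1 on every side does not match this result, which is why the converter falls back to the compatibility pool for that layer.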
Thanks @fumihwh.
I'm very new to PyTorch and neural networks as a whole. Where do I need to change the pads so that they are compatible with TensorFlow?
I just tried to convert a pretrained ResNet (resnet101) model to onnx, then to tensorflow. As @kartk said, there is still a py_func present in the graph.
Is there a way to get rid of it completely somehow?
@nmakhotkin to put it briefly, PyTorch's ResNet implementation is incorrect or, more precisely, not faithful to the original paper. This might sound unbelievable, but let me point you to another thread where we discussed this topic extensively (https://github.com/tensorflow/benchmarks/issues/134).
And let me quote the relevant part, for the first max-pooling layer in ResNet, here's what paddings are added in various frameworks:
Pytorch: Left 1, right 1. In this case this is equivalent to Left 1, right 0.
Caffe: Left 0, right 1.
TensorFlow SAME: Left 0, right 1.
This stems from the fact that PyTorch only supports symmetric pads. It is not a problem caused by onnx-tensorflow or TensorFlow per se, but rather an unfortunate consequence of the limitation of PyTorch.
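The padding difference can be checked on a small 1-D example. The sketch below is plain Python rather than any framework's pooling op, and max_pool_1d is a made-up helper. It pads with negative infinity so that padded positions never win the max. With kernel 3 and stride 2 on an even-length row, padding 1 on both sides gives the same windows as padding 1 on the left only, while padding 1 on the right only shifts every window by one element.

```python
def max_pool_1d(values, kernel, stride, pad_left, pad_right):
    """Max pooling over a list with explicit left/right padding (floor mode)."""
    neg_inf = float("-inf")
    padded = [neg_inf] * pad_left + list(values) + [neg_inf] * pad_right
    out_len = (len(padded) - kernel) // stride + 1
    return [
        max(padded[i * stride:i * stride + kernel])
        for i in range(out_len)
    ]


row = list(range(8))  # 0, 1, ..., 7

symmetric = max_pool_1d(row, kernel=3, stride=2, pad_left=1, pad_right=1)
left_only = max_pool_1d(row, kernel=3, stride=2, pad_left=1, pad_right=0)
right_only = max_pool_1d(row, kernel=3, stride=2, pad_left=0, pad_right=1)

print(symmetric)   # windows start one element before the data
print(left_only)   # same windows: the right pad is never reached
print(right_only)  # windows start on the data, last one reaches the pad
```

The first two results agree and the third differs, which is the mismatch between the PyTorch-style padding and the Caffe / TensorFlow SAME padding listed above.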
@nmakhotkin as a result, there is no workaround that is both semantics-preserving AND serializable. But we can try to give you an option to slightly alter the semantics of max pool so that you can serialize the incorrect version of ResNet exported from PyTorch; expect some accuracy degradation of your model as a result.
Thanks for the answer! Yes, it would be nice to have an additional option flag which will control this behavior (either to export precisely or not).
@tjingrant are you planning to implement this workaround for the serialization of the PyTorch ResNet? It would be great! Thanks
Hi, absolutely, but we might have other priorities in the meantime, like supporting onnx v1.2; sorry for the delay, my estimate is that it'll be there before the end of next Wed.
That sounds great!! Thanks for your effort!!!
@inakinavarro @nmakhotkin hi, a tentative PR to address this issue has been created https://github.com/onnx/onnx-tensorflow/pull/212.
@inakinavarro I've modified your original script to use non-strict mode:
tf_backend.prepare(model, strict=False)
It seems to work now. Let me know if anything breaks and I'll follow up.
@tjingrant Great!! Thanks a lot. I will test it ASAP and let you know.
After installing onnx 1.2.2 and converting the ResNet-50 model from https://github.com/onnx/models/tree/master/resnet50 to a TF pb file using tf_backend.prepare(model, strict=False), I tried to run the converted model and got a KeyError: 'pyfunc_0' error for the pool1_1 layer.
My understanding was that specifying strict=False may change the network output, since the semantics may change, but that the network could still be run (per PR #212). Has this change not been merged into v1.2.2?
@asarah-github the PR has not made its way into any of our existing releases yet, so it won't be there if you installed a release version of onnx-tensorflow (I'm not sure whether you did, or whether you were confusing onnx with onnx-tf). In any case, can you do a master build of onnx-tensorflow and try again?
@tjingrant Sorry for the confusion on the version. Anyway, I built from master and ran again. Now the conversion fails with the following error.
...
File "./lib/python3.5/site-packages/onnx_tf/backend.py", line 76, in prepare
return cls.onnx_model_to_tensorflow_rep(model, strict)
File "./lib/python3.5/site-packages/onnx_tf/backend.py", line 87, in onnx_model_to_tensorflow_rep
return cls._onnx_graph_to_tensorflow_rep(model.graph, model.opset_import, strict)
File "./lib/python3.5/site-packages/onnx_tf/backend.py", line 141, in _onnx_graph_to_tensorflow_rep
onnx_node, tensor_dict, handlers, opset=opset, strict=strict)
File "./lib/python3.5/site-packages/onnx_tf/backend.py", line 236, in _onnx_node_to_tensorflow_op
return handler.handle(node, tensor_dict=tensor_dict, strict=strict)
File "./lib/python3.5/site-packages/onnx_tf/handlers/handler.py", line 59, in handle
return ver_handle(node, **kwargs)
File "./lib/python3.5/site-packages/onnx_tf/handlers/backend/average_pool.py", line 17, in version_1
kwargs.get("strict", True))
File "./lib/python3.5/site-packages/onnx_tf/handlers/backend/pool_mixin.py", line 68, in pool
x = PadMixin.get_padding_as_op(x, pads)
File "./lib/python3.5/site-packages/onnx_tf/handlers/backend/pad_mixin.py", line 9, in get_padding_as_op
num_dim = int(len(pads) / 2)
TypeError: object of type 'NoneType' has no len()
Any ideas?
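The traceback suggests that pads was None when get_padding_as_op called len(pads), which would happen if the ONNX node carries no pads attribute (the ONNX operator spec treats missing pads as zero at the start and end of every spatial axis). A possible guard, purely as a sketch: pads_or_zeros is a hypothetical helper and not onnx-tf code, and whether this is the right fix for the handler is an assumption.

```python
def pads_or_zeros(pads, num_spatial_dims):
    """Return ONNX-style pads, defaulting to zeros when the attribute is absent."""
    if pads is None:
        # ONNX default: no padding at the start or end of any spatial axis.
        return [0] * (2 * num_spatial_dims)
    if len(pads) != 2 * num_spatial_dims:
        raise ValueError(
            "expected %d pad values, got %d" % (2 * num_spatial_dims, len(pads))
        )
    return list(pads)


# A 2-D pooling node without a pads attribute.
default_pads = pads_or_zeros(None, num_spatial_dims=2)
# A node that sets pads explicitly is passed through unchanged.
explicit_pads = pads_or_zeros([1, 1, 1, 1], num_spatial_dims=2)
```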
I am also getting a similar error when I go from torch to onnx to tensorflow.
ValueError: callback pyfunc_0 is not found
[[Node: prefix/PyFunc = PyFunc[Tin=[DT_FLOAT, DT_INT32, DT_INT32, DT_INT32, DT_INT32, DT_INT32, DT_INT32, DT_STRING], Tout=[DT_FLOAT], token="pyfunc_0", _device="/job:localhost/replica:0/task:0/device:CPU:0"](prefix/Relu, prefix/PyFunc/input_1, prefix/PyFunc/input_2, prefix/PyFunc/input_3, prefix/PyFunc/input_4, prefix/PyFunc/input_2, prefix/PyFunc/input_6, prefix/PyFunc/input_7)]]
Console error:
UserWarning: Using the pooling op in compatibility mode.This means your graph cannot be serialized.Please configure your pooling operation to only use paddings that correspond to Tensorflow SAME or VALID padding.
"correspond to Tensorflow SAME or VALID padding.", UserWarning)
PB model: https://drive.google.com/open?id=1gp1VF1lafDpxiqIUgVAgWvOeqVxoTAlh
@tjingrant @fumihwh Any ideas?
@asarah-github I tested the master version of resnet50 from https://github.com/onnx/models/tree/master/resnet50 and it works.
@achalshah20
As the warning says: Please configure your pooling operation to only use paddings that correspond to Tensorflow SAME or VALID padding.
For example, in PyTorch, if you set input [1, 3, 5, 5], kernel [3, 3], pads [1, 1, 1, 1], it corresponds to "SAME" in TF.
But if you set pads [2, 2, 2, 2], it doesn't work with the default TF function. We then have to use compatibility mode and calculate the pool result manually. This is exactly what PyFunc does.
PyFunc is also irreversible, which means you cannot convert this pb back to ONNX.
@fumihwh When you tested the master version of ResNet-50 did you do a master build of onnx-tf?