returnn
Onnx: ceildiv yields wrong result during conversion
We had one particular problem when converting a Conformer acoustic model from TensorFlow to ONNX:
calculating the sequence lengths after convolution gave wrong results on the ONNX side.
The issue is the `ceildiv` operation, which is currently defined as:

```
ceildiv(a, b) := -floordiv(-a, b)
```
For the conversion to work properly, I suggest changing it to:

```
ceildiv'(a, b) := (a + b - 1) / b
```

which gave the correct results. Thoughts?
I don't understand. What is wrong about the first version?
I guess mathematically nothing, but after conversion it looked like ONNX was doing a `floordiv(a, b)`
on the values instead. After changing the formula, it worked as intended.
We should better understand this before just making such an arbitrary change. Is this a bug in ONNX then? Did you report it? Can you link the issue?
Note that your proposed alternative variant has some disadvantages:
- It is incorrect for negative `b`. (Maybe never relevant for our cases, but I'm not so happy about having a partly incorrect implementation.)
- It might be slower. (There was some benchmark on pure Python for this. Not sure if this holds true for TensorFlow, ONNX or PyTorch.)
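The difference for negative `b` can be checked in plain Python (a quick sketch; Python's `//` is floor division, like TF's floordiv):

```python
def ceildiv(a, b):
    # current definition: -floordiv(-a, b)
    return -(-a // b)


def ceildiv_alt(a, b):
    # proposed alternative: floordiv(a + b - 1, b); only valid for b > 0
    return (a + b - 1) // b


print(ceildiv(7, 3), ceildiv_alt(7, 3))    # 3 3 (they agree for b > 0)
print(ceildiv(7, -3), ceildiv_alt(7, -3))  # -2 -1 (they differ for b < 0)
```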
You find some related posts on StackOverflow, for example here.
Thanks for your comments. I haven't reported this yet.
I noticed that this occurs for integer inputs only. For floating point types this works as expected. Running the following program
```python
import tensorflow as tf

x = [3, 7, 10]


class CustomModule(tf.Module):
    @tf.function(input_signature=[tf.TensorSpec([None], tf.int32), tf.TensorSpec([], tf.int32)])
    def __call__(self, a, b):
        return -tf.math.floordiv(-a, b)


ceildiv = CustomModule()
# correctly produces [1, 3, 4]
print(ceildiv(x, 3).numpy())
tf.saved_model.save(ceildiv, "saved-model-example")
```
then exporting it to onnx with
```shell
python3 -m tf2onnx.convert --saved-model saved-model-example/ --output example.onnx
```
and running it with
```python
import numpy as np
import onnxruntime as ort

ort_sess = ort.InferenceSession('example.onnx')
x = np.array([3, 7, 10]).astype(np.int32)
y = np.array([3]).astype(np.int32)
print(ort_sess.run(None, {'a': x, 'b': y}))
```
yields the observed behaviour. So I would say it isn't necessary to change the formula itself.
I don't exactly understand your last comment. In that example, you have integers only, not float, or not?
What is some minimal code which behaves wrong? Or by "observed behaviour", you actually mean the wrong behavior? What is the output of that code you posted?
And can you report this to ONNX then, and link the issue here? This sounds like a serious bug?
Maybe it's a bug of the tf2onnx.convert script, and not ONNX itself? In that case, you probably should report the bug in the tensorflow-onnx repo.
Can you inspect the resulting ONNX graph? Can you post that here?
> I don't exactly understand your last comment. In that example, you have integers only, not float, or not?
I tried this for floats too, which yields the correct behaviour (the outputs you would expect from ceildiv). The outputs of the runs are below.
> What is some minimal code which behaves wrong? Or by "observed behaviour", you actually mean the wrong behavior? What is the output of that code you posted?
By "observed behaviour" I mean the wrong one. The output would be:

```
First Script: [1, 3, 4]
Second Script: [1, 2, 3]
Second Script floats: [1.0, 3.0, 4.0]
```
> And can you report this to ONNX then, and link the issue here? This sounds like a serious bug? Maybe it's a bug of the tf2onnx.convert script, and not ONNX itself? In that case, you probably should report the bug in the tensorflow-onnx repo.
Yes, I can report this.
> Can you inspect the resulting ONNX graph? Can you post that here?
The model that is converted for integers:
The model that is converted for floating points:
So it looks like `Div` is behaving inconsistently with Python/TensorFlow for negative integer numbers? I'm not sure if this is supposed to be like that or not. If it is supposed to be like that, then this is a bug in tf2onnx, which should have worked around that. If it is not supposed to be like that, then it is a bug in ONNX itself, or maybe in the specific ONNX backend you are using.
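To make the suspected mismatch concrete: Python and TF `floordiv` round toward negative infinity, while integer division in many backends truncates toward zero. A plain-Python sketch of the difference:

```python
import math

a, b = 7, 3
floor_q = math.floor(-a / b)  # -3: floor semantics (Python //, tf.math.floordiv)
trunc_q = math.trunc(-a / b)  # -2: truncation semantics (C-style integer division)

print(-floor_q)  # 3 -> ceildiv(7, 3) is correct under floor semantics
print(-trunc_q)  # 2 -> the same formula is off by one under truncation
```

This matches the outputs reported above for `x = [3, 7, 10]` and `b = 3`: `[1, 3, 4]` under floor semantics versus `[1, 2, 3]` under truncation.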
Here is the issue in tf2onnx: https://github.com/onnx/tensorflow-onnx/issues/2174
A possible workaround for us:
```python
def onnx_compat_floor_div(a, b):
    # https://github.com/onnx/tensorflow-onnx/issues/2174
    # Only ever divides nonnegative values, where floor and truncating
    # division agree, so the exported ONNX graph behaves correctly.
    abs_a, abs_b = tf.abs(a), tf.abs(b)
    return tf.where(
        a * b >= 0,
        a // b,
        -(abs_a // abs_b) - tf.cast(abs_a % abs_b != 0, dtype=a.dtype))
```
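As a sanity check of the idea, here is the same trick in plain Python (a sketch): the actual division only ever sees nonnegative operands, for which floor and truncating division agree, so the result does not depend on which semantics the backend uses.

```python
def compat_floor_div(a, b):
    # Plain-Python mirror of the TF workaround above.
    abs_a, abs_b = abs(a), abs(b)
    if a * b >= 0:
        # Same sign (or zero): quotient is nonnegative, floor == truncation.
        return a // b
    # Opposite signs: divide nonnegative values and correct the rounding.
    return -(abs_a // abs_b) - (1 if abs_a % abs_b != 0 else 0)


# Matches Python's floor division for all sign combinations:
for a in range(-10, 11):
    for b in (-3, -1, 1, 3):
        assert compat_floor_div(a, b) == a // b
```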
We should only use this function if needed though, i.e. you could introduce a new special flag `onnx_export` or so, and if it is true, use this `onnx_compat_floor_div`. Similarly to our `get_global_inf_value` function, maybe some function `is_onnx_export_global()` or so.
Why did you reopen? Just to keep track of this until it is fixed upstream? Or is something supposed to be done on the RETURNN side?
I have some conversion issues using the current RETURNN upstream. Specifically, a Conformer model using relative positional encoding with sub-sampling. The same model was converted to ONNX a couple of months ago, for which I used a different workaround.
It looks like this is related to `ceildiv` and/or relative positional encoding; I'm not sure yet, currently investigating. `onnxruntime` complains about incompatible shapes for matrix multiplication, which I suspect is in the rel. pos. encoding sub-graph, but the error message is not really helpful yet. I'm going to build a new version for debugging.
Everything works fine when the maximum sequence length is not divisible by 3, the sub-sampling factor.
I reopened a little bit too early because I thought I had found the reason, but that turned out to be wrong.