
Onnx: ceildiv yields wrong result during conversion

Open Gerstenberger opened this issue 1 year ago • 12 comments

We had one particular problem when converting a Conformer acoustic model from TensorFlow to ONNX: Calculating the sequence lengths after convolution resulted in wrong calculations on ONNX side. The issue is the ceildiv operation which is currently defined as:

ceildiv(a, b) := -floordiv(-a, b)

In order for the conversion to work properly, I suggest changing it to:

ceildiv'(a, b) := floordiv(a + b - 1, b)

which gave the correct results. Thoughts?
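
For positive operands the two definitions agree; a quick plain-Python sketch for illustration:

def ceildiv_v1(a, b):
    return -(-a // b)        # current definition: -floordiv(-a, b)

def ceildiv_v2(a, b):
    return (a + b - 1) // b  # proposed alternative

print([ceildiv_v1(a, 3) for a in (3, 7, 10)])  # [1, 3, 4]
print([ceildiv_v2(a, 3) for a in (3, 7, 10)])  # [1, 3, 4]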

Gerstenberger avatar May 16 '23 22:05 Gerstenberger

I don't understand. What is wrong about the first version?

albertz avatar May 16 '23 22:05 albertz

Mathematically nothing, I guess, but after conversion it looked like ONNX was doing floordiv(a, b) on the values instead. When I changed the formula, it worked as intended.

Gerstenberger avatar May 16 '23 22:05 Gerstenberger

We should understand this better before just making such an arbitrary change. Is this a bug in ONNX then? Did you report it? Can you link the issue?

Note that your proposed alternative variant has some disadvantages:

  • It is incorrect for negative b (see the counterexample after this list). Maybe never relevant for our cases, but I'm not so happy about having a partly incorrect implementation.
  • It might be slower. (There was some benchmark on pure Python for this. Not sure if this holds true for TensorFlow, ONNX or PyTorch.)
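
For illustration, a quick plain-Python counterexample for the negative-b point (ceildiv(7, -3) should be ceil(7 / -3) = -2):

print(-(-7 // -3))           # current definition:   -2 (correct)
print((7 + (-3) - 1) // -3)  # proposed alternative: -1 (wrong for negative b)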

You can find some related posts on StackOverflow, for example here.

albertz avatar May 16 '23 22:05 albertz

Thanks for your comments. I haven't reported this yet.

I noticed that this occurs for integer inputs only. For floating point types this works as expected. Running the following program

import tensorflow as tf

x = [3, 7, 10]

class CustomModule(tf.Module):

    @tf.function(input_signature=[tf.TensorSpec([None], tf.int32), tf.TensorSpec([], tf.int32)])
    def __call__(self, a, b):
        # ceildiv(a, b) implemented as -floordiv(-a, b), as currently defined in RETURNN
        return -tf.math.floordiv(-a, b)

ceildiv = CustomModule()

# correctly produces [1, 3, 4]
print(ceildiv(x, 3).numpy())

ceildiv(x, 3)
tf.saved_model.save(ceildiv, "saved-model-example")

then exporting it to onnx with

python3 -m tf2onnx.convert --saved-model saved-model-example/ --output example.onnx

and running it with

import numpy as np
import onnxruntime as ort

ort_sess = ort.InferenceSession('example.onnx')

x = np.array([3, 7, 10]).astype(np.int32)
y = np.array([3]).astype(np.int32)

print(ort_sess.run(None, {'a': x, 'b': y}))

yields the observed behaviour. So it isn't necessary to change the formula itself, I would say.

Gerstenberger avatar May 17 '23 00:05 Gerstenberger

I don't exactly understand your last comment. In that example, you have integers only, not float, or not?

What is some minimal code which behaves wrong? Or by "observed behaviour", do you actually mean the wrong behaviour? What is the output of the code you posted?

And can you report this to ONNX then, and link the issue here? This sounds like a serious bug?

albertz avatar May 17 '23 07:05 albertz

Maybe it's a bug in the tf2onnx.convert script, and not in ONNX itself? In that case, you should probably report the bug in the tensorflow-onnx repo.

Can you inspect the resulting ONNX graph? Can you post that here?
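
For example, something like this should work to dump the converted graph (standard onnx Python API, using the example.onnx from the scripts above):

import onnx

model = onnx.load("example.onnx")
# Print the operators tf2onnx emitted; the interesting part is whether the
# TF FloorDiv ended up as a plain integer Div.
print(onnx.helper.printable_graph(model.graph))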

albertz avatar May 17 '23 07:05 albertz

I don't exactly understand your last comment. In that example, you have integers only, not float, or not?

I tried this for floats too, which yields the correct behaviour (the outputs you would expect from ceildiv). The outputs of the runs are below.

What is some minimal code which behaves wrong? Or by "observed behaviour", you actually mean the wrong behavior? What is the output of that code you posted?

By observed behaviour I mean the wrong one. The output would be:

First Script: [1, 3, 4]
Second Script: [1, 2, 3]
Second Script floats: [1.0, 3.0, 4.0]

And can you report this to ONNX then, and link the issue here? This sounds like a serious bug? Maybe it's a bug of the tf2onnx.convert script, and not ONNX itself? In that case, you probably should report the bug in the tensorflow-onnx repo.

Yes, I can report this.

Can you inspect the resulting ONNX graph? Can you post that here?

The model that is converted for integers: int_model

The model that is converted for floating points: float_model

Gerstenberger avatar May 17 '23 07:05 Gerstenberger

So it looks like Div is behaving inconsistently with Python/TensorFlow for negative integer numbers? I'm not sure if it is supposed to be like that or not. If it is supposed to be like that, then this is a bug in tf2onnx, which should have worked around it. If it is not supposed to be like that, then it is a bug in ONNX itself, or maybe in the specific ONNX backend you are using.
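
Concretely, floor division (Python/TF FloorDiv, rounding towards -inf) and truncating division (rounding towards zero, which is apparently what the converted integer graph does) differ exactly for the negative intermediate values here; a small NumPy sketch for illustration:

import numpy as np

a = np.array([3, 7, 10], dtype=np.int32)
b = 3

# floor division, as in Python/TF: -floordiv(-a, b) gives the intended ceildiv
print(-np.floor_divide(-a, b))               # [1 3 4]
# truncating division, rounding towards zero: reproduces the wrong result above
print((-np.trunc(-a / b)).astype(np.int32))  # [1 2 3]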

albertz avatar May 17 '23 07:05 albertz

Here is the issue in tf2onnx: https://github.com/onnx/tensorflow-onnx/issues/2174

Gerstenberger avatar May 26 '23 06:05 Gerstenberger

Here is the issue in tf2onnx: onnx/tensorflow-onnx#2174

A possible workaround for us:

def onnx_compat_floor_div(a, b):
  # Floor division written such that the result stays correct when TF FloorDiv
  # gets converted to the truncating integer Div in ONNX.
  # https://github.com/onnx/tensorflow-onnx/issues/2174
  abs_a, abs_b = tf.abs(a), tf.abs(b)
  return tf.where(
    a * b >= 0,
    a // b,  # same sign: truncating and floor division agree
    # opposite signs: floor(a / b) == -(|a| // |b|) - 1 if there is a remainder
    -(abs_a // abs_b) - tf.cast(abs_a % abs_b != 0, dtype=a.dtype)
  )

We should only use this function when needed though, i.e. you could introduce a new special flag onnx_export or so, and if it is true, use this onnx_compat_floor_div. Similar to our get_global_inf_value function, maybe some function is_onnx_export_global() or so.
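
A rough sketch of how that could look (is_onnx_export_global() is only the suggested name here, it does not exist yet):

def ceildiv(a, b):
  # Hypothetical: only switch to the ONNX-compatible path when exporting,
  # otherwise keep the plain FloorDiv-based implementation.
  if is_onnx_export_global():
    return -onnx_compat_floor_div(-a, b)
  return -tf.math.floordiv(-a, b)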

albertz avatar May 26 '23 12:05 albertz

Why did you reopen? Just to keep track of this until it is fixed upstream? Or is something supposed to be done on the RETURNN side?

albertz avatar Dec 14 '23 15:12 albertz

I have some conversion issues using the current RETURNN upstream. Specifically, a Conformer model using relative positional encoding with sub-sampling. The same model was converted to ONNX a couple of months ago, for which I used a different workaround.

It looks like this is related to ceildiv and/or relative positional encoding. I'm not sure yet; I'm currently investigating. onnxruntime complains about incompatible shapes for matrix multiplication, which I suspect comes from the rel. pos. encoding sub-graph, but the error message is not really helpful yet. I'm going to build a new version for debugging.

Everything works fine when the maximum sequence length is not divisible by 3, the sub-sampling factor.

I reopened a little bit too early because I thought I had found the reason, but that turned out to be wrong.

Gerstenberger avatar Dec 14 '23 16:12 Gerstenberger