onnx2tf
pre_explicit_broadcast should not expand scalar tensor
Issue Type
Feature Request
OS
Linux
onnx2tf version number
1.19.7
onnx version number
1.15.0
onnxruntime version number
1.16.3
onnxsim (onnx_simplifier) version number
0.4.33
tensorflow version number
2.15.0
Download URL for ONNX
Parameter Replacement JSON
none
Description
-
The converted tflite graph cannot run through ArmNN with the Arm Compute Library GPU backend (which is faster than tflite on the native CPU).
-
For ONNX nodes Input -> Gemm -> Mul (with a scalar), using batch=1:
The converted tflite graph contains a fully connected layer with a two-dimensional bias, while the original bias is one-dimensional:
Since the new dimension is 1, mathematically the converted tflite graph is still correct. Unfortunately, Arm Compute Library currently errors out if the bias has more than one dimension: https://github.com/ARM-software/ComputeLibrary/blob/c2a79a4b8c51ce835eaf984f3a1370447b3282c4/src/cpu/operators/CpuFullyConnected.cpp#L431
- It appears that the extra dimension is added in pre_explicit_broadcast: https://github.com/PINTO0309/onnx2tf/blob/4e8f29129e6ea03f45f9d6520cc077522a790249/onnx2tf/utils/common_functions.py#L845:L858. Can this function be changed to do nothing if the tensor is a scalar?
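For illustration, a pattern of this shape can be produced with a few lines of PyTorch; the layer sizes and the scalar below are placeholders, not values from the attached model.

# Minimal sketch (illustrative only): a Linear layer exported as Gemm with a
# 1-D bias, followed by multiplication with a Python scalar, batch fixed to 1.
import torch

class GemmMul(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(512, 1000)  # exported as Gemm with a 1-D bias (C)

    def forward(self, x):
        return self.fc(x) * 1.5  # scalar Mul directly after the Gemm

torch.onnx.export(GemmMul(), torch.randn(1, 512), "gemm_mul_repro.onnx", opset_version=13)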
It is a pain in the ass to generate and test the ONNX myself for your request. Please provide me with an ONNX file, even a partial model. I know all I have to do is write a few lines of code in PyTorch. But even that is a pain in the ass.
For sure, here is an ONNX file to reproduce: https://mythic.box.com/s/wmwsrpd41vpe27pqv1k3c7z3lb2daxqr
The tool has an option to optimize for the GPU Delegate, but I noticed a bug in Gemm's implementation during testing, so I fixed it and released v1.19.8.
https://github.com/PINTO0309/onnx2tf/releases/tag/1.19.8
Gemm
- Fixed an Abort bug when converting models with --optimization_for_gpu_delegate or -ofgd.
(onnx / before tflite / after tflite comparison screenshots)
-
Thanks for the fix. Unfortunately, I don't want to use --optimization_for_gpu_delegate since it actually makes ArmNN run slower. Is it possible to have a fix that does not require --optimization_for_gpu_delegate?
I think you have misunderstood something; pre_explicit_broadcast is irrelevant here. The value corresponding to the bias (onnx: C) is already processed as a scalar inside onnx2tf. However, when the model is converted to tflite, the converter generates a bias of [1, 1000] on its own. What is at stake in this issue is the C of ONNX, i.e., the bias. NNAPI has a long-standing problem with properly handling bias.
https://github.com/PINTO0309/onnx2tf/blob/6d0c3d828b1f461c11a8516600acbc9698fdbf10/onnx2tf/ops/Gemm.py#L194-L205
# cast to attribute data type
x = tf.cast(x, tf.float32)
y = tf.cast(y, tf.float32)
z = tf.cast(z, tf.float32)
if not optimization_for_gpu_delegate:
    if z is not None:
        result = tf.convert_to_tensor(alpha) * tf.matmul(x, y) + tf.convert_to_tensor(beta) * z  # <----- here
    else:
        result = tf.convert_to_tensor(alpha) * tf.matmul(x, y) + tf.convert_to_tensor(beta)
else:
    result = tf.convert_to_tensor(alpha) * tf.matmul(x, y) - (tf.convert_to_tensor(beta) * z) * tf.convert_to_tensor(-1.0, dtype=z.dtype)
Here is a debug printout that proves the bias is converted without being expanded:
beta
1.0
# onnx: C, TFLite: z
z.shape
TensorShape([1000])
(tf.convert_to_tensor(beta) * z).shape
TensorShape([1000])
I have known from the beginning that the GPU Delegate cannot handle a bias with more than 2 dimensions. However, there is a bug in the tflite converter where it does not produce a scalar bias. So I implemented a workaround that I don't like: the GPU optimization option.
You should file an issue with the NNAPI repository or the TensorFlow repository. By the way, I already raised that discussion in a TensorFlow issue over a year ago, but it has been ignored.
Good luck.
I ran an ONNX model that does not have a Mul node following the Gemm. The converted tflite graph has a FullyConnected node with a one-dimensional bias, as expected. Here is the ONNX graph: https://mythic.box.com/s/e3n6a0grvtvhgf5x37ve4c568n87ilnq
With my original ONNX graph, if I change https://github.com/PINTO0309/onnx2tf/blob/4e8f29129e6ea03f45f9d6520cc077522a790249/onnx2tf/utils/common_functions.py#L845:L858 to not extend the tensor when the input is a scalar, then the converted tflite graph fuses the Mul into the fully connected layer, and the layer has a one-dimensional bias as expected as well.
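The change I tested is essentially an early-return guard of this form; a simplified sketch, with the parameter names assumed from common_functions.py and the rest of the function left as is.

def pre_explicit_broadcast(const_or_var_1, const_or_var_2):
    # Skip dimension expansion entirely when either operand is a scalar;
    # scalar * tensor broadcasts correctly without adding dimensions.
    if len(const_or_var_1.shape) == 0 or len(const_or_var_2.shape) == 0:
        return const_or_var_1, const_or_var_2
    # ... original expansion / transpose logic continues below ...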
That's very strange... It will take some time to do a little research. I can only guess at this point, but it is possible that the tflite optimizer is doing something when it optimizes the following Mul, rather than this being a Gemm conversion issue.
Somehow, I could roughly guess what the optimizer was doing.
repro_new_onnx_model_v2.onnx.zip
repro_new_onnx_model_v2_1.onnx.zip
repro_new_onnx_model_v2_1_float32.tflite.zip
PS:
- Batch size: the problem appears only when the dynamic batch size (?) is changed to a fixed number such as 1.
- Without the -b option: onnx2tf -i repro_new_onnx_model_v2.onnx
- With the -b option: onnx2tf -i repro_new_onnx_model_v2.onnx -b 1
Try it.
Fixes: https://github.com/PINTO0309/onnx2tf/releases/tag/1.19.10
Thanks for the quick fix. The fix won't work if the graph has a series of Mul or Div nodes following the Gemm; both the Mul and the Div are fused into the Gemm, which again produces a two-dimensional bias.
Here is an ONNX file with Gemm -> Mul -> Div: https://mythic.box.com/s/i6ginfybruy1ibeg5pmzluxh6vgj9ye3
I understood. However, I am skeptical that we really need to automatically support that pattern in onnx2tf. I understand that a clean conversion would be great for everyone, but the two consecutive scalar Mul and Div should be able to be pre-fused before exporting to ONNX.
I don't know how you are generating your model, but wouldn't you just calculate 1.9921875 / 1.072122573852539 = 1.858171396 in advance?
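In other words, something like the following before export; the constants are the ones mentioned above, and where the scales live in your model is of course an assumption.

# Pre-fusing the two consecutive scalar ops before ONNX export (sketch):
mul_scale = 1.9921875
div_scale = 1.072122573852539
fused_scale = mul_scale / div_scale  # ~1.858171396
# The exported graph then contains Gemm -> Mul(fused_scale) instead of
# Gemm -> Mul -> Div, which the 1.19.10 fix already converts cleanly.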
This is an ONNX model that relies on ONNX Runtime to do graph optimization (including fusion), but ONNX Runtime doesn't export the optimized model since its optimization is done in memory only. I might be able to trace its PyTorch model and do the fusion there, where the operators are separated for different quantization needs per hardware backend, although that becomes irrelevant for float32 in the final tflite model.
Is it absolutely necessary for pre_explicit_broadcast to extend the scalar tensor? explicit_broadcast does not broadcast the scalar tensor, so why does pre_explicit_broadcast need to do that? It would be an easier change compared to the pattern-matching change.
why does pre_explicit_broadcast need to do that? It would be an easier change compared to the pattern-matching change.
pre_explicit_broadcast is not necessary if you only want to adapt the tool to the convenience of your own model. I have already tested about 600 different model conversions, including a more complex model with 150,000 OPs. The dimensional conversion is not yet perfect, but so far I have determined that extending the dimensions and then transposing and broadcasting is the best means of reducing the probability of error. The problem is not as simple as you think.
I'll look into the Mul -> Div, Mul -> Mul, Div -> Mul -> Div -> Mul, Mul -> Add, Add -> Mul, ... solution in a bit.
Idea note: search for all consecutive scalar operations after the Gemm; it doesn't matter whether there are 2 or a million, just keep searching for the next scalar operation after the Gemm. However, if an output has two branches, the process becomes too complicated, so the search ends there.
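A rough sketch of that search, using the onnx_graphsurgeon-style node API that onnx2tf already relies on; the helper name and the set of scalar ops are illustrative.

SCALAR_OPS = {'Mul', 'Div', 'Add', 'Sub'}

def consecutive_scalar_ops_after_gemm(gemm_node):
    # Walk forward from a Gemm through element-wise ops whose other input is a
    # scalar constant; stop as soon as an output branches to several consumers.
    chain = []
    node = gemm_node
    while True:
        consumers = [n for out in node.outputs for n in out.outputs]
        if len(consumers) != 1:
            break  # branching (or dead-end) output: the search ends here
        nxt = consumers[0]
        if nxt.op not in SCALAR_OPS:
            break
        consts = [i for i in nxt.inputs if hasattr(i, 'values')]
        if len(consts) != 1 or consts[0].values.size != 1:
            break  # the other operand is not a scalar constant
        chain.append(nxt)
        node = nxt
    return chain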
I think the consecutive-scalar pattern search would unfortunately add some complexity.
I looked at the code change: https://github.com/PINTO0309/onnx2tf/commit/621bd27c332df916bcc625b2c801bed7822ce62e It seems the original intent was not aimed at scalar tensors, so perhaps a workaround is to have an extra user-specified option to not extend the tensor for Mul/Div nodes etc. But again, it's a user-defined option, which I understand adds extra maintenance liability.
I'm fine to close the issue as is, and I will try to clean the graph on my end.
Just wanted to add that I also have an issue with multidimensional bias with this model.
ONNX: (graph screenshot)
TFLite: (graph screenshot)
Extrapolating from #577, I was able to change if (is_scalar_1 or is_scalar_2) and graph_node.i().op == 'Gemm': to if (is_scalar_1 or is_scalar_2) and graph_node.i().op == 'Gemm' or graph_node.o().op == 'Sigmoid': to produce 1D bias. This is just a hack, of course.
There are so many patterns to deal with that it seems quite difficult to make a mechanical decision. (I understand that it is a hack.)
I think it'd be almost as good to let the user specify the nodes for which they want to skip expansion. It could be the actual names of the nodes, or, to be more powerful, it could accept a lambda that takes a node and returns a boolean. Getting it right automatically every time is difficult, but it's much easier for a particular model.
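A hypothetical sketch of what that could look like; none of these names exist in the current onnx2tf API, they are made up to illustrate the idea.

import numpy as np

def should_skip_expansion(node_name, tensor, skip_node_names=None, skip_predicate=None):
    # True if pre_explicit_broadcast should leave this tensor untouched.
    if skip_node_names and node_name in skip_node_names:
        return True
    if skip_predicate is not None and skip_predicate(node_name, tensor):
        return True
    return False

# Example predicate: skip expansion whenever the operand is a single value.
skip_scalars = lambda name, t: np.size(t) == 1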
I see. That might be a good idea.
- Idea note: If no change occurs in the shape between the input tensor and the output tensor of the operation, skip the broadcast regardless of whether the previous operation was Gemm or not. It should be effective no matter how many arithmetic operations on scalar values are performed in succession. (See the sketch after this list.)
- repro_new_onnx_model_v2_2.onnx: repro_new_onnx_model_v2_2.onnx.zip
  - Without the -b option: (ONNX / TFLite graph screenshots)
  - With the -b 1 option: (ONNX / TFLite graph screenshots)
- ViT-B-16__openai_partial.onnx: ViT-B-16__openai_partial_cut.onnx.zip (ONNX / TFLite graph screenshots)
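A minimal sketch of that shape check as I read the idea note; this is not the actual 1.24.1 implementation.

import numpy as np

def broadcast_needed(input_shape, const_operand, output_shape):
    # If the element-wise op leaves the tensor shape unchanged and the other
    # operand is a single value, dimension expansion can be skipped, no matter
    # how many such scalar ops follow one another.
    if list(input_shape) == list(output_shape) and np.size(const_operand) == 1:
        return False
    return True

# e.g. a [1, 1000] Gemm output multiplied by a scalar keeps its shape:
assert not broadcast_needed([1, 1000], 1.858171396, [1, 1000])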
Fix: https://github.com/PINTO0309/onnx2tf/releases/tag/1.24.1
Wow, thanks for fixing this!