[int8 quantization] rules for correct Q/DQ node placement with add & concat operations, unclear documentation
I want to achieve a fully int8 model with maximum int8 speed optimizations, but the documentation is very unclear.
Fusion of nodes: Conv, Sigmoid, Mul, Add?
- I know that Conv, Sigmoid, Mul will be fused.
- Is it also possible to fuse Conv, Sigmoid, Mul, Add? If yes, how can I achieve this? This is unclear from the docs (qdq-placement-recs).
How should I handle Q/DQ nodes with concat?
- Should all the inputs/outputs have Q/DQ nodes? Anything else I should know?
How should I handle Q/DQ nodes with split?
- Should all the inputs/outputs have Q/DQ nodes? Anything else I should know?
Why are the Scale & PointWise operations introduced in my graph?
- For the two bottlenecks on the right, a PointWise operation is added for some strange reason. What is the reason for this, and why is it added?
- Furthermore, from the first Conv to the second Conv a Scale is introduced. What is the reason for this?
- In the second bottleneck on the left a Scale is added, but not in the first one. What is the reason for this? All these operations introduce extra latency; is there an option to omit them?
Strange behavior in the first "bottleneck" layer
Why are extra reformat layers added in the first bottleneck layer, and why are the operations not fused? As you can see, the Q/DQ nodes are correctly placed...
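A minimal sketch of one way to dump per-layer info so fusions and reformat layers are visible, assuming the standard TensorRT Python bindings; `model.onnx` is a placeholder path:

```python
# Build an int8 engine from an ONNX file and print per-layer information so
# fused kernels, layer precisions, and inserted reformat layers show up.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("model.onnx", "rb") as f:          # placeholder path
    assert parser.parse(f.read()), parser.get_error(0)

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.INT8)        # Q/DQ nodes supply the scales
config.profiling_verbosity = trt.ProfilingVerbosity.DETAILED

plan = builder.build_serialized_network(network, config)
engine = trt.Runtime(logger).deserialize_cuda_engine(plan)

inspector = engine.create_engine_inspector()
# One JSON record per engine layer: name, fused ops, precision, reformats.
print(inspector.get_engine_information(trt.LayerInformationFormat.JSON))
```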
Environment
TensorRT Version: 8.6.3
ONNX Version: 1.15.0, opset 17
Relevant Files
@ttyio
Is it also possible to fuse Conv, Sigmoid, Mul, Add?
Could you try removing the Q/DQ between the Mul and the Add, and leave the Q/DQ on the other branch of the Add?
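A minimal onnx-graphsurgeon sketch of this edit; the Mul -> Q/DQ -> Add pattern match and the file names are assumptions about the model, not the exact node names in this thread:

```python
# Bypass a Q/DQ pair that sits between a Mul and an Add, leaving any Q/DQ
# on the other Add input untouched.
import onnx
import onnx_graphsurgeon as gs

graph = gs.import_onnx(onnx.load("model.onnx"))  # placeholder path

for q in [n for n in graph.nodes if n.op == "QuantizeLinear"]:
    producers = q.inputs[0].inputs      # nodes producing the Q input tensor
    consumers = q.outputs[0].outputs    # nodes consuming the Q output tensor
    if not (producers and producers[0].op == "Mul"):
        continue
    if not (len(consumers) == 1 and consumers[0].op == "DequantizeLinear"):
        continue
    dq = consumers[0]
    adds = dq.outputs[0].outputs
    if adds and all(n.op == "Add" for n in adds):
        # Rewire each Add to read the Mul output directly, bypassing Q/DQ.
        for add in adds:
            add.inputs = [q.inputs[0] if t is dq.outputs[0] else t
                          for t in add.inputs]

graph.cleanup().toposort()              # drops the now-dangling Q/DQ nodes
onnx.save(gs.export_onnx(graph), "model_noqdq.onnx")
```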
How should I handle Q/DQ nodes with concat? How should I handle Q/DQ nodes with split?
The Q node commutes with concat/slice, so usually we don't need to handle their inputs specially.
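A quick numpy illustration of why this commutation holds, assuming (as TensorRT does when it propagates Q across a concat) that all inputs share one per-tensor scale:

```python
# With a single per-tensor scale, quantizing after the concat produces the
# same int8 tensor as concatenating already-quantized inputs.
import numpy as np

def quantize(x, scale):
    # Symmetric int8 quantization, same rounding convention as QuantizeLinear.
    return np.clip(np.round(x / scale), -128, 127).astype(np.int8)

rng = np.random.default_rng(0)
a = rng.normal(size=(4, 8)).astype(np.float32)
b = rng.normal(size=(4, 8)).astype(np.float32)
scale = 0.05  # shared per-tensor scale

q_then_concat = np.concatenate([quantize(a, scale), quantize(b, scale)], axis=1)
concat_then_q = quantize(np.concatenate([a, b], axis=1), axis=None) if False else \
    quantize(np.concatenate([a, b], axis=1), scale)
assert np.array_equal(q_then_concat, concat_then_q)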
For the two bottlenecks on the right, a PointWise operation is added for some strange reason.
Both of the attached files are screenshots of the ONNX file, not the TRT graph. The screenshot is fuzzy; I am not sure what those PointWise layers are. Could you upgrade your TRT to the latest 10.0 to see if anything changes?
Furthermore, from the first Conv to the second Conv a Scale is introduced. What is the reason for this?
Because your ONNX has the pattern Conv -> Q/DQ -> Split -> Q/DQ -> Conv, it is fused as Conv(INT8-OUT) -> DQ+Q -> Conv(INT8-IN). The dangling DQ+Q in the middle creates a Scale layer. Could you try removing the Q/DQ pair after the first Conv?
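The same rewiring idea as the earlier graphsurgeon sketch can remove this pair, reusing that `graph`; here matching Conv -> Q -> DQ -> Split (again, the pattern is an assumption about the model):

```python
# Remove the dangling Q/DQ pair after the first Conv so no Scale layer is
# left between Conv(INT8-OUT) and the Split.
for q in [n for n in graph.nodes if n.op == "QuantizeLinear"]:
    producers = q.inputs[0].inputs
    consumers = q.outputs[0].outputs
    if not (producers and producers[0].op == "Conv"):
        continue
    if not (len(consumers) == 1 and consumers[0].op == "DequantizeLinear"):
        continue
    dq = consumers[0]
    splits = dq.outputs[0].outputs
    if splits and all(n.op == "Split" for n in splits):
        # Feed the Conv output straight into the Split.
        for split in splits:
            split.inputs = [q.inputs[0] if t is dq.outputs[0] else t
                            for t in split.inputs]

graph.cleanup().toposort()
```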
Could you try removing the Q/DQ between the Mul and the Add, and leave the Q/DQ on the other branch of the Add?
Tried this, but it is not getting fused.
The Q node commutes with concat/slice, so usually we don't need to handle their inputs specially.
So does this mean that no Q/DQ node is required as input? Is the Q/DQ node only required for computational operations?
Because your ONNX has the pattern Conv -> Q/DQ -> Split -> Q/DQ -> Conv, it is fused as Conv(INT8-OUT) -> DQ+Q -> Conv(INT8-IN). The dangling DQ+Q in the middle creates a Scale layer. Could you try removing the Q/DQ pair after the first Conv?
So, as you can see, from the PointWise on the left to the last Concat/Conv a Scale is again introduced. What is the reason for this, and how is this fused? I assumed that you need to add a Q/DQ node after the Add operation to get an int8 output, but still a Scale is introduced. How can I avoid this? Or is this again because of the Q/DQ before the Split?
Both of the attached files are screenshots of the ONNX file, not the TRT graph. The screenshot is fuzzy; I am not sure what those PointWise layers are. Could you upgrade your TRT to the latest 10.0 to see if anything changes?
Both ONNX and engine graphs were already added in the first post; see Relevant Files. I cannot try TRT 10 because 8.6 is the latest version supported by the latest JetPack version for the Jetson AGX Orin.
Adding @nzmora for visibility.
Both ONNX and engine graphs were already added in the first post
When I open your attachments, both images are the same.
So does this mean that no Q/DQ node is required as input? Is the Q/DQ node only required for computational operations?
No Q/DQ is required for concat/slice since they commute with Q/DQ.
What is the reason for this, and how is this fused?
I am not sure if this is an 8.6-only issue. Have you tried running your model on TRT 10.0 with a desktop GPU?
@ttyio, just a small update from my side (with updated ONNX and engine graphs included).
As you can see from the graph, most of the (double) Q/DQ operations are fixed and/or properly placed, which resulted in a reduction in reformat layers and a model that is, in my opinion, fully int8.
But there are three strange things happening in the model which I do not understand:
- In the skip connections, Scale operations are introduced (which were not there before). They take up to 14% of the latency budget. Is there a possibility to reduce this, or is this correct?
- In the ONNX graph there is one skip connection, but in the engine graph there are two. What is the reason for this? It does not seem to harm model accuracy, but both introduce a Scale operation, which adds extra latency.
- For some strange reason, in a slightly bigger model with the same architecture, reformat operations are introduced in the first bottleneck layer (but still everything is int8); these are not introduced in the smaller model. Looking at the ONNX graph I do not see anything strange. What is the reason for this? Am I doing something wrong?

I am not sure if this is an 8.6-only issue. Have you tried running your model on TRT 10.0 with a desktop GPU?
I need to deploy models on Jetson systems, which only support TRT 8.6. How would testing on TRT 10.0 help, since it will not be possible to deploy the model on a Jetson if I am correct?
@Michelvl92 Yours looks like a YOLOv8 model. I'm having similar issues trying to quantize YOLOv8.
- I'm trying to fuse Conv, BN, Sigmoid, Mul, Add into a single block, similar to the fusion in the image below, but I could not achieve this despite having Q/DQ in the right place (see the "expected result" and "actual result" screenshots). My assumption is that this fusion is only possible with Conv, BN, Add; introducing any activation function (act) in the middle causes it to fuse into (Conv, BN), (act, Add).
- There's also a weird problem causing a normal Conv, BN, SiLU layer not to fuse; instead it splits into two blocks and uses FP32 without any reason (see the "conv, bn, silu not fused" screenshot). This only happens to this exact layer; the rest of the network is fine.
Both TensorRT 8.5 and 10.6 give the same result.
Any help is appreciated.
UPDATE
- About the Conv, BN, SiLU, Add fusion: I discovered that it is possible to fuse these into a single Conv layer if you add Q/DQ between every node (see the "Conv, BN, SiLU, Add Q/DQ placement for fusion" and "Result" screenshots), though I'm unsure about its impact on accuracy, as placing Q/DQ right after a weighted node is not recommended in qdq-placement-recs. This works on TRT 10.6, while 8.5 gives an error as no tactic is found. A sketch of this kind of placement is below.
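For anyone reproducing this from PyTorch: a minimal pytorch-quantization sketch, where standalone TensorQuantizer modules insert Q/DQ on plain tensor edges at ONNX export. The block layout is an assumption modeled on a YOLOv8-style bottleneck (two representative quantizer placements shown), not the exact model in this thread:

```python
import torch.nn as nn
from pytorch_quantization import nn as quant_nn
from pytorch_quantization.nn import TensorQuantizer
from pytorch_quantization.tensor_quant import QuantDescriptor

class QuantBottleneck(nn.Module):
    """Conv-BN-SiLU block with a quantized residual add (hypothetical)."""
    def __init__(self, ch):
        super().__init__()
        qdesc = QuantDescriptor(num_bits=8, calib_method="histogram")
        self.conv = quant_nn.QuantConv2d(ch, ch, 3, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(ch)
        self.act = nn.SiLU()
        # Standalone quantizers become Q/DQ pairs on these edges at export:
        # one after the activation and one on the skip branch, so the Add
        # sees int8 on both of its inputs.
        self.quant_act = TensorQuantizer(qdesc)
        self.quant_skip = TensorQuantizer(qdesc)

    def forward(self, x):
        y = self.quant_act(self.act(self.bn(self.conv(x))))
        return y + self.quant_skip(x)

# Before torch.onnx.export, enable real QuantizeLinear/DequantizeLinear nodes:
# quant_nn.TensorQuantizer.use_fb_fake_quant = True
```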