
Setting FP32 for MatMul fails when converting ONNX to TRT with TensorRT 8.6.1.1

RobinsonKO opened this issue 1 year ago · 3 comments

Description

onnx: https://drive.google.com/file/d/1JgwgwIl71BnJRw2e9FtgV0DGSGzLy0OZ/view?usp=sharing

I tried to force FP32 for the MatMul ops in the self-attention and cross-attention layers on A100, but it does not seem to work.

  • I inserted Cast layers around the MatMul op (as sketched below). Did I use it wrong?

[Screenshots: Cast placement at the MatMul input and output]
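For reference, a minimal sketch of that kind of Cast insertion with onnx-graphsurgeon, assuming the surrounding attention runs in FP16 (the `attn` name filter and file names are hypothetical; adjust them to the actual node names in the model):

```python
import numpy as np
import onnx
import onnx_graphsurgeon as gs

# Run shape inference first so intermediate tensors carry dtype info.
model = onnx.shape_inference.infer_shapes(onnx.load("model.onnx"))
graph = gs.import_onnx(model)

for node in [n for n in graph.nodes if n.op == "MatMul" and "attn" in n.name]:
    # Upcast both MatMul inputs to FP32.
    for i, inp in enumerate(list(node.inputs)):
        fp32_in = gs.Variable(f"{node.name}/cast_in{i}", dtype=np.float32)
        graph.nodes.append(gs.Node(
            op="Cast", name=f"{node.name}/CastIn{i}",
            attrs={"to": onnx.TensorProto.FLOAT},
            inputs=[inp], outputs=[fp32_in]))
        node.inputs[i] = fp32_in
    # Downcast the FP32 result so downstream layers see FP16 again.
    orig_out = node.outputs[0]
    fp32_out = gs.Variable(f"{node.name}/out_fp32", dtype=np.float32)
    node.outputs[0] = fp32_out
    graph.nodes.append(gs.Node(
        op="Cast", name=f"{node.name}/CastOut",
        attrs={"to": onnx.TensorProto.FLOAT16},
        inputs=[fp32_out], outputs=[orig_out]))

graph.cleanup().toposort()
onnx.save(gs.export_onnx(graph), "model_fp32_matmul.onnx")
```

Note that in a weakly typed build, TensorRT is free to optimize such Casts away; see the discussion of precision constraints and strongly typed networks below.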

On A100, marking the Softmax of the ONNX self-attention as a network output aligns the results with the PyTorch KPI, but the same method does not align on Orin (the Orin KPI dropped by 30%).

  • On Orin, the Softmax result of the last cross-attention layer has to be marked as an output to align with the PyTorch KPI. This seems very strange.

Environment

A100
TensorRT Version: 8.6.1.1
NVIDIA GPU: A100
NVIDIA Driver Version: 470.199.02
CUDA Version: 11.4
Operating System: Linux
Python Version (if applicable): 3.8.10
PyTorch Version (if applicable): 1.10.0

Orin
TensorRT Version: 8.6.1.1


RobinsonKO · Aug 05 '24 16:08

Have you tried the ToT version (e.g. 10.2)? Since you are using a weakly typed network and, I assume, have enabled the FP16 flag, you will need to set the precision on the specific layers and also set the TRT builder flag to obey precision constraints, which forces the layers you want to be executed in FP32.
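For the weakly typed path, a sketch of what that looks like with the TensorRT Python API; the `attn` name filter is a hypothetical way to pick out the attention MatMuls:

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open("model.onnx", "rb") as f:
    assert parser.parse(f.read())

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)
# Without this flag, per-layer precisions are only hints that the
# builder may ignore while autotuning.
config.set_flag(trt.BuilderFlag.OBEY_PRECISION_CONSTRAINTS)

for i in range(network.num_layers):
    layer = network.get_layer(i)
    # Hypothetical filter: pin the attention MatMuls to FP32.
    if layer.type == trt.LayerType.MATRIX_MULTIPLY and "attn" in layer.name:
        layer.precision = trt.float32
        layer.set_output_type(0, trt.float32)

engine = builder.build_serialized_network(network, config)
```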

In addition, it is possible that on Orin the MHA is not fused into a single kernel, which might also affect performance and accuracy. Can you collect some nsys profiles on both platforms to see whether there are any kernel differences?

nvluxiaoz · Aug 06 '24 06:08

[screenshot] For MHA precision, we want the MatMul cast to FP32; it used to run in FP16. However, it doesn't work.

Wuqiman · Aug 06 '24 07:08

> For MHA precision, we want the MatMul cast to FP32; it used to run in FP16. However, it doesn't work.

Try marking this MatMul as a network output and assigning it the FP32 output format, then rebuild with trtexec.
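A sketch of that approach, marking the tensor at the ONNX level and then forcing FP32 on the outputs at build time; the node name is hypothetical, and `--outputIOFormats` needs one format spec per network output, in order:

```python
import onnx
import onnx_graphsurgeon as gs

model = onnx.shape_inference.infer_shapes(onnx.load("model.onnx"))
graph = gs.import_onnx(model)

# Hypothetical node name: pick the MatMul whose result you want to check.
matmul = next(n for n in graph.nodes if n.name == "attn/MatMul_1")
graph.outputs.append(matmul.outputs[0])
onnx.save(gs.export_onnx(graph.cleanup().toposort()), "model_marked.onnx")

# Then rebuild, forcing FP32 on the outputs, e.g.:
#   trtexec --onnx=model_marked.onnx --fp16 \
#           --outputIOFormats=fp32:chw,fp32:chw \
#           --saveEngine=model_marked.plan
```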

lix19937 · Aug 07 '24 01:08

You can use the builder flag kSTRONGLY_TYPED to force TensorRT to respect the casts. See https://github.com/NVIDIA/TensorRT/blob/release/10.8/include/NvInfer.h#L10299 for more info.
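A rough sketch of a strongly typed build with the TensorRT 10 Python API (trtexec's equivalent is `--stronglyTyped`); the file name assumes the Cast-instrumented model from above:

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
# In a strongly typed network, tensor dtypes (including the inserted
# Casts) are taken from the model and TensorRT will not override them.
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.STRONGLY_TYPED))
parser = trt.OnnxParser(network, logger)
with open("model_fp32_matmul.onnx", "rb") as f:
    assert parser.parse(f.read())

config = builder.create_builder_config()
# Precision flags such as BuilderFlag.FP16 are not allowed in this mode;
# FP16 regions must be expressed via Casts in the ONNX itself.
engine = builder.build_serialized_network(network, config)
```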

kevinch-nv · Feb 12 '25 02:02

Closing due to inactivity. Please feel free to reopen!

poweiw · May 29 '25 21:05