TensorRT
LayerNorm in FP16: accuracy is still insufficient
GPU: 4090
[06/18/2025-03:48:34] [I] TensorRT version: 10.10.0
[06/18/2025-03:48:34] [I] Loading standard plugins
[06/18/2025-03:48:34] [I] [TRT] [MemUsageChange] Init CUDA: CPU +2, GPU +0, now: CPU 26, GPU 390 (MiB)
[06/18/2025-03:48:37] [I] [TRT] [MemUsageChange] Init builder kernel library: CPU +1749, GPU +8, now: CPU 1977, GPU 398 (MiB)
[06/18/2025-03:48:37] [I] Start parsing network model.
[06/18/2025-03:48:37] [I] [TRT] ----------------------------------------------------------------
[06/18/2025-03:48:37] [I] [TRT] Input filename: model.onnx
[06/18/2025-03:48:37] [I] [TRT] ONNX IR version: 0.0.8
[06/18/2025-03:48:37] [I] [TRT] Opset version: 17
[06/18/2025-03:48:37] [I] [TRT] Producer name: pytorch
[06/18/2025-03:48:37] [I] [TRT] Producer version: 2.7.1
[06/18/2025-03:48:37] [I] [TRT] Domain:
[06/18/2025-03:48:37] [I] [TRT] Model version: 0
[06/18/2025-03:48:37] [I] [TRT] Doc string:
[06/18/2025-03:48:37] [I] [TRT] ----------------------------------------------------------------
[06/18/2025-03:48:37] [W] [TRT] ModelImporter.cpp:503: Make sure input input_ids has Int64 binding.
[06/18/2025-03:48:37] [W] [TRT] ModelImporter.cpp:503: Make sure input attention_mask has Int64 binding.
[06/18/2025-03:48:39] [I] Finished parsing network model. Parse time: 2.17323
[06/18/2025-03:48:39] [I] Set shape of input tensor input_ids for optimization profile 0 to: MIN=1x1 OPT=8x100 MAX=128x1024
[06/18/2025-03:48:39] [I] Set shape of input tensor attention_mask for optimization profile 0 to: MIN=1x1 OPT=8x100 MAX=128x1024
[06/18/2025-03:48:42] [W] [TRT] Detected layernorm nodes in FP16.
[06/18/2025-03:48:42] [W] [TRT] Running layernorm after self-attention with FP16 Reduce or Pow may cause overflow. Forcing Reduce or Pow Layers in FP32 precision, or exporting the model to use INormalizationLayer (available with ONNX opset >= 17) can help preserving accuracy.
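The overflow this warning describes is easy to reproduce with NumPy: the variance term of LayerNorm squares the activations and reduces over the hidden dimension, and an FP16 accumulator overflows past the FP16 maximum of 65504 long before an FP32 one does. The 30.0 activation magnitude below is an illustrative assumption (post-attention residual activations of this size are plausible); 1024 matches the hidden size of Qwen3-Embedding-0.6B.

```python
import numpy as np

hidden = 1024                             # hidden size of Qwen3-Embedding-0.6B
x = np.full(hidden, 30.0, dtype=np.float16)  # assumed activation magnitude

# LayerNorm's variance computation: square, then reduce over the hidden dim.
sq = x * x                                # 900.0 each -- still fine in FP16

var_sum_fp16 = sq.sum(dtype=np.float16)   # accumulate in FP16
var_sum_fp32 = sq.sum(dtype=np.float32)   # accumulate in FP32

print(var_sum_fp16)  # inf      (921600 exceeds the FP16 max of 65504)
print(var_sum_fp32)  # 921600.0
```

This is why TensorRT offers to keep the Reduce/Pow layers in FP32, and why opset >= 17 helps: with opset 17 the exporter emits a single ONNX `LayerNormalization` node, which TensorRT imports as an INormalizationLayer that accumulates internally in FP32 even when the surrounding network runs in FP16.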
Reproduction steps
In /usr/local/lib/python3.**/dist-packages/optimum/exporters/onnx/convert.py (line 467, def export_pytorch()), set opset=17.
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B", backend="onnx")
model.save_pretrained("export")
/usr/src/tensorrt/bin/trtexec --minShapes=input_ids:1x1,attention_mask:1x1 --optShapes=input_ids:8x100,attention_mask:8x100 --maxShapes=input_ids:128x1024,attention_mask:128x1024 --onnx=model.onnx --saveEngine=./model.plan --fp16
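If the INormalizationLayer path still leaves some reductions in FP16, one workaround is trtexec's precision-constraint flags: --precisionConstraints=obey makes the builder honor per-layer requests, and --layerPrecisions pins the named layers to FP32 while the rest of the network stays FP16. The layer-name pattern below is an assumption for illustration; take the real layer names from a --verbose build log or the engine inspector, and note that wildcard matching behavior can vary by TensorRT version.

```shell
/usr/src/tensorrt/bin/trtexec \
  --onnx=model.onnx \
  --saveEngine=./model.plan \
  --fp16 \
  --minShapes=input_ids:1x1,attention_mask:1x1 \
  --optShapes=input_ids:8x100,attention_mask:8x100 \
  --maxShapes=input_ids:128x1024,attention_mask:128x1024 \
  --precisionConstraints=obey \
  --layerPrecisions="*LayerNorm*":fp32   # hypothetical name pattern -- check --verbose output
```

The engine may run somewhat slower than pure FP16, since the pinned layers execute in FP32, but the LayerNorm reductions no longer overflow.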