ReduceMax failure of TensorRT 8.5.10 when converting an ONNX file to an engine file with trtexec on GPU Orin
Description
I tried to generate an engine file from an ONNX file on the Orin GPU, but it failed:

[05/15/2024-11:45:16] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +0, GPU +4, now: CPU 0, GPU 4 (MiB)
[05/15/2024-11:45:16] [E] Saving engine to file failed.
[05/15/2024-11:45:16] [E] Engine set up failed
Environment
TensorRT Version:
NVIDIA GPU:
NVIDIA Driver Version:
CUDA Version:
CUDNN Version:
Operating System:
Python Version (if applicable):
Tensorflow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if so, version):
Relevant Files
Model link:
Steps To Reproduce
Commands or scripts:
Have you tried the latest release?:
Can this model run on other frameworks? For example, run the ONNX model with ONNXRuntime (polygraphy run <model.onnx> --onnxrt):
Please add --verbose to get a more detailed log.
Hi, I replaced the original nn.LayerNorm block with an nn.BatchNormalization block. Now my new network's ONNX file is:
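(For reference, a minimal sketch of the kind of swap described here, assuming a simple Linear -> norm -> ReLU subgraph like the one visible in the log below; the module name, file name, and shapes are hypothetical:)

```python
import torch
import torch.nn as nn

class Subgraph(nn.Module):
    """Hypothetical subgraph: Linear -> normalization -> ReLU."""
    def __init__(self, in_dim=12, hidden=64, use_batchnorm=True):
        super().__init__()
        self.linear = nn.Linear(in_dim, hidden)
        # LayerNorm normalizes over the last dim; BatchNorm1d normalizes over
        # the channel dim, so the input must be transposed around it.
        self.norm = nn.BatchNorm1d(hidden) if use_batchnorm else nn.LayerNorm(hidden)
        self.act = nn.ReLU()

    def forward(self, x):                      # x: (N, L, in_dim)
        y = self.linear(x)                     # (N, L, hidden)
        if isinstance(self.norm, nn.BatchNorm1d):
            y = self.norm(y.transpose(1, 2)).transpose(1, 2)
        else:
            y = self.norm(y)
        return self.act(y)

# Export the BatchNorm variant (eval mode so running stats are used):
model = Subgraph(use_batchnorm=True).eval()
torch.onnx.export(model, torch.randn(12, 20, 12), "subgraph_bn.onnx",
                  input_names=["x"], output_names=["y"], opset_version=13)
```

Note that with opset 13, LayerNorm is exported as a decomposed subgraph of primitive ops (there is no LayerNormalization op before opset 17), while BatchNorm1d in eval mode exports as a single ONNX BatchNormalization node.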
According to the document "https://github.com/NVIDIA/Deep-Learning-Accelerator-SW/tree/main/operators", the BatchNormalization operator is natively supported by NVIDIA DLA, but when I try to generate an engine file from the ONNX file, it still fails. The end of the log is here:
[05/15/2024-20:44:50] [V] [TRT] Layer: MaxPool_5 Host Persistent: 1408 Device Persistent: 0 Scratch Memory: 0
[05/15/2024-20:44:50] [V] [TRT] Layer: Gemm_12 Host Persistent: 6752 Device Persistent: 0 Scratch Memory: 0
[05/15/2024-20:44:50] [V] [TRT] Layer: Gemm_13 || Gemm_14 Host Persistent: 5664 Device Persistent: 0 Scratch Memory: 0
[05/15/2024-20:44:50] [V] [TRT] Layer: Gemm_15 Host Persistent: 6752 Device Persistent: 0 Scratch Memory: 0
[05/15/2024-20:44:50] [V] [TRT] Layer: PWN(onnx::Div_41 + (Unnamed Layer* 33) [Shuffle], Div_17) Host Persistent: 244 Device Persistent: 0 Scratch Memory: 0
[05/15/2024-20:44:50] [V] [TRT] Layer: Gemm_19 Host Persistent: 6048 Device Persistent: 0 Scratch Memory: 0
[05/15/2024-20:44:50] [V] [TRT] Layer: Gemm_20 Host Persistent: 6048 Device Persistent: 0 Scratch Memory: 0
[05/15/2024-20:44:50] [V] [TRT] Layer: Gemm_21 Host Persistent: 6048 Device Persistent: 0 Scratch Memory: 0
[05/15/2024-20:44:50] [V] [TRT] Skipped printing memory information for 22 layers with 0 memory size i.e. Host Persistent + Device Persistent + Scratch Memory == 0.
[05/15/2024-20:44:50] [I] [TRT] Total Host Persistent Memory: 45280
[05/15/2024-20:44:50] [I] [TRT] Total Device Persistent Memory: 0
[05/15/2024-20:44:50] [I] [TRT] Total Scratch Memory: 0
[05/15/2024-20:44:50] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 0 MiB, GPU 132 MiB
[05/15/2024-20:44:50] [I] [TRT] [BlockAssignment] Started assigning block shifts. This will take 29 steps to complete.
[05/15/2024-20:44:50] [I] [TRT] [BlockAssignment] Algorithm ShiftNTopDown took 0.337024ms to assign 7 blocks to 29 nodes requiring 126464 bytes.
[05/15/2024-20:44:50] [V] [TRT] Total number of blocks in optimized block assignment: 7
[05/15/2024-20:44:50] [I] [TRT] Total Activation Memory: 126464
[05/15/2024-20:44:50] [V] [TRT] Finalize: MatMul_0 Set kernel index: 0
[05/15/2024-20:44:50] [V] [TRT] Finalize: MaxPool_5 Set kernel index: 1
[05/15/2024-20:44:50] [V] [TRT] Finalize: Gemm_12 Set kernel index: 2
[05/15/2024-20:44:50] [V] [TRT] Finalize: Gemm_13 || Gemm_14 Set kernel index: 3
[05/15/2024-20:44:50] [V] [TRT] Finalize: Gemm_15 Set kernel index: 2
[05/15/2024-20:44:50] [V] [TRT] Finalize: PWN(onnx::Div_41 + (Unnamed Layer* 33) [Shuffle], Div_17) Set kernel index: 4
[05/15/2024-20:44:50] [V] [TRT] Finalize: Gemm_19 Set kernel index: 5
[05/15/2024-20:44:50] [V] [TRT] Finalize: Gemm_20 Set kernel index: 6
[05/15/2024-20:44:50] [V] [TRT] Finalize: Gemm_21 Set kernel index: 6
[05/15/2024-20:44:50] [V] [TRT] Total number of generated kernels selected for the engine: 7
[05/15/2024-20:44:50] [V] [TRT] Kernel: 0 CASK_STATIC
[05/15/2024-20:44:50] [V] [TRT] Kernel: 1 CASK_STATIC
[05/15/2024-20:44:50] [V] [TRT] Kernel: 2 CASK_STATIC
[05/15/2024-20:44:50] [V] [TRT] Kernel: 3 CASK_STATIC
[05/15/2024-20:44:50] [V] [TRT] Kernel: 4 TRT_SERIALIZABLE:generatedNativePointwise
[05/15/2024-20:44:50] [V] [TRT] Kernel: 5 CASK_STATIC
[05/15/2024-20:44:50] [V] [TRT] Kernel: 6 CASK_STATIC
[05/15/2024-20:44:50] [V] [TRT] Disabling unused tactic source: CUDNN
[05/15/2024-20:44:50] [V] [TRT] Disabling unused tactic source: CUBLAS, CUBLAS_LT
[05/15/2024-20:44:50] [V] [TRT] Disabling unused tactic source: EDGE_MASK_CONVOLUTIONS
[05/15/2024-20:44:50] [V] [TRT] Disabling unused tactic source: JIT_CONVOLUTIONS
[05/15/2024-20:44:50] [V] [TRT] Engine generation completed in 10.7422 seconds.
[05/15/2024-20:44:50] [V] [TRT] Deleting timing cache: 141 entries, served 42 hits since creation.
[05/15/2024-20:44:50] [V] [TRT] Engine Layer Information:
Layer(NoOp): reshape_before_MatMul_0, Tactic: 0x0000000000000000, x (Float[12,20,12]) -> reshape_before_MatMul_0_out_tensor (Float[240,12,1,1])
Layer(NoOp): Reformatting CopyNode for Input Tensor 0 to MatMul_0, Tactic: 0x0000000000000000, reshape_before_MatMul_0_out_tensor (Float[240,12,1,1]) -> Reformatted Input Tensor 0 to MatMul_0 (Float[240,12:4,1,1])
Layer(CaskGemmConvolution): MatMul_0, Tactic: 0x00000000000201d1, Reformatted Input Tensor 0 to MatMul_0 (Float[240,12:4,1,1]) -> MatMul_0_out_tensor (Float[240,64:4,1,1])
Layer(NoOp): Reformatting CopyNode for Input Tensor 0 to reshape_after_MatMul_0, Tactic: 0x0000000000000000, MatMul_0_out_tensor (Float[240,64:4,1,1]) -> Reformatted Input Tensor 0 to reshape_after_MatMul_0 (Float[240,64,1,1])
Layer(NoOp): reshape_after_MatMul_0, Tactic: 0x0000000000000000, Reformatted Input Tensor 0 to reshape_after_MatMul_0 (Float[240,64,1,1]) -> onnx::Add_25 (Float[12,20,64])
Layer(Constant): backbone.subgraph.linear.bias + (Unnamed Layer* 4) [Shuffle], Tactic: 0x0000000000000000, -> (Unnamed Layer* 4) [Shuffle]_output (Float[1,1,64])
Layer(ElementWise): Add_1, Tactic: 0x0000000000000001, (Unnamed Layer* 4) [Shuffle]_output (Float[1,1,64]), onnx::Add_25 (Float[12,20,64]) -> input (Float[12,20,64])
Layer(NoOp): (Unnamed Layer* 6) [Shuffle], Tactic: 0x0000000000000000, input (Float[12,20,64]) -> (Unnamed Layer* 6) [Shuffle]_output (Float[12,20,64,1])
Layer(Scale): BatchNormalization_2 + Relu_3, Tactic: 0x0000000000000000, (Unnamed Layer* 6) [Shuffle]_output (Float[12,20,64,1]) -> Relu_3_out_tensor (Float[12,20,64,1])
Layer(NoOp): squeeze_after_Relu_3, Tactic: 0x0000000000000000, Relu_3_out_tensor (Float[12,20,64,1]) -> squeeze_after_Relu_3_out_tensor (Float[12,20,64])
Layer(Shuffle): Transpose_4 + (Unnamed Layer* 11) [Shuffle], Tactic: 0x0000000000000000, squeeze_after_Relu_3_out_tensor (Float[12,20,64]) -> (Unnamed Layer* 11) [Shuffle]_output (Float[12,64,20,1])
Layer(CaskPooling): MaxPool_5, Tactic: 0x5faf4a0a8a5670ed, (Unnamed Layer* 11) [Shuffle]_output (Float[12,64,20,1]) -> (Unnamed Layer* 12) [Pooling]_output (Float[12,64,1,1])
Layer(NoOp): (Unnamed Layer* 13) [Shuffle] + Squeeze_6, Tactic: 0x0000000000000000, (Unnamed Layer* 12) [Pooling]_output (Float[12,64,1,1]) -> x.1 (Float[12,64])
Layer(Reformat): reshape_before_Gemm_12_copy_input, Tactic: 0x00000000000003e8, x.1 (Float[1,64]) -> reshape_before_Gemm_12_copy_input (Float[1,64])
Layer(NoOp): reshape_before_Gemm_12, Tactic: 0x0000000000000000, reshape_before_Gemm_12_copy_input (Float[1,64]) -> reshape_before_Gemm_12_out_tensor (Float[1,64,1,1])
Layer(CaskGemmConvolution): Gemm_12, Tactic: 0x000000000002034f, reshape_before_Gemm_12_out_tensor (Float[1,64,1,1]) -> Gemm_12_out_tensor (Float[1,32,1,1])
Layer(NoOp): reshape_after_Gemm_12, Tactic: 0x0000000000000000, Gemm_12_out_tensor (Float[1,32,1,1]) -> onnx::Gemm_37 (Float[1,32])
Layer(NoOp): reshape_before_Gemm_13, Tactic: 0x0000000000000000, x.1 (Float[12,64]) -> reshape_before_Gemm_13_out_tensor (Float[12,64,1,1])
Layer(CaskGemmConvolution): Gemm_13 || Gemm_14, Tactic: 0x00000000000204df, reshape_before_Gemm_13_out_tensor (Float[12,64,1,1]) -> Gemm_13 || Gemm_14 (Float[12,64,1,1])
Layer(Reformat): reshape_after_Gemm_13_copy_input, Tactic: 0x00000000000003e8, Gemm_13 || Gemm_14 (Float[12,32,1,1]) -> reshape_after_Gemm_13_copy_input (Float[12,32,1,1])
Layer(NoOp): reshape_after_Gemm_13, Tactic: 0x0000000000000000, reshape_after_Gemm_13_copy_input (Float[12,32,1,1]) -> onnx::Gemm_38 (Float[12,32])
Layer(Reformat): reshape_after_Gemm_14_copy_input, Tactic: 0x00000000000003e8, Gemm_13 || Gemm_14 (Float[12,32,1,1]) -> reshape_after_Gemm_14_copy_input (Float[12,32,1,1])
Layer(NoOp): reshape_after_Gemm_14, Tactic: 0x0000000000000000, reshape_after_Gemm_14_copy_input (Float[12,32,1,1]) -> onnx::Gemm_39 (Float[12,32])
Layer(CaskGemmMatrixMultiply): Gemm_15, Tactic: 0x000000000002034f, onnx::Gemm_37 (Float[1,32]), onnx::Gemm_38 (Float[12,32]) -> onnx::Div_40 (Float[1,12])
Layer(PointWiseV2): PWN(onnx::Div_41 + (Unnamed Layer* 33) [Shuffle], Div_17), Tactic: 0x000000000000001c, onnx::Div_40 (Float[1,12]) -> scores (Float[1,12])
Layer(CudaSoftMax): Softmax_18, Tactic: 0x00000000000003e9, scores (Float[1,12]) -> (Unnamed Layer* 36) [Softmax]_output (Float[1,12])
Layer(CaskGemmMatrixMultiply): Gemm_19, Tactic: 0x00000000000203be, (Unnamed Layer* 36) [Softmax]_output (Float[1,12]), onnx::Gemm_39 (Float[12,32]) -> onnx::Gemm_44 (Float[1,32])
Layer(NoOp): reshape_before_Gemm_20, Tactic: 0x0000000000000000, onnx::Gemm_44 (Float[1,32]) -> reshape_before_Gemm_20_out_tensor (Float[1,32,1,1])
Layer(CaskGemmConvolution): Gemm_20, Tactic: 0x000000000002014b, reshape_before_Gemm_20_out_tensor (Float[1,32,1,1]) -> Gemm_20_out_tensor (Float[1,32,1,1])
Layer(CaskGemmConvolution): Gemm_21, Tactic: 0x000000000002014b, Gemm_20_out_tensor (Float[1,32,1,1]) -> Gemm_21_out_tensor (Float[1,30,1,1])
Layer(NoOp): reshape_after_Gemm_21, Tactic: 0x0000000000000000, Gemm_21_out_tensor (Float[1,30,1,1]) -> reg (Float[1,30])
[05/15/2024-20:44:50] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +0, GPU +4, now: CPU 0, GPU 4 (MiB)
[05/15/2024-20:44:50] [E] Saving engine to file failed.
[05/15/2024-20:44:50] [E] Engine set up failed
Please check, and have a nice day.
And if I remove the LayerNorm or BatchNormalization block, I can successfully generate the engine file.
You can try to convert these two modules (the LayerNorm or BatchNormalization block, exported as a subgraph ONNX) separately.
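(A minimal sketch of such an isolation test, assuming the standard torch.onnx exporter; file names and input shapes are hypothetical:)

```python
import torch
import torch.nn as nn

# Export each normalization block alone so trtexec can be pointed at the
# isolated subgraph, e.g.: trtexec --onnx=layernorm_only.onnx --verbose
torch.onnx.export(nn.LayerNorm(64), torch.randn(12, 20, 64),
                  "layernorm_only.onnx",
                  input_names=["x"], output_names=["y"], opset_version=13)

# BatchNorm1d expects (N, C, L); eval mode exports the running statistics.
torch.onnx.export(nn.BatchNorm1d(64).eval(), torch.randn(12, 64, 20),
                  "batchnorm_only.onnx",
                  input_names=["x"], output_names=["y"], opset_version=13)
```

Whichever of the two single-op engines fails to build points at the offending block.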
[05/15/2024-20:44:50] [E] Saving engine to file failed.
No disk space left?
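(For what it's worth, a quick sketch for checking free space on the filesystem where trtexec writes the engine; the path is hypothetical:)

```python
import shutil

# Free space on the filesystem that holds the engine output directory.
total, used, free = shutil.disk_usage("/path/to/engine/dir")  # hypothetical path
print(f"free: {free / 2**20:.0f} MiB")
```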
Hi, thanks for your reply. I tried again with a new .pt and succeeded in creating the engine file. There is one more thing I would like to make clear: as of now, can we not use the LayerNormalization operator on DRIVE Orin unless we write a TensorRT plugin ourselves?
Please check our release notes; I think you need at least TRT 8.6 or 9.0, I can't remember exactly which one.
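(As a sketch, one way to check the installed version and whether the Python bindings expose a native normalization layer; treating add_normalization as the marker for that support is my assumption about newer TRT releases, not something stated in this thread:)

```python
import tensorrt as trt

print("TensorRT version:", trt.__version__)

# Assumption: newer TensorRT releases expose a native normalization layer
# on the network definition as add_normalization.
print("add_normalization available:",
      hasattr(trt.INetworkDefinition, "add_normalization"))
```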
Closing since there has been no activity for more than 3 weeks. Please reopen if you still have questions. Thanks all!