TensorRT 10 slower than TensorRT 8.6 for models with Instance Normalization layers

Open david-PHR opened this issue 1 year ago • 4 comments

Description

After migrating my backend to TensorRT 10, I've noticed that some models run slower than they did with TensorRT 8.6. The regression appears to come from how some InstanceNormalization layers are now mapped: they no longer use the InstanceNormalization plugin.

Here are TensorRT's verbose parser logs for one such layer, before and after the migration:

With TensorRT 8.6:

```
[06/24/2024-10:18:03] [V] [TRT] Parsing node: /encoder/0/resnets.0/norm1/InstanceNormalization [InstanceNormalization]
[06/24/2024-10:18:03] [V] [TRT] Searching for input: /encoder/0/resnets.0/norm1/Reshape_output_0
[06/24/2024-10:18:03] [V] [TRT] Searching for input: /encoder/0/resnets.0/norm1/Constant_1_output_0
[06/24/2024-10:18:03] [V] [TRT] Searching for input: /encoder/0/resnets.0/norm1/Constant_2_output_0
[06/24/2024-10:18:03] [V] [TRT] /encoder/0/resnets.0/norm1/InstanceNormalization [InstanceNormalization] inputs: [/encoder/0/resnets.0/norm1/Reshape_output_0 -> (1, 32, -1)[FLOAT]], [/encoder/0/resnets.0/norm1/Constant_1_output_0 -> (32)[FLOAT]], [/encoder/0/resnets.0/norm1/Constant_2_output_0 -> (32)[FLOAT]],
[06/24/2024-10:18:03] [V] [TRT] Original shape: (1, 32, _), unsqueezing to: (_, _, _, _)
[06/24/2024-10:18:03] [V] [TRT] Local registry did not find InstanceNormalization_TRT creator. Will try parent registry if enabled.
[06/24/2024-10:18:03] [V] [TRT] Global registry found InstanceNormalization_TRT creator.
[06/24/2024-10:18:03] [V] [TRT] Registering layer: /encoder/0/resnets.0/norm1/InstanceNormalization for ONNX node: /encoder/0/resnets.0/norm1/InstanceNormalization
[06/24/2024-10:18:03] [V] [TRT] Original shape: (1, 32, _, 1), squeezing to: (_, _, _)
[06/24/2024-10:18:03] [V] [TRT] Registering tensor: /encoder/0/resnets.0/norm1/InstanceNormalization_output_0 for ONNX tensor: /encoder/0/resnets.0/norm1/InstanceNormalization_output_0
[06/24/2024-10:18:03] [V] [TRT] /encoder/0/resnets.0/norm1/InstanceNormalization [InstanceNormalization] outputs: [/encoder/0/resnets.0/norm1/InstanceNormalization_output_0 -> (1, 32, -1)[FLOAT]],
```

With TensorRT 10:

```
[06/24/2024-10:15:27] [V] [TRT] /encoder/0/resnets.0/norm1/InstanceNormalization [InstanceNormalization] inputs: [/encoder/0/resnets.0/norm1/Reshape_output_0 -> (1, 32, -1)[FLOAT]], [/encoder/0/resnets.0/norm1/Constant_1_output_0 -> (32)[FLOAT]], [/encoder/0/resnets.0/norm1/Constant_2_output_0 -> (32)[FLOAT]],
[06/24/2024-10:15:27] [V] [TRT] Registering layer: /encoder/0/resnets.0/norm1/Constant_1_output_0 required by ONNX-TRT
[06/24/2024-10:15:27] [V] [TRT] Registering layer: /encoder/0/resnets.0/norm1/Constant_2_output_0 required by ONNX-TRT
[06/24/2024-10:15:27] [V] [TRT] Original shape: (32,), unsqueezing to: (1, 32, 1)
[06/24/2024-10:15:27] [V] [TRT] Registering layer: ONNXTRT_ShapeShuffle_0 required by ONNX-TRT
[06/24/2024-10:15:27] [V] [TRT] Registering layer: ONNXTRT_unsqueezeTensor required by ONNX-TRT
[06/24/2024-10:15:27] [V] [TRT] Original shape: (32,), unsqueezing to: (1, 32, 1)
[06/24/2024-10:15:27] [V] [TRT] Registering layer: ONNXTRT_ShapeShuffle_1 required by ONNX-TRT
[06/24/2024-10:15:27] [V] [TRT] Registering layer: ONNXTRT_unsqueezeTensor_2 required by ONNX-TRT
[06/24/2024-10:15:27] [V] [TRT] Registering layer: /encoder/0/resnets.0/norm1/InstanceNormalization for ONNX node: /encoder/0/resnets.0/norm1/InstanceNormalization
[06/24/2024-10:15:27] [V] [TRT] Registering tensor: /encoder/0/resnets.0/norm1/InstanceNormalization_output_0 for ONNX tensor: /encoder/0/resnets.0/norm1/InstanceNormalization_output_0
[06/24/2024-10:15:27] [V] [TRT] /encoder/0/resnets.0/norm1/InstanceNormalization [InstanceNormalization] outputs: [/encoder/0/resnets.0/norm1/InstanceNormalization_output_0 -> (1, 32, -1)[FLOAT]],
```

Any ideas on how to bring back the InstanceNormalization plugin so that I can recover the expected performance?
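For anyone looking for a workaround in the meantime: the trtexec flag for this is `--pluginInstanceNorm` (discussed further down), and the ONNX parser appears to expose the same switch as the `NATIVE_INSTANCENORM` flag. Below is a minimal, untested sketch using the Python API, assuming the TensorRT 10 bindings expose `OnnxParserFlag.NATIVE_INSTANCENORM` together with `parser.get_flag`/`parser.clear_flag`; `model.onnx` is a placeholder path.

```python
import tensorrt as trt

# Sketch (untested): clear NATIVE_INSTANCENORM so the parser maps
# InstanceNormalization nodes to the InstanceNormalization_TRT plugin
# instead of the native decomposition.
logger = trt.Logger(trt.Logger.VERBOSE)
trt.init_libnvinfer_plugins(logger, "")  # register InstanceNormalization_TRT

builder = trt.Builder(logger)
network = builder.create_network(0)
parser = trt.OnnxParser(network, logger)
if parser.get_flag(trt.OnnxParserFlag.NATIVE_INSTANCENORM):
    parser.clear_flag(trt.OnnxParserFlag.NATIVE_INSTANCENORM)

with open("model.onnx", "rb") as f:  # placeholder path
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))
```

Note that the plugin path seems to need cuDNN at runtime, which is exactly the `libcudnn.so.8` issue described later in this thread.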

Environment

TensorRT Version:

NVIDIA GPU:

NVIDIA Driver Version:

CUDA Version:

CUDNN Version:

Operating System:

Python Version (if applicable):

Tensorflow Version (if applicable):

PyTorch Version (if applicable):

Baremetal or Container (if so, version):

Relevant Files

Model link:

Steps To Reproduce

Commands or scripts:

Have you tried the latest release?:

Can this model run on other frameworks? For example run ONNX model with ONNXRuntime (polygraphy run <model.onnx> --onnxrt):

david-PHR · Jun 24 '24 10:06

If you use the trtexec tool, please add `--builderOptimizationLevel=5`.
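For builds done through the Python API rather than trtexec, the equivalent knob should be the builder config's optimization level; a minimal sketch (network construction elided):

```python
import tensorrt as trt

# Sketch: API equivalent of trtexec --builderOptimizationLevel=5.
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(0)
# ... populate `network`, e.g. with trt.OnnxParser ...

config = builder.create_builder_config()
config.builder_optimization_level = 5  # 0 = fastest build, 5 = broadest tactic search
engine_bytes = builder.build_serialized_network(network, config)
```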

lix19937 · Jun 24 '24 11:06

Adding `--builderOptimizationLevel=5` produces these errors at inference time:

```
assert self.context.execute_v2(bindings=bindings), "failure during execution of inference"
AssertionError: failure during execution of inference
[06/24/2024-13:04:14] [TRT] [E] 1: [graphContext.h::~MyelinGraphContext::72] Error Code 1: Myelin ([impl.cpp:cuda_object_deallocate:474] Error 716 destroying stream '0x5f2fc7bc29b0'.)
[06/24/2024-13:04:14] [TRT] [E] 1: [graphContext.h::~MyelinGraphContext::72] Error Code 1: Myelin ([impl.cpp:cuda_object_deallocate:474] Error 716 destroying stream '0x5f2fc7d114c0'.)
[06/24/2024-13:04:14] [TRT] [E] 1: [graphContext.h::~MyelinGraphContext::72] Error Code 1: Myelin ([impl.cpp:cuda_object_deallocate:474] Error 716 destroying stream '0x5f2fc7d8e390'.)
[06/24/2024-13:04:14] [TRT] [E] 1: [graphContext.h::~MyelinGraphContext::72] Error Code 1: Myelin ([impl.cpp:cuda_object_deallocate:474] Error 716 destroying stream '0x5f2fc7deb0b0'.)
[06/24/2024-13:04:14] [TRT] [E] 1: [graphContext.h::~MyelinGraphContext::72] Error Code 1: Myelin ([impl.cpp:cuda_object_deallocate:474] Error 716 destroying stream '0x5f2fc7e540b0'.)
[06/24/2024-13:04:14] [TRT] [E] 1: [graphContext.h::~MyelinGraphContext::72] Error Code 1: Myelin ([impl.cpp:cuda_object_deallocate:474] Error 716 destroying stream '0x5f2fc7eb1720'.)
[06/24/2024-13:04:14] [TRT] [E] 1: [graphContext.h::~MyelinGraphContext::72] Error Code 1: Myelin ([impl.cpp:cuda_object_deallocate:474] Error 716 destroying stream '0x5f2fc7f0f140'.)
[06/24/2024-13:04:14] [TRT] [E] 1: [graphContext.h::~MyelinGraphContext::72] Error Code 1: Myelin ([impl.cpp:cuda_object_deallocate:474] Error 716 destroying stream '0x5f2fc7f6cc10'.)
[06/24/2024-13:04:14] [TRT] [E] 1: [graphContext.h::~MyelinGraphContext::72] Error Code 1: Myelin ([impl.cpp:cuda_object_deallocate:474] Error 716 destroying stream '0x5f2fc7f7a410'.)
[06/24/2024-13:04:14] [TRT] [E] 1: [graphContext.h::~MyelinGraphContext::72] Error Code 1: Myelin ([impl.cpp:cuda_object_deallocate:474] Error 716 destroying stream '0x5f2fc80437b0'.)
[06/24/2024-13:04:14] [TRT] [E] 1: [graphContext.h::~MyelinGraphContext::72] Error Code 1: Myelin ([impl.cpp:cuda_object_deallocate:474] Error 716 destroying stream '0x5f2fc80abde0'.)
[06/24/2024-13:04:14] [TRT] [E] 1: [graphContext.h::~MyelinGraphContext::72] Error Code 1: Myelin ([impl.cpp:cuda_object_deallocate:474] Error 716 destroying stream '0x5f2fc8114220'.)
[06/24/2024-13:04:14] [TRT] [E] 1: [graphContext.h::~MyelinGraphContext::72] Error Code 1: Myelin ([impl.cpp:cuda_object_deallocate:474] Error 716 destroying stream '0x5f2fc8185360'.)
[06/24/2024-13:04:14] [TRT] [E] 1: [multiStreamContext.cpp::maybeDestroyAuxStream::263] Error Code 1: Cuda Runtime (misaligned address)
[06/24/2024-13:04:14] [TRT] [E] 1: [defaultAllocator.cpp::deallocate::52] Error Code 1: Cuda Runtime (misaligned address)
[06/24/2024-13:04:14] [TRT] [E] 1: [defaultAllocator.cpp::deallocate::52] Error Code 1: Cuda Runtime (misaligned address)
```

However, with level 4 it seems better. Note that I've also found the trtexec argument `--pluginInstanceNorm`. There is an issue when running a model compiled that way, with the plugin InstanceNorm, in the nvcr.io/nvidia/pytorch:24.05-py3 container: from these lines and these lines, the expected cuDNN major version is 8, but this image ships with cuDNN major version 9 pre-installed. I worked around it with this ugly symlink, which is par for the course with TensorRT: `ln -s /usr/lib/x86_64-linux-gnu/libcudnn.so.9 /usr/lib/x86_64-linux-gnu/libcudnn.so.8`. Is there any restriction on using cuDNN 9 here? Everything looks to be working well.
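As a sanity check after relinking, a small sketch for confirming which cuDNN the process actually resolves through the `libcudnn.so.8` soname (note that cuDNN 9 encodes its version differently from cuDNN 8):

```python
import ctypes

# Sketch: load the soname the plugin requests and print the version it
# resolves to. With the symlink in place this should be the cuDNN 9 build
# shipped in the container.
cudnn = ctypes.CDLL("libcudnn.so.8")
cudnn.cudnnGetVersion.restype = ctypes.c_size_t
print(cudnn.cudnnGetVersion())  # cuDNN 9: major*10000 + minor*100 + patch, e.g. 90100
```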

david-PHR · Jun 26 '24 10:06

I forgot to share this line too, which magically sets the CUDNN_MAJOR version in the cuDNN wrapper code.

david-PHR · Jun 26 '24 10:06

Yes, you can relink to cuDNN.

> Adding `--builderOptimizationLevel=5` produces these errors at inference time:

Can you upload the full log here? @david-PHR

lix19937 · Jun 26 '24 10:06

@david-PHR is this still reproducible on TensorRT 10.8?

brnguyen2 · Feb 11 '25 16:02

Closing due to inactivity. Please feel free to reopen!

poweiw · May 29 '25 21:05