
Initialization failure of TensorRT 8.5.1.7 when running bcdu model on GPU A5000

Open d5423197 opened this issue 1 year ago • 16 comments

Description

I tried to run an ONNX model through onnxruntime with the TensorrtExecutionProvider, but initialization failed.

Error msg:

2024-09-09 10:58:29.082851313 [E:onnxruntime:Default, tensorrt_execution_provider.h:58 log] [2024-09-09 02:58:29 ERROR] [concatenationLayer.cpp::estimateOutputDims::110] Error Code 4: Internal Error ((Unnamed Layer* 73) [Concatenation]: all concat input tensors must have the same dimensions except on the concatenation axis (1), but dimensions mismatched at index 0. Input 0 shape: [2,64,64,256], Input 1 shape: [0,64,64,256])

Environment

TensorRT Version: TensorRT 8.5.1.7

NVIDIA GPU: A5000

NVIDIA Driver Version: 11.4

CUDA Version: 11.4

CUDNN Version:

Operating System:

Python Version (if applicable): 3.8.0

Tensorflow Version (if applicable): 2.8.0

PyTorch Version (if applicable): N/A

Baremetal or Container (if so, version): N/A

Relevant Files

Model link: https://github.com/rezazad68/BCDU-Net/blob/master/Lung%20Segmentation/models.py

Steps To Reproduce

  1. Create the tf model
  2. Convert using tf2onnx
  3. Initialize using onnxruntime with TensorrtExecutionProvider backend
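Step 3 can be sketched as follows. This is a hedged illustration, not the reporter's exact code: the `model.onnx` path and the `init_session` helper are placeholders, and the provider list shows the usual TensorRT-first fallback ordering.

```python
# Provider priority: try TensorRT first, then fall back to CUDA, then CPU.
providers = [
    "TensorrtExecutionProvider",
    "CUDAExecutionProvider",
    "CPUExecutionProvider",
]

def init_session(onnx_path="model.onnx"):
    # Deferred import so the sketch is readable without onnxruntime-gpu installed.
    import onnxruntime as ort  # onnxruntime-gpu==1.12.0 in this report

    # The reported Concatenation error surfaces here, while the TensorRT EP
    # builds its engine during session initialization.
    return ort.InferenceSession(onnx_path, providers=providers)
```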

d5423197 avatar Sep 09 '24 03:09 d5423197

Btw, I have confirmed this issue is related to the ConvLSTM2D layer. I tested this: if I build the model with only the layers before ConvLSTM2D, it initializes successfully; once the ConvLSTM2D layer is added, initialization fails.
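A minimal repro along the lines of that bisection might look like the sketch below. The layer sizes are placeholders (not the real BCDU-Net D3 architecture); the only unusual op is ConvLSTM2D, which tf2onnx has to lower to a loop/concat subgraph since ONNX has no ConvLSTM operator, and a concat of that kind plausibly matches the reported Concatenation error.

```python
# Hypothetical minimal repro: a tiny Keras model whose only unusual op is
# ConvLSTM2D, mirroring the reporter's bisection result.
def make_convlstm_model():
    # Deferred import so the sketch is readable without TF installed.
    import tensorflow as tf  # TF 2.8.0 in this report

    inp = tf.keras.Input(shape=(256, 256, 3))
    x = tf.keras.layers.Conv2D(8, 3, padding="same")(inp)       # (256, 256, 8)
    # Stack into a 2-step sequence for the recurrent layer.
    x = tf.keras.layers.Reshape((2, 256, 256, 4))(x)            # (2, 256, 256, 4)
    x = tf.keras.layers.ConvLSTM2D(4, 3, padding="same")(x)     # (256, 256, 4)
    out = tf.keras.layers.Conv2D(1, 1)(x)
    return tf.keras.Model(inp, out)
```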

d5423197 avatar Sep 09 '24 03:09 d5423197

import tensorflow as tf
import tf2onnx
import models as M  # BCDU-Net's Lung Segmentation models.py

input_shape = (256, 256, 3)  # matches the TensorSpec below
out_path = "model.onnx"      # placeholder output path

model = M.BCDU_net_D3(input_size=input_shape, traning=False)
spec = (tf.TensorSpec((1, 256, 256, 3), tf.float32, name="input"),)
model_proto, _ = tf2onnx.convert.from_keras(model, input_signature=spec, opset=13, output_path=out_path)

This is the code used to produce the ONNX model. onnxruntime version: onnxruntime-gpu==1.12.0

d5423197 avatar Sep 09 '24 03:09 d5423197

Thanks for the updated ticket info. Could you mention your OS version just for reference? Also, have you tried running the onnx with https://github.com/NVIDIA/TensorRT/tree/main/samples/trtexec rather than ORT?

moraxu avatar Sep 09 '24 18:09 moraxu

Hi @moraxu ,

No I have not tried trtexec. I am a python user.

OS version: Ubuntu 20.04

d5423197 avatar Sep 10 '24 00:09 d5423197

Oh, it's just the executable that's called like that, it can be run on Linux. As was mentioned in https://github.com/NVIDIA/TensorRT/issues/4109#issuecomment-2335112830, we'd like to be sure the issue can be isolated to TRT itself, rather than ORT. Do you have access to the instructions here: https://github.com/NVIDIA/TensorRT/tree/main/samples/trtexec ?

Can you run it like this on your model: ./trtexec --onnx=model.onnx to confirm the issue persists? I'll file an internal bug then.

moraxu avatar Sep 10 '24 00:09 moraxu

@moraxu I installed tensorrt using pip (following the instructions in the official README). I tried to build the engine using TensorRT alone and got the same error. Please check it.

Do you mean the pip version of tensorrt is different from the executable trtexec?

d5423197 avatar Sep 10 '24 07:09 d5423197

Thanks, to clarify, trtexec is a standalone binary tool included with the TRT SDK (typically available when you install TRT using the tar or deb packages from NVIDIA). It helps with quick model conversion and testing, but it's separate from the pip version.

The version of TRT installed via pip should be the same as the version of trtexec, assuming they're from the same release, so the issue might be with TRT itself.

I tried to build it using only tensorrt.

Could you paste the full Python snippet here, on how you invoke the builder etc.? Apologies for the questions, I'd need that to file the bug.

moraxu avatar Sep 10 '24 18:09 moraxu

import engine as eng  # local helper module wrapping the TRT builder
from onnx import ModelProto
import tensorrt as trt

engine_name = "test_cseg"
onnx_path = "weights.120-0.12_fix_sim.onnx"
batch_size = 1

# Read the input dimensions straight from the ONNX graph
model = ModelProto()
with open(onnx_path, "rb") as f:
    model.ParseFromString(f.read())

d0 = model.graph.input[0].type.tensor_type.shape.dim[1].dim_value
d1 = model.graph.input[0].type.tensor_type.shape.dim[2].dim_value
d2 = model.graph.input[0].type.tensor_type.shape.dim[3].dim_value
shape = [batch_size, d0, d1, d2]

engine = eng.build_engine(onnx_path, shape=shape)
eng.save_engine(engine, engine_name)
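For reference, a helper like `eng.build_engine` typically wraps the TensorRT Python API along these lines. This is a hedged sketch under the assumption of TRT 8.x and an explicit-batch network; the actual `engine` module was not posted, so names and defaults here are illustrative.

```python
# Hypothetical sketch of an `eng.build_engine`-style helper (TRT 8.x API).
def build_engine(onnx_path, shape, workspace_gib=1):
    # Deferred import so the sketch is readable without TensorRT installed.
    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, logger)

    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            # Surface parser errors instead of failing silently.
            errors = [str(parser.get_error(i)) for i in range(parser.num_errors)]
            raise RuntimeError("ONNX parse failed:\n" + "\n".join(errors))

    config = builder.create_builder_config()
    config.max_workspace_size = workspace_gib << 30  # TRT 8.x; removed in 10.x
    network.get_input(0).shape = shape

    # The reported Concatenation dimension-mismatch error is raised during
    # this build step.
    return builder.build_engine(network, config)
```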

@moraxu

d5423197 avatar Sep 11 '24 02:09 d5423197

Thank you, I've filed an internal bug; I'll let you know if more info is needed.

moraxu avatar Sep 11 '24 18:09 moraxu

@d5423197 I was asked if you can try to run the model with a newer 10.x TRT version?

moraxu avatar Sep 18 '24 16:09 moraxu

This is a very obvious problem. This bug is related to the TensorFlow ConvLSTM2D layer. Don't they know whether they have made this layer compatible? @moraxu

d5423197 avatar Sep 26 '24 01:09 d5423197

@d5423197 but are you able to run this with a newer 10.x TRT version, or are you strictly limited to 8.5.1.7?

moraxu avatar Sep 26 '24 17:09 moraxu

@moraxu For now, I am strictly limited to 8.5.1.7.

d5423197 avatar Sep 27 '24 05:09 d5423197

I see. The issue has been fixed in the upcoming 10.6 release, though.

moraxu avatar Oct 15 '24 18:10 moraxu

@moraxu Thanks, may I ask about the specific cause of this problem?

d5423197 avatar Oct 16 '24 03:10 d5423197

A small issue in our vectorizer within our backend graph compiler.

moraxu avatar Oct 16 '24 05:10 moraxu