
Unclear issue in loading HWC FP16 TensorRT .plan model

Open nathanjacobiOXOS opened this issue 1 year ago • 14 comments

Description
When loading a TensorRT-converted .plan model that was built in HWC format, I receive the following error:

E0109 18:06:35.616075 66510 model_lifecycle.cc:626] failed to load 'Object_Detection' version 1: Invalid argument: unexpected vectorized dim is -1 for non-linear tensor 'input' for Object_Detection

The .onnx version of this model loads with no issue, but once converted to a TensorRT model, it will not load. The key factor here is that the flag --inputIOFormats="fp32:hwc" was used during TensorRT conversion, as described here.

Triton Information
JetPack 5 release of Triton version 2.35
TensorRT version 8.5.3.1

To Reproduce

Convert a .onnx model (note: the model was trained in PyTorch using HWC, but the conversion to ONNX seems to lose this) to HWC format using TensorRT with the following command, then load it into the Triton server:

../TensorRT/trtexec --onnx=model_fp16.onnx --saveEngine=model_fp16.plan --buildOnly --verbose --device=0 --refit --inputIOFormats=fp32:hwc --fp16

Here is the input specified in the config

input [{
  name: "input"
  data_type: TYPE_FP32
  dims: [1,256,256,3]
}]

I have also tried the following two configurations, with the same error:

input [{
  name: "input"
  data_type: TYPE_FP32
  dims: [-1,256,256,3]
}]

and

input [{
  name: "input"
  data_type: TYPE_FP32
  dims: [1,3,256,256]
}]

Expected behavior
The model loads successfully.

nathanjacobiOXOS avatar Jan 09 '24 18:01 nathanjacobiOXOS

Can you share the max batch size in your model configuration?

From model configuration docs:

Input and output shapes are specified by a combination of max_batch_size and the dimensions specified by the input or output dims property. For models with max_batch_size greater-than 0, the full shape is formed as [ -1 ] + dims. For models with max_batch_size equal to 0, the full shape is formed as dims. For example, for the following configuration the shape of “input0” is [ -1, 16 ] and the shape of “output0” is [ -1, 4 ].

https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/model_configuration.html
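For reference, the example configuration that passage refers to is along these lines (reconstructed from the linked docs page, not quoted above):

max_batch_size: 8
input [
  {
    name: "input0"
    data_type: TYPE_FP32
    dims: [ 16 ]
  }
]
output [
  {
    name: "output0"
    data_type: TYPE_FP32
    dims: [ 4 ]
  }
]

With max_batch_size: 8, the full shape of "input0" becomes [ -1, 16 ] and the full shape of "output0" becomes [ -1, 4 ].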

Tabrizian avatar Jan 10 '24 23:01 Tabrizian

I have mine set as max_batch_size: 0. Any other value yields: Internal: autofill failed for model 'Object_Detection': configuration specified max-batch 1 but TensorRT engine only supports max-batch 0

nathanjacobiOXOS avatar Jan 11 '24 22:01 nathanjacobiOXOS

Hi, following up here @Tabrizian , any ideas on what to try?

nathanjacobiOXOS avatar Jan 17 '24 18:01 nathanjacobiOXOS

Hey there @Tabrizian checking in again, any help would be greatly appreciated

nathanjacobiOXOS avatar Jan 23 '24 20:01 nathanjacobiOXOS

Hi @Tabrizian, it's been about a month with no word, any sort of update would be appreciated

nathanjacobiOXOS avatar Feb 05 '24 20:02 nathanjacobiOXOS

Hi @nathanjacobiOXOS I'm sorry about the delay.

Triton also supports auto-completing the model configuration for TRT models. Can you try running the model without any model configuration and with --log-verbose=1? It will print the model configuration that Triton auto-completes.
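For example, something along these lines (the model repository path is a placeholder):

tritonserver --model-repository=/path/to/model_repository --log-verbose=1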

Tabrizian avatar Feb 07 '24 17:02 Tabrizian

@nathanjacobiOXOS Checking in if Iman's above comment was helpful. Want to keep this issue current so that we can get this resolved.

dyastremsky avatar Feb 20 '24 17:02 dyastremsky

Hi @dyastremsky @Tabrizian

Here is the verbose output:

"name": "Object_Detection",
"platform": "tensorrt_plan",
"backend": "tensorrt",
"version_policy": {
    "latest": {
        "num_versions": 1
    }
},
"max_batch_size": 1,
"input": [
    {
        "name": "input",
        "data_type": "TYPE_FP32",
        "dims": [
            3,
            256,
            256
        ],
        "is_shape_tensor": false
    }
],
"output": [
    {
        "name": "predictions",
        "data_type": "TYPE_FP32",
        "dims": [
            275
        ],
        "is_shape_tensor": false
    }
],
"batch_input": [],
"batch_output": [],
"optimization": {
    "priority": "PRIORITY_DEFAULT",
    "input_pinned_memory": {
        "enable": true
    },
    "output_pinned_memory": {
        "enable": true
    },
    "gather_kernel_buffer_threshold": 0,
    "eager_batching": false
},
"instance_group": [
    {
        "name": "Object_Detection",
        "kind": "KIND_GPU",
        "count": 1,
        "gpus": [
            0
        ],
        "secondary_devices": [],
        "profile": [],
        "passive": false,
        "host_policy": ""
    }
],
"default_model_filename": "model.plan",
"cc_model_filenames": {},
"metric_tags": {},
"parameters": {},
"model_warmup": []

}

I0221 16:47:44.351299 34238 model_state.cc:272] model configuration: { "name": "Object_Detection", "platform": "tensorrt_plan", "backend": "tensorrt", "version_policy": { "latest": { "num_versions": 1 } }, "max_batch_size": 1, "input": [ { "name": "input", "data_type": "TYPE_FP32", "format": "FORMAT_NONE", "dims": [ 3, 256, 256 ], "is_shape_tensor": false, "allow_ragged_batch": false, "optional": false } ], "output": [ { "name": "predictions", "data_type": "TYPE_FP32", "dims": [ 275 ], "label_filename": "", "is_shape_tensor": false } ], "batch_input": [], "batch_output": [], "optimization": { "priority": "PRIORITY_DEFAULT", "input_pinned_memory": { "enable": true }, "output_pinned_memory": { "enable": true }, "gather_kernel_buffer_threshold": 0, "eager_batching": false }, "instance_group": [ { "name": "Object_Detection", "kind": "KIND_GPU", "count": 1, "gpus": [ 0 ], "secondary_devices": [], "profile": [], "passive": false, "host_policy": "" } ], "default_model_filename": "model.plan", "cc_model_filenames": {}, "metric_tags": {}, "parameters": {}, "model_warmup": [] }

nathanjacobiOXOS avatar Feb 21 '24 16:02 nathanjacobiOXOS

And the result:

I0221 16:47:44.410786 34238 server.cc:673]
+------------------+---------+-------------------------------------------------------------------------------------------------------------------+
| Model            | Version | Status                                                                                                            |
+------------------+---------+-------------------------------------------------------------------------------------------------------------------+
| Object_Detection | 1       | UNAVAILABLE: Invalid argument: unexpected vectorized dim is -1 for non-linear tensor 'input' for Object_Detection |
+------------------+---------+-------------------------------------------------------------------------------------------------------------------+

nathanjacobiOXOS avatar Feb 21 '24 16:02 nathanjacobiOXOS

In the meantime, I have been exploring a custom-built TensorRT inference engine, and have had no issues deserializing this exact same model.plan and feeding it the HWC-formatted data.
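For reference, a minimal sketch of that kind of standalone check, assuming the TensorRT 8.5 binding-index API (the file name and Logger class here are placeholders, not the actual harness):

// Standalone check: deserialize the trtexec-produced plan and inspect the
// input binding's format and vectorized dim, outside of Triton.
#include <NvInfer.h>
#include <fstream>
#include <iostream>
#include <iterator>
#include <vector>

class Logger : public nvinfer1::ILogger {
  void log(Severity severity, const char* msg) noexcept override {
    if (severity <= Severity::kWARNING) std::cerr << msg << std::endl;
  }
};

int main() {
  // Read the serialized engine from disk.
  std::ifstream file("model_fp16.plan", std::ios::binary);
  std::vector<char> blob((std::istreambuf_iterator<char>(file)),
                         std::istreambuf_iterator<char>());

  Logger logger;
  nvinfer1::IRuntime* runtime = nvinfer1::createInferRuntime(logger);
  nvinfer1::ICudaEngine* engine =
      runtime->deserializeCudaEngine(blob.data(), blob.size());
  if (engine == nullptr) {
    std::cerr << "deserialization failed" << std::endl;
    return 1;
  }

  // kHWC is non-linear but non-vectorized, so getBindingVectorizedDim()
  // is documented to return -1 for it.
  const int idx = engine->getBindingIndex("input");
  const bool is_hwc =
      engine->getBindingFormat(idx) == nvinfer1::TensorFormat::kHWC;
  std::cout << "kHWC: " << is_hwc
            << ", vectorized dim: " << engine->getBindingVectorizedDim(idx)
            << std::endl;
  return 0;
}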

nathanjacobiOXOS avatar Feb 21 '24 16:02 nathanjacobiOXOS

Here is this output as well, in case it is helpful:

I0221 16:47:42.293758 34238 model_config_utils.cc:1839] ModelConfig 64-bit fields:
I0221 16:47:42.293848 34238 model_config_utils.cc:1841] ModelConfig::dynamic_batching::default_queue_policy::default_timeout_microseconds
I0221 16:47:42.293896 34238 model_config_utils.cc:1841] ModelConfig::dynamic_batching::max_queue_delay_microseconds
I0221 16:47:42.293938 34238 model_config_utils.cc:1841] ModelConfig::dynamic_batching::priority_queue_policy::value::default_timeout_microseconds
I0221 16:47:42.293981 34238 model_config_utils.cc:1841] ModelConfig::ensemble_scheduling::step::model_version
I0221 16:47:42.294038 34238 model_config_utils.cc:1841] ModelConfig::input::dims
I0221 16:47:42.294083 34238 model_config_utils.cc:1841] ModelConfig::input::reshape::shape
I0221 16:47:42.294123 34238 model_config_utils.cc:1841] ModelConfig::instance_group::secondary_devices::device_id
I0221 16:47:42.294166 34238 model_config_utils.cc:1841] ModelConfig::model_warmup::inputs::value::dims
I0221 16:47:42.294210 34238 model_config_utils.cc:1841] ModelConfig::optimization::cuda::graph_spec::graph_lower_bound::input::value::dim
I0221 16:47:42.294259 34238 model_config_utils.cc:1841] ModelConfig::optimization::cuda::graph_spec::input::value::dim
I0221 16:47:42.294305 34238 model_config_utils.cc:1841] ModelConfig::output::dims
I0221 16:47:42.294346 34238 model_config_utils.cc:1841] ModelConfig::output::reshape::shape
I0221 16:47:42.294386 34238 model_config_utils.cc:1841] ModelConfig::sequence_batching::direct::max_queue_delay_microseconds
I0221 16:47:42.294496 34238 model_config_utils.cc:1841] ModelConfig::sequence_batching::max_sequence_idle_microseconds
I0221 16:47:42.294541 34238 model_config_utils.cc:1841] ModelConfig::sequence_batching::oldest::max_queue_delay_microseconds
I0221 16:47:42.294581 34238 model_config_utils.cc:1841] ModelConfig::sequence_batching::state::dims
I0221 16:47:42.294629 34238 model_config_utils.cc:1841] ModelConfig::sequence_batching::state::initial_state::dims
I0221 16:47:42.294671 34238 model_config_utils.cc:1841] ModelConfig::version_policy::specific::versions
I0221 16:47:42.295053 34238 model_state.cc:308] Setting the CUDA device to GPU0 to auto-complete config for Object_Detection
I0221 16:47:42.295289 34238 model_state.cc:354] Using explicit serialized file 'model.plan' to auto-complete config for Object_Detection

nathanjacobiOXOS avatar Feb 21 '24 16:02 nathanjacobiOXOS

Hi there, the issue stems from this location in tensorrt_backend. I have confirmed that the TensorRT plan is in kHWC format: instance_->engine_->getBindingFormat(binding_index) == nvinfer1::TensorFormat::kHWC returns true.

I have not been able to debug this today but I will continue to investigate and try to solve this tomorrow, unless you guys can get around to it first. Any input would be helpful, thanks.

@Tabrizian @dyastremsky

nathanjacobiOXOS avatar Feb 21 '24 21:02 nathanjacobiOXOS

There seems to be an inconsistency between the handling of kHWC and the nvinfer documentation, which describes kHWC as a "Non-vectorized channel-last format."

This line appears to assume that any non-linear format is vectorized. The call getBindingVectorizedDim returns -1 for this model, which should be the expected behavior given that this format is strictly non-vectorized.

From the documentation of nvinfer1::ICudaEngine::getBindingVectorizedDim: "Return the dimension index that the buffer is vectorized, or -1 if the name is not found. Specifically, -1 is returned if scalars per vector is 1."

Is -1 not then the expected behavior for this sort of model?
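To make the distinction concrete, here is a sketch (my own illustration, not the backend's actual code) of a check that treats kHWC as legitimately non-vectorized:

#include <NvInfer.h>

// kLINEAR and kHWC are non-vectorized formats, so a vectorized dim of -1 is
// the documented, expected value for them; truly vectorized formats such as
// kCHW4 or kHWC8 should report a real dimension index.
bool VectorizedDimIsConsistent(const nvinfer1::ICudaEngine& engine,
                               int binding_index) {
  const nvinfer1::TensorFormat fmt = engine.getBindingFormat(binding_index);
  const int vec_dim = engine.getBindingVectorizedDim(binding_index);
  if (fmt == nvinfer1::TensorFormat::kLINEAR ||
      fmt == nvinfer1::TensorFormat::kHWC) {
    return vec_dim == -1;
  }
  return vec_dim >= 0;
}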

Additionally, I am finding a lot of issues with the batching: the value of max_batch_size always seems to be restricted to one, regardless of what value I set within the config.pbtxt. Could you please explain in a little more detail how I should approach max_batch_size for a TensorRT model instance?

For example, I've been getting messages like this quite a bit during debugging:

failed to load 'Object_Detection' version 1: Invalid argument: model 'Object_Detection', tensor 'input': the model expects 4 dimensions (shape [1,3,256,256]) but the model configuration specifies 5 dimensions (an initial batch dimension because max_batch_size > 0 followed by the explicit tensor shape, making complete shape [-1,1,3,256,256])

But my config has max_batch_size: 0

Thanks :)

nathanjacobiOXOS avatar Feb 22 '24 18:02 nathanjacobiOXOS

I'm unfamiliar with the code, but to me it appears that a -1 is unconditionally prepended to the full_dims of the TensorRT plan here.

nathanjacobiOXOS avatar Feb 22 '24 19:02 nathanjacobiOXOS

Hi @Tabrizian @dyastremsky, any updates from your end?

nathanjacobiOXOS avatar Mar 05 '24 15:03 nathanjacobiOXOS

Can you start Triton without autocomplete (with the --disable-auto-complete-config flag)? That will give you more control for testing. I also believe autocomplete's current implementation may have difficulty distinguishing an explicit zero batch size from no value at all, which would explain why it overwrites your 0 with 1 when it detects that the model supports batching. Perhaps that is related to the other issue.

If the above is not sufficient, you could also try running a more recent version of Triton, in case any patches in the eight months since that release have resolved this. I would be curious to see the results of disabling autocomplete, though.
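For example, something along these lines (the repository path is a placeholder):

tritonserver --model-repository=/path/to/model_repository --disable-auto-complete-config --log-verbose=1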

dyastremsky avatar Mar 05 '24 17:03 dyastremsky