Failed to set cuda graph shape when I set max_batch_size==0
Description
CUDA graph setup fails when I set max_batch_size to 0.
Triton Information
What version of Triton are you using? 22.04
Are you using the Triton container or did you build it yourself? nvcr.io/nvidia/tritonserver:22.04-py3
To Reproduce
Model: I used the PyTorch pretrained ResNet18 model and converted it to an ONNX model:
import argparse

import torch
import torchvision.models as models

parser = argparse.ArgumentParser()
parser.add_argument("--output_model", type=str, required=True, help="model output path")

def main():
    args = parser.parse_args()
    output_model_path = args.output_model
    model = models.resnet18(pretrained=True)  # load pretrained weights, as described above
    model = model.to('cuda:0')
    model.eval()
    # Dummy input used to trace the model for export
    x = torch.ones(1, 3, 224, 224).to('cuda:0')
    torch.onnx.export(
        model=model,
        args=x,
        f=output_model_path,
        opset_version=11,
        export_params=True,
        do_constant_folding=True,
        input_names=['INPUT__0'],
        output_names=['OUTPUT__0'],
        # mark the batch axis as dynamic on both tensors
        dynamic_axes={'INPUT__0': {0: 'bs'}, 'OUTPUT__0': {0: 'bs'}}
    )

if __name__ == '__main__':
    main()
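Assuming the script above is saved as export_resnet18.py (a hypothetical filename), it can be run as:

python export_resnet18.py --output_model resnet18.onnx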
Then I converted it to a TensorRT plan file:
trtexec --onnx=resnet18.onnx --explicitBatch --optShapes=INPUT__0:5x3x224x224 --buildOnly --saveEngine=resnet18.plan --workspace=12288 --device=1
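Note: if I understand trtexec correctly, when only --optShapes is given, the min and max shapes default to the same values, so the resulting engine expects exactly 5x3x224x224 despite the dynamic ONNX batch axis. An equivalent explicit form would be:

trtexec --onnx=resnet18.onnx --explicitBatch --minShapes=INPUT__0:5x3x224x224 --optShapes=INPUT__0:5x3x224x224 --maxShapes=INPUT__0:5x3x224x224 --buildOnly --saveEngine=resnet18.plan --workspace=12288 --device=1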
Tritonserver config
My tritonserver config is as follows:
platform: "tensorrt_plan"
max_batch_size : 0
input: [
{
name: "INPUT__0",
data_type: TYPE_FP32,
dims: [5, 3, 224, 224],
}
],
instance_group [
{
count: 1
kind: KIND_GPU
gpus: [0]
}
]
optimization{
graph: {
level : 1
},
eager_batching : 1,
cuda: {
graphs:1,
graph_spec: [
{
input: {
key: "INPUT__0",
value: {dim:[5, 3, 224, 224]}
}
}
],
busy_wait_events:1,
output_copy_stream: 1
}
}
Result: when I run tritonserver, I get the following error:
I0823 14:33:25.438942 8964 tensorrt.cc:3193] Detected INPUT__0 as execution binding for resnet_0
I0823 14:33:25.438952 8964 tensorrt.cc:3193] Detected OUTPUT__0 as execution binding for resnet_0
E0823 14:33:25.454761 8964 logging.cc:43] 3: [executionContext.cpp::setBindingDimensions::945] Error Code 3: API Usage Error (Parameter check failed at: runtime/api/executionContext.cpp::setBindingDimensions::945, condition: engineDims.nbDims == dimensions.nbDims
)
E0823 14:33:25.454795 8964 tensorrt.cc:5090] Failed to set cuda graph shape for resnet_0trt failed to set binding dimension to [1,5,3,224,224] for binding 0 for resnet_0
I0823 14:33:25.454810 8964 tensorrt.cc:1426] Created instance resnet_0 on GPU 0 with stream priority 0 and optimization profile default[0];
I0823 14:33:25.454952 8964 backend_model_instance.cc:734] Starting backend thread for resnet_0 at nice 0 on device 0...
I0823 14:33:25.455107 8964 model_repository_manager.cc:1352] successfully loaded 'resnet' version 1
The backend adds a batch dimension, turning the original 5x3x224x224 shape into 1x5x3x224x224. I found that the TensorRT backend prepends this dimension when max_batch_size is equal to 0: https://github.com/triton-inference-server/tensorrt_backend/blob/main/src/tensorrt.cc#L5348. I also tried setting the batch size in the CUDA graph spec to 5, but then the configuration fails validation. That configuration is as follows (a small sketch of the mismatch follows the config):
platform: "tensorrt_plan"
max_batch_size : 0
input: [
{
name: "INPUT__0",
data_type: TYPE_FP32,
dims: [5, 3, 224, 224],
}
],
instance_group [
{
count: 1
kind: KIND_GPU
gpus: [0]
}
]
optimization{
graph: {
level : 1
},
eager_batching : 1,
cuda: {
graphs:1,
graph_spec: [
{
batch_size:5,
input: {
key: "INPUT__0",
value: {dim:[3, 224, 224]}
}
}
],
busy_wait_events:1,
output_copy_stream: 1
}
}
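For clarity, here is a rough Python sketch of the rank mismatch behind the setBindingDimensions error above. This is not the backend's actual code; the prepend behavior is my reading of the linked tensorrt.cc line:

# Rough illustration of the "engineDims.nbDims == dimensions.nbDims" check failing
engine_dims = [5, 3, 224, 224]       # rank-4 input binding baked into resnet18.plan
spec_dims = [5, 3, 224, 224]         # dims given in graph_spec
binding_dims = [1] + spec_dims       # backend prepends a batch entry -> rank 5
print(len(engine_dims), len(binding_dims))  # 4 vs 5, so TensorRT rejects the shape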
Expected behavior
We don't want to use the dynamic batcher, so we need to set max_batch_size to 0, and we also need to use CUDA graphs. How should we configure these two features together?
Thanks.
Hi @wangchengdng ,
In your first example, did you try explicitly setting batch_size: 0 for the cuda graph as described here? Zero might be the default when excluded, but just want to double check.
optimization {
  graph: {
    level: 1
  },
  eager_batching: 1,
  cuda: {
    graphs: 1,
    graph_spec: [
      {
        batch_size: 0, <--------------------
        input: {
          key: "INPUT__0",
          value: {dim: [5, 3, 224, 224]}
        }
      }
    ],
    busy_wait_events: 1,
    output_copy_stream: 1
  }
}
If that still doesn't work, @tanmayv25 any thoughts?
@rmccorm4 Thank you. I tried it, but it didn't work. I can see batch_size: 0 in the autocompleted config:
{
"name": "resnet",
"platform": "tensorrt_plan",
"backend": "tensorrt",
"version_policy": {
"latest": {
"num_versions": 1
}
},
"max_batch_size": 0,
"input": [
{
"name": "INPUT__0",
"data_type": "TYPE_FP32",
"format": "FORMAT_NONE",
"dims": [
5,
3,
224,
224
],
"is_shape_tensor": false,
"allow_ragged_batch": false,
"optional": false
}
],
"output": [
{
"name": "OUTPUT__0",
"data_type": "TYPE_FP32",
"dims": [
5,
1000
],
"is_shape_tensor": false
}
],
"batch_input": [],
"batch_output": [],
"optimization": {
"graph": {
"level": 1
},
"priority": "PRIORITY_DEFAULT",
"cuda": {
"graphs": true,
"busy_wait_events": true,
"graph_spec": [
{
"batch_size": 0,
"input": {
"INPUT__0": {
"dim": [
"5",
"3",
"224",
"224"
]
}
}
}
],
"output_copy_stream": true
},
"input_pinned_memory": {
"enable": true
},
"output_pinned_memory": {
"enable": true
},
"gather_kernel_buffer_threshold": 0,
"eager_batching": true
},
"instance_group": [
{
"name": "resnet_0",
"kind": "KIND_GPU",
"count": 1,
"gpus": [
0
],
"secondary_devices": [],
"profile": [],
"passive": false,
"host_policy": ""
}
],
"default_model_filename": "model.plan",
"cc_model_filenames": {},
"metric_tags": {},
"parameters": {},
"model_warmup": []
}
Is Triton able to load the model when not using CUDA graphs and the other optimizations? Can you try providing no config.pbtxt for the model and share what config Triton generates/autocompletes for it?
@tanmayv25 Yes, Triton can successfully load the model whether CUDA graph is enabled or not. The config when I provide no config.pbtxt is as follows:
{
"name": "resnet",
"platform": "tensorrt_plan",
"backend": "tensorrt",
"version_policy": {
"latest": {
"num_versions": 1
}
},
"max_batch_size": 5,
"input": [
{
"name": "INPUT__0",
"data_type": "TYPE_FP32",
"dims": [
3,
224,
224
],
"is_shape_tensor": false
}
],
"output": [
{
"name": "OUTPUT__0",
"data_type": "TYPE_FP32",
"dims": [
1000
],
"is_shape_tensor": false
}
],
"batch_input": [],
"batch_output": [],
"optimization": {
"priority": "PRIORITY_DEFAULT",
"input_pinned_memory": {
"enable": true
},
"output_pinned_memory": {
"enable": true
},
"gather_kernel_buffer_threshold": 0,
"eager_batching": false
},
"instance_group": [
{
"name": "resnet",
"kind": "KIND_GPU",
"count": 1,
"gpus": [
0,
1,
2,
3
],
"secondary_devices": [],
"profile": [],
"passive": false,
"host_policy": ""
}
],
"default_model_filename": "model.plan",
"cc_model_filenames": {},
"metric_tags": {},
"parameters": {},
"model_warmup": [],
"dynamic_batching": {}
}
Should be fixed by https://github.com/triton-inference-server/tensorrt_backend/pull/48. The test is added here: https://github.com/triton-inference-server/server/pull/4913. The fix will be officially available in the Triton 22.10 release.
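Once on 22.10, a minimal client check can confirm that the rank-4, max_batch_size = 0 configuration serves requests. This is a sketch, assuming the server runs on localhost:8000 and the model is named resnet:

import numpy as np
import tritonclient.http as httpclient

# Connect to the local Triton HTTP endpoint
client = httpclient.InferenceServerClient(url="localhost:8000")

# Build a request matching the full non-batched shape from config.pbtxt
inp = httpclient.InferInput("INPUT__0", [5, 3, 224, 224], "FP32")
inp.set_data_from_numpy(np.ones((5, 3, 224, 224), dtype=np.float32))

result = client.infer(model_name="resnet", inputs=[inp])
print(result.as_numpy("OUTPUT__0").shape)  # expected: (5, 1000)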