
Failed to set cuda graph shape when I set max_batch_size==0

Open wangchengdng opened this issue 2 years ago • 4 comments

Description CUDA graph setup fails when I set max_batch_size == 0

Triton Information What version of Triton are you using? 22.04

Are you using the Triton container or did you build it yourself? nvcr.io/nvidia/tritonserver:22.04-py3

To Reproduce I used a pretrained PyTorch ResNet18 model and converted it to an ONNX model:

import torch
from torch import nn
import torchvision
import argparse
import torchvision.models as models

parser = argparse.ArgumentParser()
parser.add_argument("--output_model", type=str, required=True, help="model output path")

def main():
    args = parser.parse_args()
    output_model_path = args.output_model
    model = models.resnet18(pretrained=True)
    model = model.to('cuda:0')
    model.eval()
    x = torch.ones(1, 3, 224, 224).to('cuda:0')
    torch.onnx.export(
            model=model,
            args=x,
            f=output_model_path,
            opset_version=11,
            export_params=True,
            do_constant_folding=True,
            input_names = ['INPUT__0'],
            output_names = ['OUTPUT__0'],
            dynamic_axes={'INPUT__0' : {0:'bs'}, 'OUTPUT__0' : {0:'bs'}}
        )

if __name__ == '__main__':
    main()

Then I converted it to a TensorRT plan file:

trtexec --onnx=resnet18.onnx --explicitBatch --optShapes=INPUT__0:5x3x224x224 --buildOnly --saveEngine=resnet18.plan --workspace=12288 --device=1
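
Not part of the original report: a minimal Python sketch, assuming the TensorRT Python bindings shipped in the 22.04 container (TensorRT 8.2.x), to inspect the generated plan file and confirm that the INPUT__0 binding has 4 dimensions (no separate batch dimension), which is what the error below trips over.

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(TRT_LOGGER)

# Deserialize the plan built by trtexec above.
with open("resnet18.plan", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())

# Print every binding and its shape; INPUT__0 should report 4 dims
# (the first axis may show as -1 if dynamic, or as 5).
for i in range(engine.num_bindings):
    print(engine.get_binding_name(i), tuple(engine.get_binding_shape(i)))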

tritonserver config My config.pbtxt is as follows:

platform: "tensorrt_plan"
max_batch_size : 0
input: [
    {
        name: "INPUT__0",
        data_type: TYPE_FP32,
        dims: [5, 3, 224, 224],
    }
],
instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [0]
  }
]
optimization{
  graph: {
      level : 1
  },
  eager_batching : 1,
  cuda: {
    graphs:1,
    graph_spec: [
      {
        input: {
            key: "INPUT__0",
            value: {dim:[5, 3, 224, 224]}
        }
      }
    ],
    busy_wait_events:1,
    output_copy_stream: 1
  }
}

result When I run tritonserver, I get the following error:

I0823 14:33:25.438942 8964 tensorrt.cc:3193] Detected INPUT__0 as execution binding for resnet_0
I0823 14:33:25.438952 8964 tensorrt.cc:3193] Detected OUTPUT__0 as execution binding for resnet_0
E0823 14:33:25.454761 8964 logging.cc:43] 3: [executionContext.cpp::setBindingDimensions::945] Error Code 3: API Usage Error (Parameter check failed at: runtime/api/executionContext.cpp::setBindingDimensions::945, condition: engineDims.nbDims == dimensions.nbDims
)
E0823 14:33:25.454795 8964 tensorrt.cc:5090] Failed to set cuda graph shape for resnet_0trt failed to set binding dimension to [1,5,3,224,224] for binding 0 for resnet_0
I0823 14:33:25.454810 8964 tensorrt.cc:1426] Created instance resnet_0 on GPU 0 with stream priority 0 and optimization profile default[0];
I0823 14:33:25.454952 8964 backend_model_instance.cc:734] Starting backend thread for resnet_0 at nice 0 on device 0...
I0823 14:33:25.455107 8964 model_repository_manager.cc:1352] successfully loaded 'resnet' version 1

The backend adds a batch dimension, turning the original 5x3x224x224 shape into 1x5x3x224x224. I found that the tensorrt backend prepends this dimension even when max_batch_size is equal to 0: https://github.com/triton-inference-server/tensorrt_backend/blob/main/src/tensorrt.cc#L5348. I also tried setting the batch size in the cuda graph_spec to 5, but then Triton reports an error when validating the configuration. That configuration is as follows (a sketch of the original dimension mismatch follows it):

platform: "tensorrt_plan"
max_batch_size : 0
input: [
    {
        name: "INPUT__0",
        data_type: TYPE_FP32,
        dims: [5, 3, 224, 224],
    }
],
instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [0]
  }
]
optimization{
  graph: {
      level : 1
  },
  eager_batching : 1,
  cuda: {
    graphs:1,
    graph_spec: [
      {
        batch_size:5,
        input: {
            key: "INPUT__0",
            value: {dim:[3, 224, 224]}
        }
      }
    ],
    busy_wait_events:1,
    output_copy_stream: 1
  }
}
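
Not part of the original report: a hypothetical Python sketch of the nbDims mismatch from the first configuration, assuming (as the log suggests) that the backend prepends a batch dimension to the graph_spec shape before calling setBindingDimensions:

# Hypothetical illustration only, not the actual backend code.
engine_binding_dims = [5, 3, 224, 224]   # what the plan file was built with (nbDims == 4)
graph_spec_dims     = [5, 3, 224, 224]   # dims from the cuda graph_spec
prepended_batch     = [1]                # batch dimension the backend appears to prepend

requested_dims = prepended_batch + graph_spec_dims   # [1, 5, 3, 224, 224] -> nbDims == 5

# Mirrors the TensorRT check "engineDims.nbDims == dimensions.nbDims".
if len(requested_dims) != len(engine_binding_dims):
    print("failed to set binding dimension to", requested_dims)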

Expected behavior We don't want to use the dynamic batcher, so we need to set max_batch_size to 0, and we also want to use CUDA graphs. How should we configure these two features together?
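
For reference (not part of the original report), a minimal client sketch assuming the standard tritonclient HTTP package: with max_batch_size set to 0, the batch of 5 is part of the tensor shape the client sends, rather than something the dynamic batcher forms.

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# With max_batch_size == 0 the "batch" of 5 is baked into the tensor shape.
images = np.random.rand(5, 3, 224, 224).astype(np.float32)

infer_input = httpclient.InferInput("INPUT__0", list(images.shape), "FP32")
infer_input.set_data_from_numpy(images)

result = client.infer(model_name="resnet", inputs=[infer_input])
print(result.as_numpy("OUTPUT__0").shape)  # expected: (5, 1000)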

thanks

wangchengdng avatar Aug 23 '22 15:08 wangchengdng

Hi @wangchengdng ,

In your first example, did you try explicitly setting batch_size: 0 for the CUDA graph as described here? Zero might be the default when it is omitted, but I just want to double check.

optimization{
  graph: {
      level : 1
  },
  eager_batching : 1,
  cuda: {
    graphs:1,
    graph_spec: [
      {
        batch_size: 0,    <--------------------
        input: {
            key: "INPUT__0",
            value: {dim:[5, 3, 224, 224]}
        }
      }
    ],
    busy_wait_events:1,
    output_copy_stream: 1
  }
}

If that still doesn't work, @tanmayv25 any thoughts?

rmccorm4 avatar Aug 23 '22 22:08 rmccorm4

@rmccorm4 thank you. I tried it, but it didn't work. I can see that batch_size: 0 is added by config autocompletion:

{
    "name": "resnet",
    "platform": "tensorrt_plan",
    "backend": "tensorrt",
    "version_policy": {
        "latest": {
            "num_versions": 1
        }
    },
    "max_batch_size": 0,
    "input": [
        {
            "name": "INPUT__0",
            "data_type": "TYPE_FP32",
            "format": "FORMAT_NONE",
            "dims": [
                5,
                3,
                224,
                224
            ],
            "is_shape_tensor": false,
            "allow_ragged_batch": false,
            "optional": false
        }
    ],
    "output": [
        {
            "name": "OUTPUT__0",
            "data_type": "TYPE_FP32",
            "dims": [
                5,
                1000
            ],
            "is_shape_tensor": false
        }
    ],
    "batch_input": [],
    "batch_output": [],
    "optimization": {
        "graph": {
            "level": 1
        },
        "priority": "PRIORITY_DEFAULT",
        "cuda": {
            "graphs": true,
            "busy_wait_events": true,
            "graph_spec": [
                {
                    "batch_size": 0,
                    "input": {
                        "INPUT__0": {
                            "dim": [
                                "5",
                                "3",
                                "224",
                                "224"
                            ]
                        }
                    }
                }
            ],
            "output_copy_stream": true
        },
        "input_pinned_memory": {
            "enable": true
        },
        "output_pinned_memory": {
            "enable": true
        },
        "gather_kernel_buffer_threshold": 0,
        "eager_batching": true
    },
    "instance_group": [
        {
            "name": "resnet_0",
            "kind": "KIND_GPU",
            "count": 1,
            "gpus": [
                0
            ],
            "secondary_devices": [],
            "profile": [],
            "passive": false,
            "host_policy": ""
        }
    ],
    "default_model_filename": "model.plan",
    "cc_model_filenames": {},
    "metric_tags": {},
    "parameters": {},
    "model_warmup": []
}

wangchengdng avatar Aug 24 '22 03:08 wangchengdng

Is Triton able to load the model when not using CUDA graphs and the other optimizations? Can you also try providing no config.pbtxt for the model and share the config Triton generates/autocompletes for it?

tanmayv25 avatar Aug 24 '22 19:08 tanmayv25

@tanmayv25 Yes, Triton can successfully load the model whether the CUDA graph is enabled or not. The config Triton generates when I provide no config.pbtxt is as follows:

{
    "name": "resnet",
    "platform": "tensorrt_plan",
    "backend": "tensorrt",
    "version_policy": {
        "latest": {
            "num_versions": 1
        }
    },
    "max_batch_size": 5,
    "input": [
        {
            "name": "INPUT__0",
            "data_type": "TYPE_FP32",
            "dims": [
                3,
                224,
                224
            ],
            "is_shape_tensor": false
        }
    ],
    "output": [
        {
            "name": "OUTPUT__0",
            "data_type": "TYPE_FP32",
            "dims": [
                1000
            ],
            "is_shape_tensor": false
        }
    ],
    "batch_input": [],
    "batch_output": [],
    "optimization": {
        "priority": "PRIORITY_DEFAULT",
        "input_pinned_memory": {
            "enable": true
        },
        "output_pinned_memory": {
            "enable": true
        },
        "gather_kernel_buffer_threshold": 0,
        "eager_batching": false
    },
    "instance_group": [
        {
            "name": "resnet",
            "kind": "KIND_GPU",
            "count": 1,
            "gpus": [
                0,
                1,
                2,
                3
            ],
            "secondary_devices": [],
            "profile": [],
            "passive": false,
            "host_policy": ""
        }
    ],
    "default_model_filename": "model.plan",
    "cc_model_filenames": {},
    "metric_tags": {},
    "parameters": {},
    "model_warmup": [],
    "dynamic_batching": {}
}

wangchengdng avatar Aug 25 '22 06:08 wangchengdng

This should be fixed by https://github.com/triton-inference-server/tensorrt_backend/pull/48. The test is added here: https://github.com/triton-inference-server/server/pull/4913. The fix will be officially available in the Triton 22.10 release.

tanmayv25 avatar Sep 23 '22 18:09 tanmayv25