
Questions about input and output shape in model configuration when batch size is 1

Open jackylu0124 opened this issue 1 year ago • 2 comments

Hey all, I have a question regarding the input and output shape settings in the model configuration file. Basically, I have a model that takes in images in the NCHW layout (more specifically, C=3, and H and W can be variable-size positive integers), and it also outputs tensors in the NCHW layout (again with C=3 and variable-size H and W). Due to the relatively large size of this model and the limited memory on my GPU, I want to set a batch size of 1 for both the input and output tensors.

Based on my understanding of the following paragraph in the documentation https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/user_guide/model_configuration.html:

Input and output shapes are specified by a combination of max_batch_size and the dimensions specified by the input or output dims property. For models with max_batch_size greater-than 0, the full shape is formed as [ -1 ] + dims. For models with max_batch_size equal to 0, the full shape is formed as dims.

my questions are:

1. Are the following 2 configurations for the model input and output shapes equivalent, and do they have identical effects for specifying the input and output shape for the model?
2. From the inference server's point of view, are these two configurations treated in any different ways, or are they indistinguishable?
3. If the following 2 configurations are equivalent, are they considered equivalent by all the backends supported by the Triton Inference Server (e.g. the onnxruntime backend, the python backend, etc.)?

Configuration 1:

max_batch_size: 1

input [
    {
        name: "input"
        data_type: TYPE_FP32
        dims: [3, -1, -1]
    }
]

output [
    {
        name: "output"
        data_type: TYPE_FP32
        dims: [3, -1, -1]
    }
]

Configuration 2:

max_batch_size: 0

input [
    {
        name: "input"
        data_type: TYPE_FP32
        dims: [1, 3, -1, -1]
    }
]

output [
    {
        name: "output"
        data_type: TYPE_FP32
        dims: [1, 3, -1, -1]
    }
]

Thank you very much for your time and help in advance!

jackylu0124 avatar May 16 '24 03:05 jackylu0124

@tanmayv25 can you help here?

statiraju avatar May 16 '24 22:05 statiraju

  1. Are the following 2 configurations for the model input and output shapes equivalent, and do they have identical effects for specifying the input and output shape for the model?

Both configurations are identical. In both cases the client has to provide an input with shape [1, 3, -1, -1], where each -1 can be any positive integer, and the received output will have shape [1, 3, -1, -1].

They have an identical impact in that Triton core will forward requests with input shape [1, 3, -1, -1] to the backend and receive outputs of shape [1, 3, -1, -1] from the backend.
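For concreteness, here is a minimal Python client sketch (using the tritonclient HTTP API) that would work with either configuration. The model name "my_model", the server URL, and the 512x512 spatial size are placeholder assumptions; "input" and "output" match the tensor names in the configs above.

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# With either configuration, the request tensor must carry the explicit
# batch dimension, i.e. shape [1, 3, H, W].
image = np.random.rand(1, 3, 512, 512).astype(np.float32)

infer_input = httpclient.InferInput("input", list(image.shape), "FP32")
infer_input.set_data_from_numpy(image)

response = client.infer(model_name="my_model", inputs=[infer_input])
output = response.as_numpy("output")  # NCHW, shape [1, 3, H', W'] depending on the model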

  2. From the inference server's point of view, are these two configurations treated in any different ways, or are they indistinguishable?

The server's point of view will be identical. The only difference arises when the dynamic_batching field is enabled: with dynamic batching enabled and max_batch_size = 1, the request goes into an additional queue and is picked up as soon as an instance is available for execution. When dynamic batching is disabled, there is no difference even in the request control flow.
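To make that concrete, enabling the extra queueing path on top of Configuration 1 would look roughly like the sketch below (the queue delay value is only an illustrative placeholder). Note that dynamic_batching can only be combined with Configuration 1, since Configuration 2 has max_batch_size: 0 and therefore no batch dimension for Triton to batch along.

max_batch_size: 1

dynamic_batching {
    max_queue_delay_microseconds: 100
}

# input and output sections stay the same as in Configuration 1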

  3. If the following 2 configurations are equivalent, are they considered equivalent by all the backends supported by the Triton Inference Server (e.g. the onnxruntime backend, the python backend, etc.)?

They will be completely identical from the backend's perspective. In fact, the max_batch_size value is not even propagated to the backend during inference execution. However, during auto-completion of the model config, a backend may enable the dynamic_batching setting, which can introduce an extra queue transaction in the control flow. To my knowledge, none of the standard backends do that for max_batch_size = 1 (only when max_batch_size > 1). The TensorFlow backend's behavior is described here: https://github.com/triton-inference-server/tensorflow_backend?tab=readme-ov-file#dynamic-batching
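If you want to double-check what Triton actually settled on after auto-completion, you can fetch the loaded configuration from the server. A quick sketch (model name and URL are placeholders, as above):

import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")
# Returns the fully auto-completed model configuration as a dict; inspect the
# max_batch_size and dynamic_batching fields to see what the backend filled in.
print(client.get_model_config("my_model"))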

tanmayv25 avatar May 20 '24 19:05 tanmayv25

@tanmayv25 Got it, thank you very much for your detailed explanation and clarification! I really appreciate it!

jackylu0124 avatar May 22 '24 14:05 jackylu0124