                        How to make use of dynamic batching with Triton Python backend?
Description
I would not say this is a bug, but I cannot make use of the dynamic_batching feature with the Triton Python backend.
Triton Information
Triton image: nvcr.io/nvidia/tritonserver:23.10-pyt-python-py3
To Reproduce
I created a dummy Python backend, which accepts input named tensors and returns output named tensors.
- model.py
import logging
import triton_python_backend_utils as pb_utils
import numpy as np
logging.basicConfig(format="%(asctime)s %(message)s")
logger = logging.getLogger()
logger.setLevel(logging.INFO)
class TritonPythonModel:
    def initialize(self, args):
        logger.info("Init")
    def execute(self, requests):
        logger.info(f"Input length: {len(requests)}")
        responses = []
        for request in requests:
            input_tensor = pb_utils.get_input_tensor_by_name(request, "input").as_numpy()

            logger.info(f"Input shape: {input_tensor.shape}")

            ###
            # Perform inference task against the `input_tensor` tensor
            ###

            # Shape (1, 1): keep the batch dimension so the output matches
            # max_batch_size > 0 with dims [ -1 ].
            inference_response = pb_utils.InferenceResponse(
                output_tensors=[
                    pb_utils.Tensor("output", np.asarray([["Dummy output"]], dtype=object))
                ]
            )
            responses.append(inference_response)

        return responses
    def finalize(self):
        logger.info("Cleaned")
- config.pbtxt
name: "dummy"
backend: "python"
max_batch_size: 8
instance_group [
    {
        count: 1
        kind: KIND_GPU
    }
]
input [
    {
        name: "input"
        data_type: TYPE_STRING
        dims: [ -1 ]
    }
]
output [
    {
        name: "output"
        data_type: TYPE_STRING
        dims: [ -1 ]
    }
]
dynamic_batching {
}
- Start server command
tritonserver --model-repository=/models --log-verbose=1 --model-control-mode=explicit --load-model=dummy
- Test
docker run -it --rm --net host nvcr.io/nvidia/tritonserver:23.11-py3-sdk bash
perf_analyzer -m dummy --percentile=95 --concurrency-range 1:4 --shape input:7
Expected behavior
Following this optimization-related documentation, I believe that when dynamic batching is enabled, Triton will automatically stack requests into a single batched input. Say one input has a shape of (1, 7); based on the perf_analyzer command above, with dynamic batching the shape should become (x, 7), with x larger than 1 and at most 8, my max batch size.
However, in practice, what I receive is a list of requests, each of which has a single input.
# Example logs
2023-12-26 08:21:39,316 Input length: 3
2023-12-26 08:21:39,316 Input shape: (1, 6)
2023-12-26 08:21:39,316 Input shape: (1, 6)
2023-12-26 08:21:39,316 Input shape: (1, 6)
2023-12-26 08:21:39,317 Input length: 1
2023-12-26 08:21:39,317 Input shape: (1, 6)
2023-12-26 08:21:39,318 Input length: 3
2023-12-26 08:21:39,318 Input shape: (1, 6)
2023-12-26 08:21:39,318 Input shape: (1, 6)
2023-12-26 08:21:39,318 Input shape: (1, 6)
When I disable dynamic_batching, execute always receives a single request with a batch-1 input.
Based on the observed results, my current understanding is that Triton batches multiple requests into a list of requests and sends that list to one instance of my model. But with dynamic_batching, shouldn't it stack the inputs of multiple requests into a single request with a larger batch size?
Btw, happy holiday you guys!
Dynamic batching is a feature of Triton Inference Server that allows inference requests to be combined by the server, so that a batch is created dynamically. This typically results in increased throughput.
To use dynamic batching with the Python backend in Triton, you need to understand that the Python backend is unique compared to some of the other backends (like TensorFlow, PyTorch, etc.). It doesn't currently implement the actual "batching" logic for you. It supports dynamic batching in the sense that Triton can gather the requests in the server/core and then send those requests together in a single API call to the model.
In your model's execute method, you should handle these batches of requests appropriately. For example, if your model is using PyTorch or TensorFlow, you might need to stack the inputs along a new dimension to create a batched tensor, then pass this batched tensor through your model.
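For illustration, below is a minimal sketch of one way to do that stacking inside execute, assuming every request carries an input with the same trailing shape (as in the logs above); the actual model call is left as a placeholder that simply echoes the input back.
- Example (sketch): stacking requests into one batch inside execute
import numpy as np
import triton_python_backend_utils as pb_utils

class TritonPythonModel:
    def execute(self, requests):
        # Gather the per-request inputs; each has shape (batch_i, N).
        batches = [
            pb_utils.get_input_tensor_by_name(request, "input").as_numpy()
            for request in requests
        ]
        batch_sizes = [b.shape[0] for b in batches]

        # Stack them into one array of shape (sum(batch_i), N) so the
        # underlying model sees a single large batch.
        batched_input = np.concatenate(batches, axis=0)

        ###
        # Run the actual model once on `batched_input` here.
        # Placeholder: echo the (string) input back as the output.
        ###
        batched_output = batched_input

        # Split the batched output back into one response per request,
        # keeping the original request order.
        responses = []
        offset = 0
        for batch_size in batch_sizes:
            out = batched_output[offset:offset + batch_size]
            offset += batch_size
            responses.append(
                pb_utils.InferenceResponse(
                    output_tensors=[pb_utils.Tensor("output", out)]
                )
            )
        return responses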
name: "dummy"
backend: "onnxruntime"
max_batch_size: 8
instance_group [
    {
        count: 1
        kind: KIND_GPU
    }
]
input [
    {
        name: "input"
        data_type: TYPE_FP32
        dims: [ -1 ]
    }
]
output [
    {
        name: "output"
        data_type: TYPE_FP32
        dims: [ -1 ]
    }
]
dynamic_batching {
    preferred_batch_size: [ 2, 4, 8 ]
}
When the tensor shapes in multiple requests are inconsistent, will Triton automatically pad them so that the inputs from these requests can be grouped into one batch? This situation is very common: for example, when the ONNX backend model above receives one request with an input of shape 1 x 15 and another with an input of shape 3 x 20, will Triton automatically pad and combine these two inputs into a 4 x 20 batch? If Triton does pad automatically, is there an interface to control the padding value? If it cannot do this automatically, will Triton's dynamic_batching only combine inputs with the same dimensions, so that I need to add a Python backend that pads the inputs from multiple requests to the same dimensions in order to use dynamic_batching? Mainly, I want to know whether Triton provides an interface to automatically pad the inputs of multiple requests. Thanks
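For reference, a rough sketch of the padding step such a Python backend would need before stacking, assuming 2-D FP32 inputs padded with zeros (the helper name and pad value here are only illustrative); splitting the batched output back per request would then work as in the earlier sketch.
- Example (sketch): padding ragged inputs before stacking
import numpy as np

def pad_and_stack(arrays, pad_value=0.0):
    # e.g. inputs of shape (1, 15) and (3, 20) -> one batch of shape (4, 20)
    max_len = max(a.shape[1] for a in arrays)
    padded = [
        np.pad(a, ((0, 0), (0, max_len - a.shape[1])), constant_values=pad_value)
        for a in arrays
    ]
    return np.concatenate(padded, axis=0)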