Inference in Triton ensemble model is much slower than single model in Triton
Description
I'm using a Triton Server ensemble model to chain several models together, say [Model A, Model B, Model C, Model D]. The ensemble takes an input image and passes it sequentially through the pipeline (Model A, then Model B, then Model C, then Model D). Only one model, Model B, is a deep learning model that runs on the GPU; the other three (Model A, Model C and Model D) run on the CPU and are typical pre-processing and post-processing models. I use:
- Dynamic batching: each model produces a single-batch image (1, 3, w, h), but I have multiple clients connecting to Triton.
- Ragged tensors: Model C produces a variable number of detections.
- TensorRT accelerator for the GPU model (GPU utilization from metrics: 0.16).
- OpenVINO accelerator for the CPU models (CPU utilization from metrics: 0.997).
The CPU and GPU utilization looks good, except that requests are not being batched correctly! I get a VERY LOW FPS when inferring through Triton compared to outside Triton. For instance, if I serve only Model B (the deep learning model) in Triton and run the pre-processing and post-processing outside Triton, it performs much better (~25 FPS), but I get only ~6 FPS when the pre-processing and post-processing run inside Triton as part of the ensemble model.
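For context, dynamic batching can only merge requests that are sitting in the scheduler queue at the same time, so the batch sizes Triton forms depend on how much the clients' single-image requests overlap. Below is a minimal, hypothetical sketch of that kind of concurrent traffic (it is not the code from this issue; the input names match the client snippet further down, everything else is a placeholder):

import threading
import numpy as np
import tritonclient.grpc as grpcclient

ENSEMBLE = "detection_ensemble_model"  # ensemble name used in the client snippet below

def one_client(camera_id, url="localhost:8001"):
    # Each simulated client sends batch-1 images in a loop, so requests from
    # different clients can overlap in the server queue and be batched together.
    client = grpcclient.InferenceServerClient(url=url)
    image = np.zeros((1, 1520, 2688, 3), dtype=np.uint8)   # placeholder frame
    camera = np.array([[camera_id]], dtype=np.float16)
    inputs = [grpcclient.InferInput("input_image", list(image.shape), "UINT8"),
              grpcclient.InferInput("input_camera_id", list(camera.shape), "FP16")]
    inputs[0].set_data_from_numpy(image)
    inputs[1].set_data_from_numpy(camera)
    for _ in range(100):
        client.infer(model_name=ENSEMBLE, inputs=inputs)

threads = [threading.Thread(target=one_client, args=(i,)) for i in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()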
Triton Information
What version of Triton are you using?
nvcr.io/nvidia/tritonserver:24.04-py3
Are you using the Triton container or did you build it yourself?
Triton container
To Reproduce
- Model A config.pbtxt:
name: "ModelA"
backend: "python"
max_batch_size: 64
input [
  {
    name: "input0"
    data_type: TYPE_UINT8
    dims: [ 1520, 2688, 3 ]
  }
]
output [
  {
    name: "output0"
    data_type: TYPE_FP16
    dims: [ 3, 768, 1280 ]
  }
]
dynamic_batching {
}
instance_group [
  {
    count: 8
    kind: KIND_CPU
  }
]
optimization {
  execution_accelerators {
    cpu_execution_accelerator : [{
      name : "openvino"
    }]
  }
}
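Model A's model.py is not shown in the issue (the maintainers ask for it below), so the following is only a hypothetical sketch of a Python-backend pre-processing model using the input0/output0 names from the config above. The point it illustrates: with dynamic_batching, execute() receives a list of requests, and the model only benefits from batching if it processes them together rather than one at a time.

import numpy as np
import triton_python_backend_utils as pb_utils

class TritonPythonModel:
    def execute(self, requests):
        # With dynamic_batching enabled, `requests` may contain several batch-1
        # requests that the scheduler grouped together.
        images = [pb_utils.get_input_tensor_by_name(r, "input0").as_numpy() for r in requests]
        batch = np.concatenate(images, axis=0)  # (N, 1520, 2688, 3)

        # Real pre-processing (resize to 1280x768, normalization, NHWC -> NCHW)
        # would go here; this placeholder only allocates the expected output shape.
        processed = np.zeros((batch.shape[0], 3, 768, 1280), dtype=np.float16)

        # Return one response per request, in the same order as `requests`.
        responses = []
        for i in range(len(requests)):
            out = pb_utils.Tensor("output0", processed[i:i + 1])
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses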
- Model B config.pbtxt:
name: "ModelB"
backend: "tensorrt"
max_batch_size: 64
input [
  {
    name: "input0"
    data_type: TYPE_FP16
    dims: [ 3, 768, 1280 ]
  }
]
output [
  {
    name: "output0"
    data_type: TYPE_FP16
    dims: [ 61200, 8 ]
  }
]
dynamic_batching {
}
instance_group [
  {
    count: 2
    kind: KIND_GPU
  }
]
optimization {
  execution_accelerators {
    gpu_execution_accelerator : [{
      name : "tensorrt"
      parameters { key: "precision_mode" value: "FP16" }
      parameters { key: "max_workspace_size_bytes" value: "4294967296" }
      parameters { key: "trt_engine_cache_enable" value: "1" }
    }]
  }
}
- Model C config.pbtxt:
name: "ModelC"
backend: "python"
max_batch_size: 64
input [
  {
    name: "input0"
    data_type: TYPE_FP16
    dims: [ 61200, 8 ]
  }
]
output [
  {
    name: "output0"
    data_type: TYPE_FP16
    dims: [ -1, 6 ]
  }
]
dynamic_batching {
}
instance_group [
  {
    count: 8
    kind: KIND_CPU
  }
]
optimization {
  execution_accelerators {
    cpu_execution_accelerator : [{
      name : "openvino"
    }]
  }
}
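Model C's output has a variable first dimension (-1), i.e. a different number of detections per request. In a Python-backend model that simply means each response carries an output whose leading (non-batch) dimension varies; a hypothetical sketch (the score threshold and column layout are placeholders, not the code from this issue):

import numpy as np
import triton_python_backend_utils as pb_utils

class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for request in requests:
            raw = pb_utils.get_input_tensor_by_name(request, "input0").as_numpy()  # (1, 61200, 8)
            rows = raw[0]
            # Placeholder filtering: keep rows whose score (assumed to be column 4)
            # exceeds a threshold, giving a (k, 6) result with k varying per request.
            keep = rows[rows[:, 4] > 0.5][:, :6].astype(np.float16)
            out = pb_utils.Tensor("output0", keep[np.newaxis])  # (1, k, 6): batch dim + ragged k
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses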
- Model D config.pbtxt:
name: "ModelD"
backend: "python"
max_batch_size: 64
input [
  {
    name: "input0"
    data_type: TYPE_FP16
    dims: [ -1, 6 ]
    allow_ragged_batch: true
  }
]
batch_input [
  {
    kind: BATCH_ACCUMULATED_ELEMENT_COUNT
    target_name: "INDEX"
    data_type: TYPE_FP32
    source_input: "detection_bytetracker_input"
  }
]
input [
  {
    name: "input1"
    data_type: TYPE_FP16
    dims: [ 1 ]
  }
]
output [
  {
    name: "input2"
    data_type: TYPE_FP16
    dims: [ -1, 7 ]
  }
]
dynamic_batching {
}
instance_group [
  {
    count: 8
    kind: KIND_CPU
  }
]
optimization {
  execution_accelerators {
    cpu_execution_accelerator : [{
      name : "openvino"
    }]
  }
}
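Model D combines allow_ragged_batch with a BATCH_ACCUMULATED_ELEMENT_COUNT batch input, i.e. Triton adds an "INDEX" tensor that (as I understand the batch_input semantics) holds the accumulated element count of the ragged input across the requests in a batch. Purely as an illustration of what that tensor encodes, in plain NumPy and with made-up sizes, the concatenated detections can be split back into per-request groups like this:

import numpy as np

# Concatenation of three ragged requests with 2, 5 and 3 detections (6 values per row).
all_dets = np.zeros((2 + 5 + 3, 6), dtype=np.float16)

# BATCH_ACCUMULATED_ELEMENT_COUNT accumulates *element* counts, so for (k, 6)
# inputs each entry is a running sum of k * 6 (TYPE_FP32 per the config above).
index = np.array([12.0, 42.0, 60.0], dtype=np.float32)

# Convert accumulated element counts back to row boundaries and split per request.
row_ends = (index / all_dets.shape[1]).astype(int)   # [2, 7, 10]
per_request = np.split(all_dets, row_ends[:-1])      # shapes (2, 6), (5, 6), (3, 6)
print([d.shape for d in per_request])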
- Inference from the clients uses CUDA shared memory (cudashm). These are code snippets, not the entire code:
# Imports assumed by the snippets below
import numpy as np
import tritonclient.grpc as grpcclient
import tritonclient.utils.cuda_shared_memory as cudashm
...
self.triton_client = grpcclient.InferenceServerClient(url=self.triton_server_ip, verbose=self.verbose)
self.input = [grpcclient.InferInput("input_image", (1, self.input_camera_height, self.input_camera_width, 3), "UINT8"),
grpcclient.InferInput("input_camera_id", (1, 1), "FP16")]
self.output = grpcclient.InferRequestedOutput("predictions")
...
# Create Output in Shared Memory and store shared memory handles
self.shm_op_handle = cudashm.create_shared_memory_region(f"output_data_{self.camera_id}",
self.output_byte_size, 0)
self.shm_ip_image_handle = cudashm.create_shared_memory_region(f"input_data_image_{self.camera_id}",
self.input_image_byte_size, 0)
self.shm_ip_camera_handle = cudashm.create_shared_memory_region(f"input_data_camera_{self.camera_id}",
self.input_camera_byte_size, 0)
# Register Output shared memory with Triton Server
self.triton_client.register_cuda_shared_memory(f"output_data_{self.camera_id}",
cudashm.get_raw_handle(self.shm_op_handle), 0,
self.output_byte_size)
self.triton_client.register_cuda_shared_memory(f"input_data_image_{self.camera_id}",
cudashm.get_raw_handle(self.shm_ip_image_handle),
0, self.input_image_byte_size)
self.triton_client.register_cuda_shared_memory(f"input_data_camera_{self.camera_id}",
cudashm.get_raw_handle(self.shm_ip_camera_handle),
0, self.input_camera_byte_size)
...
self.input[0].set_shared_memory(f"input_data_image_{self.camera_id}", self.input_image_byte_size)
self.input[1].set_shared_memory(f"input_data_camera_{self.camera_id}", self.input_camera_byte_size)
self.output.set_shared_memory(f"output_data_{self.camera_id}", self.output_byte_size)
...
# Set CUDA Shared memory
cudashm.set_shared_memory_region(self.shm_ip_image_handle, [frame])
cudashm.set_shared_memory_region(self.shm_ip_camera_handle, [np.expand_dims(np.array(self.camera_id), 0).astype(np.float16)])
# Inference with server
results = self.triton_client.infer(model_name="detection_ensemble_model", inputs=[self.input[0], self.input[1]],
outputs=[self.output], client_timeout=self.client_timeout)
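One way to check whether the dynamic batcher is actually forming batches (rather than running batch-1 executions back to back) is to query the server-side statistics after a load run. A minimal sketch with the same gRPC client; the field names follow the statistics extension, but it is worth verifying the exact JSON layout against your Triton version:

import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")

# Per-model statistics; batch_stats reports how many executions ran at each batch size.
stats = client.get_inference_statistics(model_name="ModelB", as_json=True)
for model in stats.get("model_stats", []):
    print(model.get("name"), model.get("version", ""))
    for batch in model.get("batch_stats", []):
        size = batch.get("batch_size")
        count = batch.get("compute_infer", {}).get("count")
        print(f"  batch_size={size}: {count} executions")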
Expected behavior
Each request carries batch_size=1, since I have multiple Triton clients, each sending a single image at a time. With dynamic batching I expect Triton to concatenate requests along the batch dimension: for example, if 10 requests arrive at the same time, each with shape (1, 3, 768, 1280), I expect Triton to batch them as (10, 3, 768, 1280) and process them all at once. I also expect the pipeline to be faster with pre-processing and post-processing running inside Triton, but I get a VERY LOW FPS instead. It looks like requests are still being processed sequentially instead of being batched!
[DLIS-6702]
@AWallyAllah By any chance are you using PyTorch in your Python model? Could you share the code for model A?
@AWallyAllah Could you please share the model file with us so that we can further investigate?
Closing due to lack of activity. Please re-open the issue and provide the model files if you would like to follow up on this issue.