
Inference in Triton ensemble model is much slower than single model in Triton

AWallyAllah opened this issue on May 14, 2024

Description

I'm using a Triton Server ensemble model made of several models connected to each other, say [Model A, Model B, Model C, Model D]. The ensemble takes an input image and passes it sequentially through this pipeline (Model A, then Model B, then Model C, then Model D). Only one of them, Model B, is a deep learning model and runs on the GPU; the other three (Model A, Model C and Model D) run on the CPU and are essentially pre-processing and post-processing models. I use:

  1. Dynamic batching: each model sees batch-size-1 inputs (e.g. (1, 3, w, h)) because each of my multiple clients sends a single image at a time.
  2. Ragged tensors: Model C produces a variable number of detections.
  3. TensorRT accelerator for the GPU model (GPU utilization from metrics: 0.16).
  4. OpenVINO accelerator for the CPU models (CPU utilization from metrics: 0.997).

The CPU and GPU utilization look fine, but the requests do not seem to be batched correctly: I get a VERY LOW FPS when inferring through Triton compared to outside Triton. For instance, if I serve only Model B (the deep learning model) in Triton and run the pre-processing and post-processing outside Triton, it performs much better (~25 FPS). But I only get ~6 FPS when the pre-processing and post-processing run inside Triton as part of the ensemble model.
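For reference, the ensemble itself is wired up in the standard way with platform: "ensemble" and an ensemble_scheduling step list. A trimmed sketch of mine (the intermediate tensor names below are illustrative, not the exact config):

name: "detection_ensemble_model"
platform: "ensemble"
max_batch_size: 64

input [
  { name: "input_image" data_type: TYPE_UINT8 dims: [ 1520, 2688, 3 ] },
  { name: "input_camera_id" data_type: TYPE_FP16 dims: [ 1 ] }
]
output [
  { name: "predictions" data_type: TYPE_FP16 dims: [ -1, 7 ] }
]

ensemble_scheduling {
  step [
    {
      model_name: "ModelA"
      model_version: -1
      input_map { key: "input0" value: "input_image" }
      output_map { key: "output0" value: "preprocessed" }
    },
    {
      model_name: "ModelB"
      model_version: -1
      input_map { key: "input0" value: "preprocessed" }
      output_map { key: "output0" value: "raw_detections" }
    },
    {
      model_name: "ModelC"
      model_version: -1
      input_map { key: "input0" value: "raw_detections" }
      output_map { key: "output0" value: "filtered_detections" }
    },
    {
      model_name: "ModelD"
      model_version: -1
      input_map { key: "input0" value: "filtered_detections" }
      input_map { key: "input1" value: "input_camera_id" }
      output_map { key: "input2" value: "predictions" }
    }
  ]
}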

Triton Information

What version of Triton are you using?

nvcr.io/nvidia/tritonserver:24.04-py3

Are you using the Triton container or did you build it yourself?

Triton container

To Reproduce

  1. Model A Config.pbtxt:
name: "ModelA"
backend: "python"
max_batch_size: 64

input [
{
    name: "input0"
    data_type: TYPE_UINT8
    dims: [ 1520, 2688, 3 ]
}
]

output [
{
    name: "output0"
    data_type: TYPE_FP16
    dims: [ 3, 768, 1280 ]
}
]

dynamic_batching {
}

instance_group [
    {
        count: 8
        kind: KIND_CPU
    }
]

optimization {
  execution_accelerators {
    cpu_execution_accelerator : [{
      name : "openvino"
    }]
  }
}
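The model.py behind Model A is essentially a resize/normalize loop over the requests Triton passes to execute(). A simplified sketch of that logic (not the exact code; the real preprocessing is more involved):

# model.py for ModelA -- simplified sketch, not the exact implementation
import cv2
import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        # With dynamic batching, Triton may hand several requests to one
        # execute() call, and each request may itself carry batch_size > 1,
        # so nothing here assumes a single image.
        responses = []
        for request in requests:
            images = pb_utils.get_input_tensor_by_name(request, "input0").as_numpy()
            # images: (batch, 1520, 2688, 3) uint8
            batch = []
            for img in images:
                resized = cv2.resize(img, (1280, 768))              # -> (768, 1280, 3)
                chw = (resized.transpose(2, 0, 1) / 255.0).astype(np.float16)
                batch.append(chw)                                   # illustrative normalization
            out = pb_utils.Tensor("output0", np.stack(batch))       # (batch, 3, 768, 1280) fp16
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses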
  2. Model B Config.pbtxt:
name: "ModelB"
backend: "tensorrt"
max_batch_size: 64

input [
  {
    name: "input0"
    data_type: TYPE_FP16
    dims:  [3 , 768, 1280]
  }]
output [
  {
    name: "output0"
    data_type: TYPE_FP16
    dims: [61200, 8]
  }
]

dynamic_batching {
}

instance_group [
    {
      count: 2
      kind: KIND_GPU
    }
]

optimization {
  execution_accelerators {
    gpu_execution_accelerator : [ {
      name : "tensorrt"
      parameters { key: "precision_mode" value: "FP16" }
      parameters { key: "max_workspace_size_bytes" value: "4294967296" }
      parameters { key: "trt_engine_cache_enable" value: "1" }
    }]
  }
}
  3. Model C Config.pbtxt:
name: "ModelC"
backend: "python"
max_batch_size: 64

input [
{
    name: "input0"
    data_type: TYPE_FP16
    dims: [61200, 8]
}
]

output [
{
    name: "output0"
    data_type: TYPE_FP16
    dims: [ -1, 6 ]
}
]

dynamic_batching {
}

instance_group [
    {
        count: 8
        kind: KIND_CPU
    }
]

optimization {
  execution_accelerators {
    cpu_execution_accelerator : [{
      name : "openvino"
    }]
  }
}
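Model C is the post-processing step: it thresholds/filters the (61200, 8) raw detections and returns a variable number of rows per image, which is why the output dims are [ -1, 6 ]. A simplified sketch (in my setup each request carries a single image):

# model.py for ModelC -- simplified sketch of the post-processing logic
import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for request in requests:
            raw = pb_utils.get_input_tensor_by_name(request, "input0").as_numpy()
            preds = raw[0]                # (61200, 8) for a batch-1 request
            keep = preds[:, 4] > 0.25     # confidence filter (column index and threshold are illustrative)
            dets = preds[keep][:, :6].astype(np.float16)[None, ...]   # (1, N, 6), N varies per image
            out = pb_utils.Tensor("output0", dets)
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses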
  4. Model D Config.pbtxt:
name: "ModelD"
backend: "python"
max_batch_size: 64

input [
{
    name: "input0"
    data_type: TYPE_FP16
    dims: [ -1, 6 ]
    allow_ragged_batch: true
}
]

batch_input [
  {
    kind: BATCH_ACCUMULATED_ELEMENT_COUNT
    target_name: "INDEX"
    data_type: TYPE_FP32
    source_input: "detection_bytetracker_input"
  }
]

input [
{
    name: "input1"
    data_type: TYPE_FP16
    dims: [ 1 ]
}
]

output [
{
    name: "input2"
    data_type: TYPE_FP16
    dims: [ -1, 7 ]
}
]

dynamic_batching {
}

instance_group [
    {
        count: 8
        kind: KIND_CPU
    }
]

optimization {
  execution_accelerators {
    cpu_execution_accelerator : [{
      name : "openvino"
    }]
  }
}
  5. Inference from the clients uses cudashm (these are code snippets, not the entire code):
self.triton_client = grpcclient.InferenceServerClient(url=self.triton_server_ip, verbose=self.verbose)
self.input = [grpcclient.InferInput("input_image", (1, self.input_camera_height, self.input_camera_width, 3), "UINT8"),
              grpcclient.InferInput("input_camera_id", (1, 1), "FP16")]
self.output = grpcclient.InferRequestedOutput("predictions")

...

# Create Output in Shared Memory and store shared memory handles
self.shm_op_handle = cudashm.create_shared_memory_region(f"output_data_{self.camera_id}",
                                                         self.output_byte_size, 0)
self.shm_ip_image_handle = cudashm.create_shared_memory_region(f"input_data_image_{self.camera_id}",
                                                         self.input_image_byte_size, 0)
self.shm_ip_camera_handle = cudashm.create_shared_memory_region(f"input_data_camera_{self.camera_id}",
                                                         self.input_camera_byte_size, 0)
# Register Output shared memory with Triton Server
self.triton_client.register_cuda_shared_memory(f"output_data_{self.camera_id}",
                                               cudashm.get_raw_handle(self.shm_op_handle), 0,
                                               self.output_byte_size)
self.triton_client.register_cuda_shared_memory(f"input_data_image_{self.camera_id}",
                                               cudashm.get_raw_handle(self.shm_ip_image_handle),
                                               0, self.input_image_byte_size)
self.triton_client.register_cuda_shared_memory(f"input_data_camera_{self.camera_id}",
                                               cudashm.get_raw_handle(self.shm_ip_camera_handle),
                                               0, self.input_camera_byte_size)

...

self.input[0].set_shared_memory(f"input_data_image_{self.camera_id}", self.input_image_byte_size)
self.input[1].set_shared_memory(f"input_data_camera_{self.camera_id}", self.input_camera_byte_size)
self.output.set_shared_memory(f"output_data_{self.camera_id}", self.output_byte_size)

...

# Set CUDA Shared memory
cudashm.set_shared_memory_region(self.shm_ip_image_handle, [frame])
cudashm.set_shared_memory_region(self.shm_ip_camera_handle, [np.expand_dims(np.array(self.camera_id), 0).astype(np.float16)])

# Inference with server
results = self.triton_client.infer(model_name="detection_ensemble_model", inputs=[self.input[0], self.input[1]],
                                   outputs=[self.output], client_timeout=self.client_timeout)

Expected behavior

Each request has batch_size=1, since each of my Triton clients sends a single image at a time. With dynamic batching, I expected that moving the pre-processing and post-processing models into Triton would make things faster: Triton should concatenate concurrent requests along the batch dimension. For example, if 10 requests arrive at the same time, each with shape (1, 3, 768, 1280), I expect Triton to batch them as (10, 3, 768, 1280) and process them all at once. Instead I get a VERY LOW FPS, and it looks like the requests are still being processed sequentially instead of being batched!
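To check whether the batching is actually happening, I look at the per-model statistics the server reports. A sketch of that check for Model B (the server URL here is a placeholder; if every execution shows batch_size 1, dynamic batching is not kicking in):

import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")

# batch_stats lists, for every batch size the server actually executed,
# how many inference executions used that batch size.
stats = client.get_inference_statistics(model_name="ModelB")
for model in stats.model_stats:
    print(model.name, model.version)
    for bs in model.batch_stats:
        print("  batch_size:", bs.batch_size,
              "executions:", bs.compute_infer.count)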

AWallyAllah · May 14 '24

opened [DLIS-6702]

statiraju · May 15 '24

@AWallyAllah By any chance are you using PyTorch in your Python model? Could you share the code for model A?

Tabrizian · May 15 '24

@AWallyAllah Could you please share the model file with us so that we can further investigate?

krishung5 · Jun 07 '24

Closing due to lack of activity. Please re-open the issue and provide the model files if you would like to follow up on this.

krishung5 · Jun 17 '24