
Inference in Triton ensemble model is much slower than single model in Triton

AWallyAllah opened this issue on May 14, 2024

Description

I'm using a Triton Server ensemble model made of several models connected to each other, say [Model A, Model B, Model C, Model D]. The ensemble takes an input image and passes it sequentially through this pipeline (Model A, then Model B, then Model C, then Model D). Only one of them, Model B, is a deep learning model and runs on the GPU; the other three (Model A, Model C and Model D) run on the CPU and are essentially pre-processing and post-processing models. I use:

  1. Dynamic batching: each model sees batch-size-1 inputs (e.g. (1, 3, w, h)) because each of my multiple clients sends a single image at a time.
  2. Ragged tensors: Model C produces a variable number of detections.
  3. TensorRT accelerator for the GPU model (GPU utilization from metrics: 0.16).
  4. OpenVINO accelerator for the CPU models (CPU utilization from metrics: 0.997).

The CPU and GPU utilization look fine, but the requests do not seem to be batched correctly: I get a VERY LOW FPS when inferring through Triton compared to outside Triton. For instance, if I serve only Model B (the deep learning model) in Triton and run the pre-processing and post-processing outside Triton, it performs much better (~25 FPS). But I only get ~6 FPS when the pre-processing and post-processing run inside Triton as part of the ensemble model.
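For reference, the ensemble itself is wired up in the standard way with platform: "ensemble" and an ensemble_scheduling step list. A trimmed sketch of mine (the intermediate tensor names below are illustrative, not the exact config):

name: "detection_ensemble_model"
platform: "ensemble"
max_batch_size: 64

input [
  { name: "input_image" data_type: TYPE_UINT8 dims: [ 1520, 2688, 3 ] },
  { name: "input_camera_id" data_type: TYPE_FP16 dims: [ 1 ] }
]
output [
  { name: "predictions" data_type: TYPE_FP16 dims: [ -1, 7 ] }
]

ensemble_scheduling {
  step [
    {
      model_name: "ModelA"
      model_version: -1
      input_map { key: "input0" value: "input_image" }
      output_map { key: "output0" value: "preprocessed" }
    },
    {
      model_name: "ModelB"
      model_version: -1
      input_map { key: "input0" value: "preprocessed" }
      output_map { key: "output0" value: "raw_detections" }
    },
    {
      model_name: "ModelC"
      model_version: -1
      input_map { key: "input0" value: "raw_detections" }
      output_map { key: "output0" value: "filtered_detections" }
    },
    {
      model_name: "ModelD"
      model_version: -1
      input_map { key: "input0" value: "filtered_detections" }
      input_map { key: "input1" value: "input_camera_id" }
      output_map { key: "input2" value: "predictions" }
    }
  ]
}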

Triton Information

What version of Triton are you using?

nvcr.io/nvidia/tritonserver:24.04-py3

Are you using the Triton container or did you build it yourself?

Triton container

To Reproduce

  1. Model A Config.pbtxt:
name: "ModelA"
backend: "python"
max_batch_size: 64

input [
{
    name: "input0"
    data_type: TYPE_UINT8
    dims: [ 1520, 2688, 3 ]
}
]

output [
{
    name: "output0"
    data_type: TYPE_FP16
    dims: [ 3, 768, 1280 ]
}
]

dynamic_batching {
}

instance_group [
    {
        count: 8
        kind: KIND_CPU
    }
]

optimization {
  execution_accelerators {
    cpu_execution_accelerator : [{
      name : "openvino"
    }]
  }
}
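The model.py behind Model A is essentially a resize/normalize loop over the requests Triton passes to execute(). A simplified sketch of that logic (not the exact code; the real preprocessing is more involved):

# model.py for ModelA -- simplified sketch, not the exact implementation
import cv2
import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        # With dynamic batching, Triton may hand several requests to one
        # execute() call, and each request may itself carry batch_size > 1,
        # so nothing here assumes a single image.
        responses = []
        for request in requests:
            images = pb_utils.get_input_tensor_by_name(request, "input0").as_numpy()
            # images: (batch, 1520, 2688, 3) uint8
            batch = []
            for img in images:
                resized = cv2.resize(img, (1280, 768))              # -> (768, 1280, 3)
                chw = (resized.transpose(2, 0, 1) / 255.0).astype(np.float16)
                batch.append(chw)                                   # illustrative normalization
            out = pb_utils.Tensor("output0", np.stack(batch))       # (batch, 3, 768, 1280) fp16
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses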
  2. Model B Config.pbtxt:
name: "ModelB"
backend: "tensorrt"
max_batch_size: 64

input [
  {
    name: "input0"
    data_type: TYPE_FP16
    dims:  [3 , 768, 1280]
  }]
output [
  {
    name: "output0"
    data_type: TYPE_FP16
    dims: [61200, 8]
  }
]

dynamic_batching {
}

instance_group [
    {
      count: 2
      kind: KIND_GPU
    }
]

optimization {
  execution_accelerators {
    gpu_execution_accelerator : [ {
      name : "tensorrt"
      parameters { key: "precision_mode" value: "FP16" }
      parameters { key: "max_workspace_size_bytes" value: "4294967296" }
      parameters { key: "trt_engine_cache_enable" value: "1" }
    }]
  }
}
  3. Model C Config.pbtxt:
name: "ModelC"
backend: "python"
max_batch_size: 64

input [
{
    name: "input0"
    data_type: TYPE_FP16
    dims: [61200, 8]
}
]

output [
{
    name: "output0"
    data_type: TYPE_FP16
    dims: [ -1, 6 ]
}
]

dynamic_batching {
}

instance_group [
    {
        count: 8
        kind: KIND_CPU
    }
]

optimization {
  execution_accelerators {
    cpu_execution_accelerator : [{
      name : "openvino"
    }]
  }
}
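Model C is the post-processing step: it thresholds/filters the (61200, 8) raw detections and returns a variable number of rows per image, which is why the output dims are [ -1, 6 ]. A simplified sketch (in my setup each request carries a single image):

# model.py for ModelC -- simplified sketch of the post-processing logic
import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for request in requests:
            raw = pb_utils.get_input_tensor_by_name(request, "input0").as_numpy()
            preds = raw[0]                # (61200, 8) for a batch-1 request
            keep = preds[:, 4] > 0.25     # confidence filter (column index and threshold are illustrative)
            dets = preds[keep][:, :6].astype(np.float16)[None, ...]   # (1, N, 6), N varies per image
            out = pb_utils.Tensor("output0", dets)
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses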
  4. Model D Config.pbtxt:
name: "ModelD"
backend: "python"
max_batch_size: 64

input [
{
    name: "input0"
    data_type: TYPE_FP16
    dims: [ -1, 6 ]
    allow_ragged_batch: true
}
]

batch_input [
  {
    kind: BATCH_ACCUMULATED_ELEMENT_COUNT
    target_name: "INDEX"
    data_type: TYPE_FP32
    source_input: "detection_bytetracker_input"
  }
]

input [
{
    name: "input1"
    data_type: TYPE_FP16
    dims: [ 1 ]
}
]

output [
{
    name: "input2"
    data_type: TYPE_FP16
    dims: [ -1, 7 ]
}
]

dynamic_batching {
}

instance_group [
    {
        count: 8
        kind: KIND_CPU
    }
]

optimization {
  execution_accelerators {
    cpu_execution_accelerator : [{
      name : "openvino"
    }]
  }
}
  5. Inference from the clients uses cudashm (these are code snippets, not the entire code):
self.triton_client = grpcclient.InferenceServerClient(url=self.triton_server_ip, verbose=self.verbose)
self.input = [grpcclient.InferInput("input_image", (1, self.input_camera_height, self.input_camera_width, 3), "UINT8"),
              grpcclient.InferInput("input_camera_id", (1, 1), "FP16")]
self.output = grpcclient.InferRequestedOutput("predictions")

...

# Create Output in Shared Memory and store shared memory handles
self.shm_op_handle = cudashm.create_shared_memory_region(f"output_data_{self.camera_id}",
                                                         self.output_byte_size, 0)
self.shm_ip_image_handle = cudashm.create_shared_memory_region(f"input_data_image_{self.camera_id}",
                                                         self.input_image_byte_size, 0)
self.shm_ip_camera_handle = cudashm.create_shared_memory_region(f"input_data_camera_{self.camera_id}",
                                                         self.input_camera_byte_size, 0)
# Register Output shared memory with Triton Server
self.triton_client.register_cuda_shared_memory(f"output_data_{self.camera_id}",
                                               cudashm.get_raw_handle(self.shm_op_handle), 0,
                                               self.output_byte_size)
self.triton_client.register_cuda_shared_memory(f"input_data_image_{self.camera_id}",
                                               cudashm.get_raw_handle(self.shm_ip_image_handle),
                                               0, self.input_image_byte_size)
self.triton_client.register_cuda_shared_memory(f"input_data_camera_{self.camera_id}",
                                               cudashm.get_raw_handle(self.shm_ip_camera_handle),
                                               0, self.input_camera_byte_size)

...

self.input[0].set_shared_memory(f"input_data_image_{self.camera_id}", self.input_image_byte_size)
self.input[1].set_shared_memory(f"input_data_camera_{self.camera_id}", self.input_camera_byte_size)
self.output.set_shared_memory(f"output_data_{self.camera_id}", self.output_byte_size)

...

# Set CUDA Shared memory
cudashm.set_shared_memory_region(self.shm_ip_image_handle, [frame])
cudashm.set_shared_memory_region(self.shm_ip_camera_handle, [np.expand_dims(np.array(self.camera_id), 0).astype(np.float16)])

# Inference with server
results = self.triton_client.infer(model_name="detection_ensemble_model", inputs=[self.input[0], self.input[1]],
                                   outputs=[self.output], client_timeout=self.client_timeout)

Expected behavior

Each request has batch_size=1, since each of my Triton clients sends a single image at a time. With dynamic batching, I expected that moving the pre-processing and post-processing models into Triton would make things faster: Triton should concatenate concurrent requests along the batch dimension. For example, if 10 requests arrive at the same time, each with shape (1, 3, 768, 1280), I expect Triton to batch them as (10, 3, 768, 1280) and process them all at once. Instead I get a VERY LOW FPS, and it looks like the requests are still being processed sequentially instead of being batched!
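To check whether the batching is actually happening, I look at the per-model statistics the server reports. A sketch of that check for Model B (the server URL here is a placeholder; if every execution shows batch_size 1, dynamic batching is not kicking in):

import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")

# batch_stats lists, for every batch size the server actually executed,
# how many inference executions used that batch size.
stats = client.get_inference_statistics(model_name="ModelB")
for model in stats.model_stats:
    print(model.name, model.version)
    for bs in model.batch_stats:
        print("  batch_size:", bs.batch_size,
              "executions:", bs.compute_infer.count)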

AWallyAllah · May 14 '24

opened [DLIS-6702]

statiraju · May 15 '24

@AWallyAllah By any chance are you using PyTorch in your Python model? Could you share the code for model A?

Tabrizian · May 15 '24

@AWallyAllah Could you please share the model file with us so that we can further investigate?

krishung5 · Jun 07 '24

Closing due to lack of activity. Please re-open the issue and provide the model files if you would like to follow up on this.

krishung5 · Jun 17 '24