Tritonbackend multiple instances of the same model run sequentially instead of in parallel on the same device with asynchronous requests.

Description

We'd like to implement ensemble modeling to run inference with the following workflow:

[Image: Untitled Diagram-Page-8]

We managed to run both the "construction" and "civil" models in parallel. We noticed that each model instance took only about 1GB of GPU memory, so we increased the number of instances of each model in order to increase throughput. However, although tritonbackend creates multiple instances of the same model, they run sequentially rather than in parallel, so throughput does not increase.

Triton Information

We're using NGC container 22.04.

System Information

AWS instance: g4dn.xlarge

To Reproduce

Here is an example of the results we've obtained using the "construction" model, which is an EfficientDet-based object detection model with a static batch size of 10. The configuration file we're using:

  platform: "tensorrt_plan"
  max_batch_size: 0

  input [
    {
      name: "stack_Concat__898:0"
      data_type: TYPE_FP16
      dims: [ 10,512,1280,3 ]
    }
  ]
  output [
    {
      name: "detection_boxes"
      data_type: TYPE_FP16
      dims: [ 10,100,4 ]
    },
    {
      name: "detection_classes"
      data_type: TYPE_INT32
      dims: [10,100 ]
    },    
    {
      name: "detection_scores"
      data_type: TYPE_FP16
      dims: [10,100]
    },    
    {
      name: "num_detections"
      data_type: TYPE_INT32
      dims: [10,-1]
      }
  ]
  instance_group [
    {
      count: 2
      kind: KIND_GPU
      gpus: [ 0 ]
    }
  ]

model_warmup [{
name : "image sample"
  batch_size: 0
  inputs {
    key: "stack_Concat__898:0"
    value: {
      data_type: TYPE_FP16
      dims: [10, 512, 1280, 3 ]
      random_data: true
    }
   }
}]

Command used to run the docker container :

docker run --rm -it --gpus all --name triton --ipc=host -p8000:8000 -p8001:8001 -p8002:8002 --shm-size=8gb --ulimit memlock=-1 --ulimit stack=568435456 --user root:root  -v $(pwd):/server -t triton

Command used to run triton server :

tritonserver --model-repository=triton-models --model-control-mode=explicit --load-model construction --cuda-memory-pool-byte-size 0:268435456 --pinned-memory-pool-byte-size 548843545 --http-thread-count 10

Output with instance count = 1 in construction model config file :

root@50208ced5836:/server# tritonserver --model-repository=triton-models --model-control-mode=explicit --load-model construction --cuda-memory-pool-byte-size 0:268435456 --pinned-memory-pool-byte-size 548843545 --http-thread-count 10
Warning: LBR backtrace method is not supported on this platform. DWARF backtrace method will be used.
I0517 12:35:10.705208 574 libtorch.cc:1381] TRITONBACKEND_Initialize: pytorch
I0517 12:35:10.705272 574 libtorch.cc:1391] Triton TRITONBACKEND API version: 1.9
I0517 12:35:10.705283 574 libtorch.cc:1397] 'pytorch' TRITONBACKEND API version: 1.9
2022-05-17 12:35:10.887381: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
I0517 12:35:10.928375 574 tensorflow.cc:2181] TRITONBACKEND_Initialize: tensorflow
I0517 12:35:10.928427 574 tensorflow.cc:2191] Triton TRITONBACKEND API version: 1.9
I0517 12:35:10.928436 574 tensorflow.cc:2197] 'tensorflow' TRITONBACKEND API version: 1.9
I0517 12:35:10.928444 574 tensorflow.cc:2221] backend configuration:
{}
I0517 12:35:10.930059 574 onnxruntime.cc:2400] TRITONBACKEND_Initialize: onnxruntime
I0517 12:35:10.930085 574 onnxruntime.cc:2410] Triton TRITONBACKEND API version: 1.9
I0517 12:35:10.930095 574 onnxruntime.cc:2416] 'onnxruntime' TRITONBACKEND API version: 1.9
I0517 12:35:10.930102 574 onnxruntime.cc:2446] backend configuration:
{}
I0517 12:35:10.948213 574 openvino.cc:1207] TRITONBACKEND_Initialize: openvino
I0517 12:35:10.948244 574 openvino.cc:1217] Triton TRITONBACKEND API version: 1.9
I0517 12:35:10.948254 574 openvino.cc:1223] 'openvino' TRITONBACKEND API version: 1.9
I0517 12:35:12.997021 574 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7efb62000000' with size 548843545
I0517 12:35:12.997448 574 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 268435456
I0517 12:35:12.999575 574 model_repository_manager.cc:1077] loading: construction:1
I0517 12:35:13.101033 574 tensorrt.cc:5294] TRITONBACKEND_Initialize: tensorrt
I0517 12:35:13.101069 574 tensorrt.cc:5304] Triton TRITONBACKEND API version: 1.9
I0517 12:35:13.101079 574 tensorrt.cc:5310] 'tensorrt' TRITONBACKEND API version: 1.9
I0517 12:35:13.101155 574 tensorrt.cc:5353] backend configuration:
{}
I0517 12:35:13.101182 574 tensorrt.cc:5405] TRITONBACKEND_ModelInitialize: construction (version 1)
I0517 12:35:13.102603 574 tensorrt.cc:5454] TRITONBACKEND_ModelInstanceInitialize: construction_0 (GPU device 0)
I0517 12:35:13.625109 574 logging.cc:49] [MemUsageChange] Init CUDA: CPU +471, GPU +0, now: CPU 2156, GPU 958 (MiB)
I0517 12:35:13.758068 574 logging.cc:49] Loaded engine size: 89 MiB
I0517 12:35:14.731365 574 logging.cc:49] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +786, GPU +224, now: CPU 3156, GPU 1272 (MiB)
I0517 12:35:14.930925 574 logging.cc:49] [MemUsageChange] Init cuDNN: CPU +162, GPU +52, now: CPU 3318, GPU 1324 (MiB)
I0517 12:35:14.933229 574 logging.cc:49] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +83, now: CPU 0, GPU 83 (MiB)
I0517 12:35:14.939696 574 logging.cc:49] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +10, now: CPU 3140, GPU 1316 (MiB)
I0517 12:35:14.941363 574 logging.cc:49] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 3140, GPU 1324 (MiB)
I0517 12:35:14.982711 574 logging.cc:49] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +1, GPU +856, now: CPU 1, GPU 939 (MiB)
I0517 12:35:14.983374 574 tensorrt.cc:1411] Created instance construction_0 on GPU 0 with stream priority 0 and optimization profile default[0];
I0517 12:35:14.984087 574 model_repository_manager.cc:1231] successfully loaded 'construction' version 1
I0517 12:35:14.984262 574 server.cc:549] 
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I0517 12:35:14.984423 574 server.cc:576] 
+-------------+-------------------------------------------------------------------------+--------+
| Backend     | Path                                                                    | Config |
+-------------+-------------------------------------------------------------------------+--------+
| pytorch     | /opt/tritonserver/backends/pytorch/libtriton_pytorch.so                 | {}     |
| tensorflow  | /opt/tritonserver/backends/tensorflow1/libtriton_tensorflow1.so         | {}     |
| onnxruntime | /opt/tritonserver/backends/onnxruntime/libtriton_onnxruntime.so         | {}     |
| openvino    | /opt/tritonserver/backends/openvino_2021_4/libtriton_openvino_2021_4.so | {}     |
| tensorrt    | /opt/tritonserver/backends/tensorrt/libtriton_tensorrt.so               | {}     |
+-------------+-------------------------------------------------------------------------+--------+

I0517 12:35:14.984474 574 server.cc:619] 
+--------------+---------+--------+
| Model        | Version | Status |
+--------------+---------+--------+
| construction | 1       | READY  |
+--------------+---------+--------+

I0517 12:35:14.996284 574 metrics.cc:650] Collecting metrics for GPU 0: Tesla T4
I0517 12:35:14.996586 574 tritonserver.cc:2123] 
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option                           | Value                                                                                                                                                                                        |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id                        | triton                                                                                                                                                                                       |
| server_version                   | 2.21.0                                                                                                                                                                                       |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data statistics trace |
| model_repository_path[0]         | triton-models                                                                                                                                                                                |
| model_control_mode               | MODE_EXPLICIT                                                                                                                                                                                |
| startup_models_0                 | construction                                                                                                                                                                                 |
| strict_model_config              | 1                                                                                                                                                                                            |
| rate_limit                       | OFF                                                                                                                                                                                          |
| pinned_memory_pool_byte_size     | 548843545                                                                                                                                                                                    |
| cuda_memory_pool_byte_size{0}    | 268435456                                                                                                                                                                                    |
| response_cache_byte_size         | 0                                                                                                                                                                                            |
| min_supported_compute_capability | 6.0                                                                                                                                                                                          |
| strict_readiness                 | 1                                                                                                                                                                                            |
| exit_timeout                     | 30                                                                                                                                                                                           |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I0517 12:35:14.999922 574 grpc_server.cc:4544] Started GRPCInferenceService at 0.0.0.0:8001
I0517 12:35:15.000238 574 http_server.cc:3242] Started HTTPService at 0.0.0.0:8000
I0517 12:35:15.067631 574 http_server.cc:180] Started Metrics Service at 0.0.0.0:8002

Output with instance count = 2 in construction model config file :

root@50208ced5836:/server# tritonserver --model-repository=triton-models --model-control-mode=explicit --load-model construction --cuda-memory-pool-byte-size 0:268435456 --pinned-memory-pool-byte-size 548843545 --http-thread-count 10
Warning: LBR backtrace method is not supported on this platform. DWARF backtrace method will be used.
I0517 12:55:04.150003 776 libtorch.cc:1381] TRITONBACKEND_Initialize: pytorch
I0517 12:55:04.150060 776 libtorch.cc:1391] Triton TRITONBACKEND API version: 1.9
I0517 12:55:04.150067 776 libtorch.cc:1397] 'pytorch' TRITONBACKEND API version: 1.9
2022-05-17 12:55:04.331099: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
I0517 12:55:04.371862 776 tensorflow.cc:2181] TRITONBACKEND_Initialize: tensorflow
I0517 12:55:04.371894 776 tensorflow.cc:2191] Triton TRITONBACKEND API version: 1.9
I0517 12:55:04.371901 776 tensorflow.cc:2197] 'tensorflow' TRITONBACKEND API version: 1.9
I0517 12:55:04.371906 776 tensorflow.cc:2221] backend configuration:
{}
I0517 12:55:04.373518 776 onnxruntime.cc:2400] TRITONBACKEND_Initialize: onnxruntime
I0517 12:55:04.373542 776 onnxruntime.cc:2410] Triton TRITONBACKEND API version: 1.9
I0517 12:55:04.373549 776 onnxruntime.cc:2416] 'onnxruntime' TRITONBACKEND API version: 1.9
I0517 12:55:04.373554 776 onnxruntime.cc:2446] backend configuration:
{}
I0517 12:55:04.390901 776 openvino.cc:1207] TRITONBACKEND_Initialize: openvino
I0517 12:55:04.390923 776 openvino.cc:1217] Triton TRITONBACKEND API version: 1.9
I0517 12:55:04.390929 776 openvino.cc:1223] 'openvino' TRITONBACKEND API version: 1.9
I0517 12:55:06.398005 776 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7f5556000000' with size 548843545
I0517 12:55:06.398473 776 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 268435456
I0517 12:55:06.400672 776 model_repository_manager.cc:1077] loading: construction:1
I0517 12:55:06.502072 776 tensorrt.cc:5294] TRITONBACKEND_Initialize: tensorrt
I0517 12:55:06.502103 776 tensorrt.cc:5304] Triton TRITONBACKEND API version: 1.9
I0517 12:55:06.502114 776 tensorrt.cc:5310] 'tensorrt' TRITONBACKEND API version: 1.9
I0517 12:55:06.502196 776 tensorrt.cc:5353] backend configuration:
{}
I0517 12:55:06.502227 776 tensorrt.cc:5405] TRITONBACKEND_ModelInitialize: construction (version 1)
I0517 12:55:06.503616 776 tensorrt.cc:5454] TRITONBACKEND_ModelInstanceInitialize: construction_0_0 (GPU device 0)
I0517 12:55:07.031461 776 logging.cc:49] [MemUsageChange] Init CUDA: CPU +471, GPU +0, now: CPU 2156, GPU 958 (MiB)
I0517 12:55:07.165348 776 logging.cc:49] Loaded engine size: 89 MiB
I0517 12:55:08.127716 776 logging.cc:49] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +786, GPU +224, now: CPU 3156, GPU 1272 (MiB)
I0517 12:55:08.318249 776 logging.cc:49] [MemUsageChange] Init cuDNN: CPU +162, GPU +52, now: CPU 3318, GPU 1324 (MiB)
I0517 12:55:08.320482 776 logging.cc:49] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +83, now: CPU 0, GPU 83 (MiB)
I0517 12:55:08.326906 776 logging.cc:49] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +10, now: CPU 3140, GPU 1316 (MiB)
I0517 12:55:08.328530 776 logging.cc:49] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 3140, GPU 1324 (MiB)
I0517 12:55:08.373881 776 logging.cc:49] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +1, GPU +856, now: CPU 1, GPU 939 (MiB)
I0517 12:55:08.374452 776 tensorrt.cc:1411] Created instance construction_0_0 on GPU 0 with stream priority 0 and optimization profile default[0];
I0517 12:55:08.375143 776 tensorrt.cc:5454] TRITONBACKEND_ModelInstanceInitialize: construction_0_1 (GPU device 0)
I0517 12:55:08.376915 776 logging.cc:49] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 3149, GPU 2230 (MiB)
I0517 12:55:08.377903 776 logging.cc:49] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 3149, GPU 2240 (MiB)
I0517 12:55:08.386167 776 logging.cc:49] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +857, now: CPU 1, GPU 1796 (MiB)
I0517 12:55:08.386634 776 tensorrt.cc:1411] Created instance construction_0_1 on GPU 0 with stream priority 0 and optimization profile default[0];
I0517 12:55:08.386809 776 model_repository_manager.cc:1231] successfully loaded 'construction' version 1
I0517 12:55:08.387147 776 server.cc:549] 
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I0517 12:55:08.387221 776 server.cc:576] 
+-------------+-------------------------------------------------------------------------+--------+
| Backend     | Path                                                                    | Config |
+-------------+-------------------------------------------------------------------------+--------+
| pytorch     | /opt/tritonserver/backends/pytorch/libtriton_pytorch.so                 | {}     |
| tensorflow  | /opt/tritonserver/backends/tensorflow1/libtriton_tensorflow1.so         | {}     |
| onnxruntime | /opt/tritonserver/backends/onnxruntime/libtriton_onnxruntime.so         | {}     |
| openvino    | /opt/tritonserver/backends/openvino_2021_4/libtriton_openvino_2021_4.so | {}     |
| tensorrt    | /opt/tritonserver/backends/tensorrt/libtriton_tensorrt.so               | {}     |
+-------------+-------------------------------------------------------------------------+--------+

I0517 12:55:08.387266 776 server.cc:619] 
+--------------+---------+--------+
| Model        | Version | Status |
+--------------+---------+--------+
| construction | 1       | READY  |
+--------------+---------+--------+

I0517 12:55:08.399238 776 metrics.cc:650] Collecting metrics for GPU 0: Tesla T4
I0517 12:55:08.399578 776 tritonserver.cc:2123] 
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option                           | Value                                                                                                                                                                                        |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id                        | triton                                                                                                                                                                                       |
| server_version                   | 2.21.0                                                                                                                                                                                       |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data statistics trace |
| model_repository_path[0]         | triton-models                                                                                                                                                                                |
| model_control_mode               | MODE_EXPLICIT                                                                                                                                                                                |
| startup_models_0                 | construction                                                                                                                                                                                 |
| strict_model_config              | 1                                                                                                                                                                                            |
| rate_limit                       | OFF                                                                                                                                                                                          |
| pinned_memory_pool_byte_size     | 548843545                                                                                                                                                                                    |
| cuda_memory_pool_byte_size{0}    | 268435456                                                                                                                                                                                    |
| response_cache_byte_size         | 0                                                                                                                                                                                            |
| min_supported_compute_capability | 6.0                                                                                                                                                                                          |
| strict_readiness                 | 1                                                                                                                                                                                            |
| exit_timeout                     | 30                                                                                                                                                                                           |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I0517 12:55:08.403199 776 grpc_server.cc:4544] Started GRPCInferenceService at 0.0.0.0:8001
I0517 12:55:08.403457 776 http_server.cc:3242] Started HTTPService at 0.0.0.0:8000
I0517 12:55:08.467533 776 http_server.cc:180] Started Metrics Service at 0.0.0.0:8002

Steps to reproduce the behavior.

Script to send inference requests :

import logging
import time

import numpy as np
import tritonclient.http as httpclient

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("TritonClient")
log.setLevel(logging.INFO)

number_of_requests = 10    # number of rounds to run
requests_concurrency = 10  # asynchronous requests submitted per round

triton_client = httpclient.InferenceServerClient(
    url="localhost:8000", verbose=False, concurrency=requests_concurrency
)

model_name = "construction"
model_metadata = triton_client.get_model_metadata(model_name)

input_names = [str(i["name"]) for i in model_metadata["inputs"]]
output_names = [str(o["name"]) for o in model_metadata["outputs"]]

final_results = []

# Dummy input matching the model's static input shape [10, 512, 1280, 3]
image_data = np.ones([10, 512, 1280, 3], np.float16)
inputs = [httpclient.InferInput(input_names[0], image_data.shape, "FP16")]
inputs[0].set_data_from_numpy(image_data)
outputs = [httpclient.InferRequestedOutput(name) for name in output_names]

beginning = time.time()
for j in range(number_of_requests):
    start = time.time()
    # Submit one round of requests without waiting for any of them
    async_requests = [
        triton_client.async_infer(model_name=model_name, inputs=inputs, outputs=outputs)
        for _ in range(requests_concurrency)
    ]
    # Block until every request of this round has completed
    for async_request in async_requests:
        final_results.append(async_request.get_result())
    log.info("Time taken to run {} asynchronous inference requests is {} ".format(requests_concurrency, time.time() - start))

log.info("Total time to run {} inference requests is {}".format(requests_concurrency * number_of_requests, time.time() - beginning))

Output with instance count = 1 in construction model config file :

INFO:TritonClient:Time taken to run 10 asynchronous inference requests is 3.992401599884033 
INFO:TritonClient:Time taken to run 10 asynchronous inference requests is 3.018524408340454 
INFO:TritonClient:Time taken to run 10 asynchronous inference requests is 3.034783124923706 
INFO:TritonClient:Time taken to run 10 asynchronous inference requests is 3.047888994216919 
INFO:TritonClient:Time taken to run 10 asynchronous inference requests is 3.0408923625946045 
INFO:TritonClient:Time taken to run 10 asynchronous inference requests is 3.076260805130005 
INFO:TritonClient:Time taken to run 10 asynchronous inference requests is 3.1126108169555664 
INFO:TritonClient:Time taken to run 10 asynchronous inference requests is 3.134840965270996 
INFO:TritonClient:Time taken to run 10 asynchronous inference requests is 3.1259348392486572 
INFO:TritonClient:Time taken to run 10 asynchronous inference requests is 3.1820130348205566 
INFO:TritonClient:Total time to run 100 inference requests is 31.767416954040527

Output with instance count = 2 in construction model config file :

INFO:TritonClient:Time taken to run 10 asynchronous inference requests is 3.9438741207122803 
INFO:TritonClient:Time taken to run 10 asynchronous inference requests is 3.020892381668091 
INFO:TritonClient:Time taken to run 10 asynchronous inference requests is 3.042755603790283 
INFO:TritonClient:Time taken to run 10 asynchronous inference requests is 3.0679333209991455 
INFO:TritonClient:Time taken to run 10 asynchronous inference requests is 3.0309293270111084 
INFO:TritonClient:Time taken to run 10 asynchronous inference requests is 3.0512776374816895 
INFO:TritonClient:Time taken to run 10 asynchronous inference requests is 3.068094491958618 
INFO:TritonClient:Time taken to run 10 asynchronous inference requests is 3.0799334049224854 
INFO:TritonClient:Time taken to run 10 asynchronous inference requests is 3.097849130630493 
INFO:TritonClient:Time taken to run 10 asynchronous inference requests is 3.078545570373535 
INFO:TritonClient:Total time to run 100 inference requests is 31.483280658721924

Expected behavior

We expected the throughput to double when increasing the instance count of the model to 2, since the GPU utilization is not saturated. However, there was no change in throughput.

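| Number of model instances | GPU Utilization     | Average time to run 10 asynchronous inference requests |
|---------------------------|---------------------|--------------------------------------------------------|
| 1                         | 2440MiB / 15109MiB  | 3.148s                                                 |
| 2                         | 3356MiB / 15109MiB  | 3.176s                                                 |
| 4                         | 5188MiB / 15109MiB  | 3.182s                                                 |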
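
As a rough cross-check of the numbers above, the throughput implied by the two logged totals can be computed directly. This is only a sketch: it assumes each of the 100 requests carries the static batch of 10 images, and the names `throughput` and `images_per_second` are illustrative, not part of any Triton API.

```python
# Total wall-clock time for 100 requests, copied from the client logs above,
# keyed by the instance count used in the model configuration.
total_seconds = {1: 31.767416954040527, 2: 31.483280658721924}

TOTAL_REQUESTS = 100  # 10 rounds x 10 asynchronous requests
BATCH_SIZE = 10       # assumption: every request carries the static batch of 10 images


def throughput(total_requests: int, batch_size: int, seconds: float) -> float:
    """Return images processed per second over the whole run."""
    return total_requests * batch_size / seconds


images_per_second = {
    count: throughput(TOTAL_REQUESTS, BATCH_SIZE, seconds)
    for count, seconds in total_seconds.items()
}
speedup = images_per_second[2] / images_per_second[1]

for count, value in images_per_second.items():
    print(f"instance count = {count}: {value:.2f} images/s")
print(f"speedup from 1 to 2 instances: {speedup:.3f}x")
```

Under these assumptions the second instance changes throughput by roughly 1%, which is within run-to-run noise and nowhere near the expected 2x.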

May 17 '22 13:05 Abdellah-Laassairi

Hi @Abdellah-Laassairi, I think you need to specify instance_group field in each composing model to make it support parallel execution individually. Please refer to the Ensemble Model Instance Groups section.

May 18 '22 19:05 krishung5

Hello! We did specify the instance_group field for each composing model; however, that didn't change anything. In the example I've provided, I've only loaded one model, and changing the instance_group count of that model changed neither latency nor throughput.

May 19 '22 08:05 Abdellah-Laassairi

Any updates?

May 20 '22 08:05 Abdellah-Laassairi

Hi, could you monitor the GPU utilization while running both cases to see if the GPU is already fully utilized with the first instance?

May 20 '22 17:05 krishung5

Hello! GPU utilization was at 2.4GB while running case 1 (1 model instance) and at 3.3GB in case 2 (2 model instances). The available memory for a g4dn instance is about 16GB, so it's far from being fully utilized.

May 22 '22 21:05 Abdellah-Laassairi

GPU utilization refers to whether the GPU is computing at full capacity. If you use nvidia-smi to monitor GPU status, you will notice that GPU utilization is a different field (the 2% one below) from GPU memory usage:

$ nvidia-smi
Mon May 23 10:16:44 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA TITAN V      On   | 00000000:65:00.0  On |                  N/A |
| 29%   43C    P2    32W / 250W |    680MiB / 12288MiB |      2%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
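
To read the two fields separately without parsing the table by eye, a minimal sketch along the following lines could be used. It relies on nvidia-smi's `--query-gpu` and `--format=csv,noheader,nounits` options; the names `GpuStatus`, `parse_gpu_status` and `query_gpu_status` are hypothetical helpers introduced here for illustration.

```python
import subprocess
from typing import List, NamedTuple


class GpuStatus(NamedTuple):
    utilization_pct: int   # compute utilization ("GPU-Util" column)
    memory_used_mib: int   # memory in use ("Memory-Usage" column, left value)
    memory_total_mib: int  # total memory ("Memory-Usage" column, right value)


# Ask nvidia-smi for exactly the three fields, one CSV line per GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_status(line: str) -> GpuStatus:
    """Parse one CSV line such as '100, 2240, 15109'."""
    util, used, total = (int(field.strip()) for field in line.split(","))
    return GpuStatus(util, used, total)


def query_gpu_status() -> List[GpuStatus]:
    """Run nvidia-smi and return one GpuStatus per GPU (requires an NVIDIA driver)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return [parse_gpu_status(line) for line in result.stdout.splitlines() if line.strip()]


# Sample line built from values reported later in this thread for a Tesla T4.
sample = parse_gpu_status("100, 2240, 15109")
memory_fraction = sample.memory_used_mib / sample.memory_total_mib
print(f"compute utilization: {sample.utilization_pct}%")
print(f"memory in use: {memory_fraction:.0%} of {sample.memory_total_mib} MiB")
```

Calling `query_gpu_status()` in a loop while the client script runs would show whether compute utilization, rather than memory, is the saturated resource.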

May 23 '22 17:05 GuanLuo

Hello, thanks for your answer, here's the output for both cases :

instance count = 1

Tue May 24 08:23:03 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.05    Driver Version: 450.51.05    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   48C    P0    35W /  70W |   2240MiB / 15109MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

instance count = 2

Tue May 24 08:28:54 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.05    Driver Version: 450.51.05    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   47C    P0    69W /  70W |   3356MiB / 15109MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

May 24 '22 08:05 Abdellah-Laassairi

It looks like your model is so complex (or the GPU has so few compute resources) that one model instance already fully loads the compute resources of your GPU. On a bigger GPU with enough compute resources, two (or more) models could run in parallel, but in your case the models have to be executed sequentially. Note that besides nvidia-smi you can use the NVIDIA Nsight Systems tool (nsys) to get fine-grained info on how the two model inferences load your GPU.
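
This reasoning can be illustrated with a toy model. The sketch below is not how Triton or CUDA schedule work; it simply assumes that each inference needs a fixed fraction of the GPU's compute, so that extra instances help only until the GPU is full. The function `round_time` and the 0.3 s single-request latency (about 3 s per 10 requests in the logs above) are assumptions made for illustration.

```python
def round_time(n_requests: int, instances: int,
               single_latency_s: float, compute_fraction: float) -> float:
    """Estimate the time to finish n_requests under a toy compute-sharing model.

    compute_fraction is the share of the GPU that one inference occupies
    (1.0 means a single inference already saturates the GPU).
    """
    if instances < 1:
        raise ValueError("instances must be at least 1")
    if not 0.0 < compute_fraction <= 1.0:
        raise ValueError("compute_fraction must be in (0, 1]")
    # Instances overlap only while their combined demand fits on the GPU.
    effective_parallelism = min(instances, 1.0 / compute_fraction)
    return n_requests * single_latency_s / effective_parallelism


# Saturated GPU: adding instances does not shorten a round of 10 requests.
for count in (1, 2, 4):
    print(count, "instance(s), saturated:", round_time(10, count, 0.3, 1.0))

# Hypothetical lighter model using a quarter of the GPU: instances help up to 4.
for count in (1, 2, 4, 8):
    print(count, "instance(s), light:", round_time(10, count, 0.3, 0.25))
```

With `compute_fraction = 1.0` the estimate stays flat as instances are added, which matches the flat timings reported for 1, 2 and 4 instances.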

May 24 '22 09:05 marchaarsma

Hello! I was already using nsys to trace CUDA and cuDNN. The output in the timeline when instance count = 1 is as I expect it to be. However, it's a bit difficult to get a good understanding of how the two instances of the same model behave in nsys, since both of them are executed on the same thread. In the meantime, I'll try a simpler model to verify whether the GPU compute resources are the bottleneck in this case.

May 24 '22 09:05 Abdellah-Laassairi

Closing issue due to lack of activity. Please re-open the issue if you would like to follow up.

Nov 22 '22 03:11 jbkyang-nvi

I am having the same issue, but with models running on CPU. Instead of running in parallel, the models seem to execute in sequence.

Jan 02 '23 21:01 manastahir

@Abdellah-Laassairi @marchaarsma You can refer to this code: auto execution_policy = TRITONBACKEND_EXECUTION_DEVICE_BLOCKING; My understanding is that the multi-instance tensorrt backend is set to blocking mode on a single device, and that parallel execution can be achieved across different devices.

Jan 03 '23 01:01 zhaohb
